.. Licensed to the Apache Software Foundation (ASF) under one
   or more contributor license agreements. See the NOTICE file
   distributed with this work for additional information
   regarding copyright ownership. The ASF licenses this file
   to you under the Apache License, Version 2.0 (the
   "License"); you may not use this file except in compliance
   with the License. You may obtain a copy of the License at

..   http://www.apache.org/licenses/LICENSE-2.0

.. Unless required by applicable law or agreed to in writing,
   software distributed under the License is distributed on an
   "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
   KIND, either express or implied. See the License for the
   specific language governing permissions and limitations
   under the License.

DAG Serialization
=================

In order to make the Airflow Webserver stateless (almost!), Airflow >=1.10.7 supports
DAG Serialization and DB Persistence.

.. image:: img/dag_serialization.png

Without DAG Serialization and persistence in the DB, the Webserver and the Scheduler both
need access to the DAG files, and both parse them.

With **DAG Serialization** we aim to decouple the Webserver from DAG parsing,
which makes the Webserver very lightweight.

As shown in the image above, when using this feature,
the Scheduler parses the DAG files, serializes them in JSON format and saves them in the Metadata DB.
Instead of parsing the DAG files again, the Webserver reads the
serialized DAGs in JSON, de-serializes them, creates the DagBag and uses it
to render the UI.
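
The round trip described above can be illustrated with a minimal, self-contained sketch.
It is not Airflow's implementation: the dictionary layout, the helper names
``serialize_dag`` and ``load_dag``, and the in-memory SQLite table are assumptions chosen
only to show one process writing JSON to a database and another reading it back.

.. code-block:: python

    import json
    import sqlite3

    # Hypothetical, simplified stand-in for a parsed DAG (not Airflow's real schema).
    parsed_dag = {
        "dag_id": "example_dag",
        "tasks": [
            {"task_id": "extract", "downstream": ["load"]},
            {"task_id": "load", "downstream": []},
        ],
    }


    def serialize_dag(conn, dag):
        """'Scheduler' side: store the DAG as a JSON string keyed by dag_id."""
        conn.execute(
            "INSERT OR REPLACE INTO serialized_dag (dag_id, data) VALUES (?, ?)",
            (dag["dag_id"], json.dumps(dag)),
        )
        conn.commit()


    def load_dag(conn, dag_id):
        """'Webserver' side: read the JSON back without touching any DAG file."""
        row = conn.execute(
            "SELECT data FROM serialized_dag WHERE dag_id = ?", (dag_id,)
        ).fetchone()
        return json.loads(row[0]) if row else None


    # An in-memory database stands in for the Metadata DB in this sketch.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE serialized_dag (dag_id TEXT PRIMARY KEY, data TEXT)")
    serialize_dag(conn, parsed_dag)
    restored = load_dag(conn, "example_dag")

In this toy version the reader never imports or executes the DAG's Python file; it only
needs the database connection, which is the property the feature is after.
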

One of the key features implemented as part of DAG Serialization is that,
instead of loading an entire DagBag when the Webserver starts, we only load each DAG on demand from the
Serialized DAG table. This helps reduce Webserver startup time and memory usage. The reduction is most
noticeable when you have a large number of DAGs.
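
On-demand loading can be sketched as a small cache in front of the serialized store.
The ``LazyDagBag`` class below is hypothetical and much simpler than Airflow's DagBag;
a plain dictionary of JSON strings stands in for the Serialized DAG table.

.. code-block:: python

    import json


    class LazyDagBag:
        """Toy DagBag that deserializes a DAG only when it is first requested."""

        def __init__(self, store):
            # ``store`` maps dag_id -> JSON string (assumed stand-in for the DB table).
            self._store = store
            self._cache = {}

        def get_dag(self, dag_id):
            # Deserialize on first access, then serve from the in-memory cache.
            if dag_id not in self._cache:
                raw = self._store.get(dag_id)
                if raw is None:
                    return None
                self._cache[dag_id] = json.loads(raw)
            return self._cache[dag_id]

        def loaded_dag_ids(self):
            return sorted(self._cache)


    # 1000 serialized DAGs are available, but nothing is loaded at startup.
    store = {f"dag_{i}": json.dumps({"dag_id": f"dag_{i}"}) for i in range(1000)}
    bag = LazyDagBag(store)
    dag = bag.get_dag("dag_42")

After the single lookup, only ``dag_42`` has been deserialized; the other 999 entries
cost nothing until a page requests them.
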

Below is a screenshot of the ``serialized_dag`` table in the Metadata DB:

.. image:: img/serialized_dag_table.png


Enable DAG Serialization
------------------------

Add the following settings in ``airflow.cfg``:

.. code-block:: ini

    [core]
    store_serialized_dags = True
    min_serialized_dag_update_interval = 30

*   ``store_serialized_dags``: This flag decides whether to serialize DAGs and persist them in the DB.
    If set to True, the Webserver reads from the DB instead of parsing DAG files.
*   ``min_serialized_dag_update_interval``: This flag sets the minimum interval (in seconds) between
    updates of a serialized DAG in the DB. This helps reduce the database write rate.
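
The effect of ``min_serialized_dag_update_interval`` can be pictured as a simple
time-based throttle. The function below is a sketch under that assumption; the name
``should_write`` and its signature are hypothetical and do not come from Airflow.

.. code-block:: python

    from datetime import datetime, timedelta

    MIN_UPDATE_INTERVAL = 30  # seconds, mirroring the example value above


    def should_write(last_updated, now, min_interval=MIN_UPDATE_INTERVAL):
        """Return True when the stored DAG is missing or at least min_interval old."""
        if last_updated is None:
            # Nothing stored yet, so the first write always goes through.
            return True
        return now - last_updated >= timedelta(seconds=min_interval)


    now = datetime(2020, 1, 1, 12, 0, 0)
    recent = should_write(now - timedelta(seconds=10), now)  # inside the interval
    stale = should_write(now - timedelta(seconds=45), now)   # outside the interval

With a 30-second interval, a DAG written 10 seconds ago is skipped and one written
45 seconds ago is rewritten, so repeated parsing loops do not translate into repeated
database writes.
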

If you are upgrading Airflow from a version earlier than 1.10.7, do not forget to run ``airflow db upgrade``.

Limitations
-----------

The Webserver still needs access to the DAG files in the following cases,
which is why we said "almost" stateless.

*   The **Rendered Template** tab still has to parse the Python file, as it needs details such as
    the execution date and even the data passed by upstream tasks using XCom.
*   **Code View** reads the DAG file and shows it using Pygments.
    However, it does not need to parse the Python file, so it is still a cheap operation.

Using a different JSON Library
------------------------------

To use a different JSON library, such as ``ujson``, instead of the standard ``json`` library,
define a ``json`` variable in the local Airflow settings file (``airflow_local_settings.py``) as follows:

.. code:: python

    import ujson
    json = ujson
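
If the alternative library may not be installed on every machine, a guarded import is
one possible variation. This is a sketch, not a pattern prescribed by Airflow: it
assumes that falling back to the standard library is acceptable in your deployment.

.. code-block:: python

    # Hypothetical variant of airflow_local_settings.py with a fallback.
    try:
        import ujson as json  # third-party library, used only if installed
    except ImportError:
        import json  # standard library fallback

    # Either way, a module-level ``json`` name is defined.
    round_trip = json.loads(json.dumps({"dag_id": "example_dag"}))

Both branches bind the module-level name ``json``, so the rest of the settings file
does not need to know which library was picked.
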