diff --git a/docs/installation.rst b/docs/installation.rst
index 4ce0917f2e..351b58353b 100644
--- a/docs/installation.rst
+++ b/docs/installation.rst
@@ -154,14 +154,31 @@ variables and connections.
 
 Scaling Out with Celery
 '''''''''''''''''''''''
-CeleryExecutor is the way you can scale out the number of workers. For this
+``CeleryExecutor`` is one of the ways you can scale out the number of workers. For this
 to work, you need to setup a Celery backend (**RabbitMQ**, **Redis**, ...) and
 change your ``airflow.cfg`` to point the executor parameter to
-CeleryExecutor and provide the related Celery settings.
+``CeleryExecutor`` and provide the related Celery settings.
 
 For more information about setting up a Celery broker, refer to the
 exhaustive `Celery documentation on the topic `_.
 
+Here are a few imperative requirements for your workers:
+
+- ``airflow`` needs to be installed, and the CLI needs to be in the path
+- Airflow configuration settings should be homogeneous across the cluster
+- Operators that are executed on the worker need to have their dependencies
+  met in that context. For example, if you use the ``HiveOperator``,
+  the hive CLI needs to be installed on that box, or if you use the
+  ``MySqlOperator``, the required Python library needs to be available in
+  the ``PYTHONPATH`` somehow
+- The worker needs to have access to its ``DAGS_FOLDER``, and you need to
+  synchronize the filesystems by your own means. A common setup would be to
+  store your ``DAGS_FOLDER`` in a Git repository and sync it across machines using
+  Chef, Puppet, Ansible, or whatever you use to configure machines in your
+  environment. If all your boxes have a common mount point, having your
+  pipeline files shared there should work as well
+
+
 To kick off a worker, you need to setup Airflow and kick off the worker
 subcommand
 
@@ -173,13 +190,19 @@ Your worker should start picking up tasks as soon as they get fired in
 its direction.
 
 Note that you can also run "Celery Flower", a web UI built on top of Celery,
-to monitor your workers.
+to monitor your workers. You can use the shortcut command ``airflow flower``
+to start a Flower web server.
+
 
 Logs
 ''''
-Users can specify a logs folder in ``airflow.cfg``. By default, it is in the ``AIRFLOW_HOME`` directory.
+Users can specify a logs folder in ``airflow.cfg``. By default, it is in
+the ``AIRFLOW_HOME`` directory.
 
-In addition, users can supply an S3 location for storing log backups. If logs are not found in the local filesystem (for example, if a worker is lost or reset), the S3 logs will be displayed in the Airflow UI. Note that logs are only sent to S3 once a task completes (including failure).
+In addition, users can supply an S3 location for storing log backups. If
+logs are not found in the local filesystem (for example, if a worker is
+lost or reset), the S3 logs will be displayed in the Airflow UI. Note that
+logs are only sent to S3 once a task completes (including failure).
 
 .. code-block:: bash
 
@@ -189,11 +212,11 @@ In addition, users can supply an S3 location for storing log backups. If logs ar
 
 Scaling Out on Mesos (community contributed)
 ''''''''''''''''''''''''''''''''''''''''''''
-MesosExecutor allows you to schedule airflow tasks on a Mesos cluster.
+``MesosExecutor`` allows you to schedule airflow tasks on a Mesos cluster.
 For this to work, you need a running mesos cluster and you must perform the following
 steps -
 
-1. Install airflow on a machine where webserver and scheduler will run,
+1. Install airflow on a machine where web server and scheduler will run,
    let's refer to this as the "Airflow server".
 2. On the Airflow server, install mesos python eggs from `mesos downloads `_.
 3. On the Airflow server, use a database (such as mysql) which can be accessed from mesos
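
For quick reference, here is a minimal sketch of the worker-side commands the patch above refers to, assuming the ``airflow worker`` and ``airflow flower`` subcommands of the Airflow CLI; the worker command would be run on every machine that should act as a Celery worker:

.. code-block:: bash

    # start a Celery worker process that picks up tasks fired in its direction
    airflow worker

    # start the Flower web UI to monitor the Celery workers
    airflow flower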