# Updating Airflow
This file documents any backwards-incompatible changes in Airflow and
assists people when migrating to a new version.
## Master
### New Features
#### Dask Executor
A new DaskExecutor allows Airflow tasks to be run in Dask Distributed clusters.
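A minimal sketch of what enabling it might look like in `airflow.cfg`, assuming the executor is selected in the `[core]` section and the address of a running Dask Distributed scheduler is given via a `cluster_address` option in a `[dask]` section (check the default configuration shipped with your version):
```
[core]
executor = DaskExecutor

[dask]
# Address of a running Dask Distributed scheduler (illustrative value)
cluster_address = 127.0.0.1:8786
```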
## Airflow 1.8
### Database
The database schema needs to be upgraded. Make sure to shut down Airflow and take a backup of your database. To
upgrade the schema, issue `airflow upgradedb`.
### Upgrade systemd unit files
Systemd unit files have been updated. If you use systemd please make sure to update these.
> Please note that the webserver does not detach properly; this will be fixed in a future version.
### Tasks not starting although dependencies are met due to stricter pool checking
Airflow 1.7.1 had an issue that allowed a pool to be oversubscribed, i.e. more slots could be used than were
available. This is fixed in Airflow 1.8.0, but as a consequence of that past issue, jobs may fail to start after an
upgrade even though their dependencies are met. To work around this, either temporarily increase the number of slots
above the number of queued tasks or use a new pool.
### Less forgiving scheduler on dynamic start_date
Using a dynamic start_date (e.g. `start_date = datetime.now()`) is not considered a best practice. The 1.8.0 scheduler
is less forgiving in this area. If you encounter DAGs not being scheduled you can try using a fixed start_date and
renaming your dag. The last step is required to make sure you start with a clean slate, otherwise the old schedule can
interfere.
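For illustration, a minimal DAG definition contrasting the two approaches (the `dag_id` and schedule below are hypothetical):
```
from datetime import datetime
from airflow import DAG

# Fixed start_date (recommended): the schedule is stable across parses.
dag = DAG(
    dag_id='my_dag_v2',  # renamed to start with a clean slate
    start_date=datetime(2017, 1, 1),
    schedule_interval='@daily')

# Dynamic start_date (avoid): datetime.now() changes every time the file
# is parsed, so the scheduler may never see a complete interval to run.
# dag = DAG(dag_id='my_dag', start_date=datetime.now())
```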
### New and updated scheduler options
Please read through these options; defaults have changed since 1.7.1.
#### child_process_log_directory
To increase the robustness of the scheduler, DAGs are now processed in their own processes. Therefore each
DAG has its own log file for the scheduler. These are placed in `child_process_log_directory`, which defaults to
`<AIRFLOW_HOME>/scheduler/latest`. You will need to make sure these log files are removed.
> DAG logs or processor logs ignore any command line settings for log file locations.
#### run_duration
Previously the command line option `num_runs` was used to let the scheduler terminate after a certain number of
loops. This is now time-bound and defaults to `-1`, which means run continuously. See also `num_runs`.
#### num_runs
Previously `num_runs` was used to let the scheduler terminate after a certain number of loops. Now `num_runs` specifies
the number of times to try to schedule each DAG file within `run_duration` time. It defaults to `-1`, which means try
indefinitely. This is only available on the command line.
#### min_file_process_interval
How much time should pass before an updated DAG is picked up from the filesystem.
#### dag_dir_list_interval
How often the scheduler should relist the contents of the DAG directory. If you find while developing that your new
DAGs are not being picked up, have a look at this number and decrease it if necessary.
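As a rough sketch, these scheduler options might look as follows in `airflow.cfg`, assuming they sit under the `[scheduler]` section; the values are illustrative, not the shipped defaults:
```
[scheduler]
# Directory for the per-DAG scheduler log files
child_process_log_directory = /home/airflow/scheduler/latest
# How long (in seconds) the scheduler runs before exiting; -1 means run continuously
run_duration = -1
# Minimum number of seconds before the same DAG file is processed again
min_file_process_interval = 0
# Number of seconds between re-listing the DAG directory
dag_dir_list_interval = 300
```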
#### catchup_by_default
By default the scheduler will fill any missing interval DAG Runs between the last execution date and the current date.
This setting changes that behavior to only execute the latest interval. This can also be specified per DAG as
`catchup = False / True`. Command line backfills will still work.
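As a sketch, disabling catchup for a single DAG (the `dag_id` and schedule are hypothetical):
```
from datetime import datetime
from airflow import DAG

# Only the latest interval is scheduled for this DAG, regardless of the
# global catchup_by_default setting.
dag = DAG(
    dag_id='no_catchup_dag',
    start_date=datetime(2017, 1, 1),
    schedule_interval='@daily',
    catchup=False)
```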
### Faulty DAGs do not show an error in the Web UI
Due to changes in the way Airflow processes DAGs, the Web UI does not show an error when processing a faulty DAG. To
find processing errors, go to the `child_process_log_directory`, which defaults to `<AIRFLOW_HOME>/scheduler/latest`.
### New DAGs are paused by default
Previously, new DAGs would be scheduled immediately. To retain the old behavior, add this to airflow.cfg:
```
[core]
dags_are_paused_at_creation = False
```
### Airflow context variables are passed to Hive config if conf is specified
If you specify a hive conf to the run_cli command of the HiveHook, Airflow adds some
convenience variables to the config. If you run a secure Hadoop setup, it might be
required to whitelist these variables by adding the following to your configuration:
```
<property>
  <name>hive.security.authorization.sqlstd.confwhitelist.append</name>
  <value>airflow\.ctx\..*</value>
</property>
```
### Google Cloud Operator and Hook alignment
All Google Cloud Operators and Hooks are aligned and use the same client library. Now you have a single connection
type for all kinds of Google Cloud Operators.
If you experience problems connecting with your operator, make sure you set the connection type to "Google Cloud Platform".
Also, the old P12 key file type is not supported anymore; only the new JSON key files are supported as a service
account.
### Deprecated Features
These features are marked for deprecation. They may still work (and raise a `DeprecationWarning`), but are no longer
supported and will be removed entirely in Airflow 2.0.
- Hooks and operators must be imported from their respective submodules

  `airflow.operators.PigOperator` is no longer supported; `from airflow.operators.pig_operator import PigOperator` is
  (see the import example after this list). (AIRFLOW-31, AIRFLOW-200)
- Operators no longer accept arbitrary arguments

  Previously, `Operator.__init__()` accepted any arguments (either positional `*args` or keyword `**kwargs`) without
  complaint. Now, invalid arguments will be rejected. (https://github.com/apache/incubator-airflow/pull/1285)
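For example, with the PigOperator import mentioned above:
```
# Deprecated in 1.8 (raises a DeprecationWarning, removed in Airflow 2.0):
# from airflow.operators import PigOperator

# Supported: import the operator from its submodule.
from airflow.operators.pig_operator import PigOperator
```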
### Known Issues
There is a report that the default of `-1` for `num_runs` creates an issue where errors are reported while parsing
tasks. This has not been confirmed, but a workaround was found by changing the default back to `None`.
To do this edit `cli.py`, find the following:
```
'num_runs': Arg(
    ("-n", "--num_runs"),
    default=-1, type=int,
    help="Set the number of runs to execute before exiting"),
```
and change `default=-1` to `default=None`. Please report on the mailing list if you have this issue.
## Airflow 1.7.1.2
### Changes to Configuration
#### Email configuration change
To continue using the default smtp email backend, change the email_backend line in your config file from:
```
[email]
email_backend = airflow.utils.send_email_smtp
```
to:
```
[email]
email_backend = airflow.utils.email.send_email_smtp
```
#### S3 configuration change
To continue using S3 logging, update your config file so:
```
s3_log_folder = s3://my-airflow-log-bucket/logs
```
becomes:
```
remote_base_log_folder = s3://my-airflow-log-bucket/logs
remote_log_conn_id = <your desired s3 connection>
```