There is a bug caused by the scheduler_jobs refactor which leads to task
failures and scheduler locking.
Essentially, when there is an overflow of tasks going into the scheduler, the
tasks are set back to scheduled, but are not removed from the executor's
queued_tasks queue.
This means that the executor will attempt to run tasks that are in the scheduled
state, but those tasks will fail dependency checks. Eventually the queue is
filled with scheduled tasks, and the scheduler can no longer run.
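A minimal sketch of the idea behind the fix, with hypothetical names (the helper and its arguments are illustrative, not the actual patch):

```python
from airflow.utils.state import State

def defer_overflowed_tasks(executor, task_instances, session):
    """Illustrative only: flip overflowed task instances back to SCHEDULED and also
    purge them from the executor's queued_tasks, so the executor does not later try
    to run tasks that will fail their dependency checks."""
    for ti in task_instances:
        ti.state = State.SCHEDULED
        session.merge(ti)
        # The missing step that caused the lock-up: remove the TI from the executor queue.
        executor.queued_tasks.pop(ti.key, None)
    session.commit()
```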
Co-Authored-By: Kaxil Naik <kaxilnaik@gmail.com>
Co-Authored-By: Kevin Yang <kevin.yang@airbnb.com>
The list of tests for autocomplete is now generated automatically when you enter Breeze.
It takes about 40 seconds to generate the list; until it is done there are
no autocompletions, but they appear as soon as the list is ready.
* [AIRFLOW-5631] Change way of running GCP system tests
This commit proposes a new way of running GCP related system tests.
It uses the SystemTests base class and authentication is provided by a
context manager, so it is easier to understand what is going on.
* [AIRFLOW-5580] Add base class for system test
This commit proposes a base class for running system tests in Airflow. The
main concept is to create an example DAG and run it for test purposes. This
is especially important in the case of integration with third-party services.
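A rough sketch of the pattern described by these two commits, under assumed names (the base class, context manager, key path and DAG locations are illustrative, not the code actually merged):

```python
import contextlib
import os
import unittest

class SystemTest(unittest.TestCase):
    """Sketch of the proposed base class: the test runs an example DAG end to end."""

    def run_dag(self, dag_id, dag_folder):
        from airflow.models import DagBag
        dag = DagBag(dag_folder=dag_folder, include_examples=False).get_dag(dag_id)
        dag.clear()
        dag.run(ignore_first_depends_on_past=True)

@contextlib.contextmanager
def provide_gcp_context(key_file):
    """Hypothetical authentication context manager: point GCP clients at a service
    account key for the duration of the test, then clean up."""
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = key_file
    try:
        yield
    finally:
        os.environ.pop("GOOGLE_APPLICATION_CREDENTIALS", None)

class TestGcsExampleDagsSystem(SystemTest):
    def test_run_example_dag(self):
        with provide_gcp_context("/files/gcp_key.json"):
            self.run_dag("example_gcs", "airflow/gcp/example_dags")
```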
Since we switched to using sub-processes to parse the DAG files sometime
back in 2016(!), the metrics we have been emitting about dag bag size and
parsing have been incorrect.
We have also been emitting metrics from the webserver, which is going to
become wrong as we move towards a stateless webserver.
To fix both of these issues I have stopped emitting the metrics from
models.DagBag and only emit them from inside the
DagFileProcessorManager.
(There was also a bug in the `dag.loading-duration.*` metric we were emitting
from the DagBag code where the "dag_file" part of that metric was empty.
I have fixed that even though I have now deprecated that metric. The
webserver was emitting the right metric though, so many people wouldn't have
noticed.)
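A hedged sketch of the intended emission (names and placement are approximations of the file-processor side, not copied from it), with the previously-empty file-name part of `dag.loading-duration.*` filled in:

```python
import os
import time

from airflow.models import DagBag
from airflow.stats import Stats  # on older 1.10 releases Stats lives in airflow.settings

def parse_file_and_emit_metrics(file_path):
    """Illustrative only: emit the dag bag metrics while parsing a single DAG file,
    i.e. from the file-processing side rather than from models.DagBag/the webserver."""
    start = time.time()
    dagbag = DagBag(dag_folder=file_path, include_examples=False)
    duration = time.time() - start

    dag_file = os.path.splitext(os.path.basename(file_path))[0]  # no longer empty
    Stats.gauge("dagbag_size", len(dagbag.dags))
    Stats.timing("dag.loading-duration.{}".format(dag_file), duration)
    return dagbag
```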
`non_pooled_task_slot_count` and `non_pooled_backfill_task_slot_count`
are removed in favor of a real pool, e.g. `default_pool`.
By default, tasks run in `default_pool`.
`default_pool` is initialized with 128 slots and users can change the
number of slots through the UI/CLI. `default_pool` cannot be removed.
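A small sketch of what this means for DAG authors (the DAG, task, and pool names are made up):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

with DAG("pool_example", start_date=datetime(2019, 1, 1), schedule_interval=None) as dag:
    # No pool given: the task now runs in `default_pool` (128 slots unless changed via UI/CLI).
    in_default = BashOperator(task_id="in_default_pool", bash_command="echo default")

    # Tasks can still opt into an explicit pool, which must exist before they are queued.
    in_custom = BashOperator(
        task_id="in_custom_pool",
        bash_command="echo custom",
        pool="reporting_pool",  # hypothetical pool created via the UI/CLI
    )
```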
This adds ASF license headers to all the .rst and .md files with the
exception of the Pull Request template (as that is included verbatim
when opening a Pull Request on GitHub, which would be messy).
The different UtcDateTime implementations all have issues.
Either they replace tzinfo directly without converting
or they do not convert to UTC at all.
We also ensure all MySQL connections are in UTC
in order to keep things sane, as MySQL will ignore the
timezone of a field when inserting/updating.
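A minimal sketch of the behaviour this change standardises (not the exact class Airflow ships): reject naive datetimes and genuinely convert aware ones to UTC instead of just replacing tzinfo:

```python
import datetime

from sqlalchemy.types import DateTime, TypeDecorator

class UtcDateTime(TypeDecorator):
    """Illustrative only: always store timezone-aware datetimes as UTC."""
    impl = DateTime(timezone=True)

    def process_bind_param(self, value, dialect):
        if value is None:
            return None
        if value.tzinfo is None:
            raise ValueError("naive datetime is not allowed: {!r}".format(value))
        # astimezone() converts; value.replace(tzinfo=utc) would silently relabel it.
        return value.astimezone(datetime.timezone.utc)
```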
When using impersonation via `run_as_user`, the PYTHONPATH environment
variable is not propagated, hence there may be issues when depending on
specific custom packages used in DAGs.
This PR propagates only the PYTHONPATH in the process creating the
sub-process with impersonation, if any.
Tested in staging environment; impersonation tests in airflow are not
very portable and fixing them would take additional work, leaving as
TODO and tracking with jira ticket:
https://issues.apache.org/jira/browse/AIRFLOW-1901
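A hedged sketch of the idea (the helper, user, and command are illustrative, not the exact change): carry PYTHONPATH into the impersonated sub-process explicitly, since sudo drops most of the caller's environment by default:

```python
import os
import subprocess

def build_impersonated_command(run_as_user, airflow_cmd):
    """Illustrative only: prefix the task command with sudo for impersonation and,
    if the parent process has a PYTHONPATH, pass it along as an explicit VAR=value
    assignment on the sudo command line."""
    prefix = ["sudo", "-E", "-H", "-u", run_as_user]
    pythonpath = os.environ.get("PYTHONPATH")
    if pythonpath:
        prefix.append("PYTHONPATH={}".format(pythonpath))
    return prefix + airflow_cmd

# Hypothetical usage:
cmd = build_impersonated_command("etl_user", ["airflow", "run", "my_dag", "my_task", "2018-01-01"])
subprocess.check_call(cmd)
```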
Closes #2860 from edgarRd/erod-pythonpath_run_as_user
In all the popular languages the variable name `log` is the de facto
standard for logging. Rename LoggingMixin.py to logging_mixin.py
to comply with the Python standard.
When `.logger` is used, a deprecation warning will be emitted.
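A short sketch of what this looks like in operator code (the operator itself is made up):

```python
from airflow.models import BaseOperator

class MyOperator(BaseOperator):
    """Hypothetical operator showing the renamed attribute."""

    def execute(self, context):
        # Preferred: the `log` attribute provided by LoggingMixin.
        self.log.info("running task %s", self.task_id)
        # Deprecated: `self.logger` still works but emits a DeprecationWarning.
        self.logger.info("this still works, with a warning")
```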
Closes #2604 from Fokko/AIRFLOW-1604-logger-to-log
Here is the original PR with Max's LGTM:
https://github.com/aoen/incubator-airflow/pull/1
Since then I have made some fixes but this PR is essentially the same.
It could definitely use more eyes as there are likely still issues.
**Goals**
- Simplify, consolidate, and make consistent the logic of whether or not
a task should be run
- Provide a view/better logging that gives insight into why a task
instance is not currently running (no more viewing the scheduler logs
to find out why a task instance isn't running for the majority of
cases):
![image](https://cloud.githubusercontent.com/assets/1592778/17637621/aa669f5e-6099-11e6-81c2-d988d2073aac.png)
**Notable Functional Changes**
- Webserver view + task_failing_deps CLI command to explain why a given
task instance isn't being run by the scheduler (a sketch of querying the
failing dep statuses follows this list)
- Running a backfill in the command line and running a task in the UI
will now display detailed error messages based on which dependencies
were not met for a task instead of appearing to succeed but actually
failing silently
- Maximum task concurrency and pools are now respected by backfills
- Backfill now has the equivalent of the old force flag to run even for
successful tasks
This will break one use case:
Using pools to restrict some resource on airflow executors themselves
(rather than an external resource like a DB), e.g. some task uses 60%
of cpu on a worker so we restrict that task's pool size to 1 to
prevent two of the tasks from running on the same host. When
backfilling a task of this type, now the backfill will wait on the
pool to have slots open up before running the task even though we
don't need to do this if backfilling on a different host outside of
the pool. I think breaking this use case is OK since the use case is a
hack due to not having a proper resource isolation solution (e.g.
mesos should be used in this case instead).
- To make things less confusing for users, there is now an "ignore all
dependencies" option for running tasks; "ignore dependencies" has been
renamed to "ignore task dependencies", and "force" has been renamed to
"ignore task instance state". The new "Ignore all dependencies" flag
will ignore the following:
- task instance's pool being full
- execution date for a task instance being in the future
- a task instance being in the retry waiting period
- the task instance's task ending prior to the task instance's
execution date
- task instance is already queued
- task instance has already completed
- task instance is in the shutdown state
- WILL NOT IGNORE a task instance that is already running
- SLA miss emails will now include all tasks that did not finish for a
particular DAG run, even if the tasks didn't run because depends_on_past
was not met
- Tasks with pools won't get queued automatically the first time they
reach a worker; if they are ready to run they will be run immediately
- Running a task via the UI or via the command line (backfill/run
commands) will now log why a task could not get run if one of its
dependencies isn't met. For tasks kicked off via the web UI this
means that tasks don't silently fail to get queued despite a
successful message in the UI.
- Queuing a task into a pool that doesn't exist will now get stopped in
the scheduler instead of a worker
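As a sketch of what the explainer view/CLI surfaces (the method and attribute names here are assumptions about the dependency-engine API, not guaranteed): the failing dependency statuses for a task instance can be listed programmatically:

```python
from airflow.models import DagBag, TaskInstance

# Hypothetical usage: explain why a specific task instance is not being run.
dag = DagBag().get_dag("my_dag")
ti = TaskInstance(dag.get_task("my_task"), execution_date=dag.latest_execution_date)

for dep_status in ti.get_failed_dep_statuses():  # assumed helper from the dep engine
    print(dep_status.dep_name, "->", dep_status.reason)
```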
**Follow Up Items**
- Update the docs to reference the new explainer views/CLI command
Closes #1729 from aoen/ddavydov/blockedTIExplainerRebasedMaster