It was possible to "block" the scheduler so that it would not
schedule or queue tasks for a DAG if you triggered a DAG run while the
DAG was already at its maximum number of active runs.
This approach works around the problem for now, but a better longer-term
fix would be to introduce a "queued" state for DagRuns: when manually
creating dag runs (or clearing), set them to queued, and have only the
scheduler move DagRuns to running, nothing else -- this would mean we
wouldn't need to examine active runs in the TI part of the scheduler
loop, only in the DagRun creation part.
Fixes #11582
This was messing up the "max_active_runs" calculation, and this fix is a
"hack" until we take the better approach of adding a queued state to
DagRuns -- at which point we won't have to do this calculation at
all.
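For illustration, one possible shape of that future queued-state flow (a sketch only -- a "queued" DagRun state does not exist yet, and the function name and state strings here are purely illustrative):

```python
from airflow.models import DagRun


def promote_queued_dagruns(session, dag_id, max_active_runs):
    """Sketch: only the scheduler moves DagRuns from 'queued' to 'running'."""
    running = (
        session.query(DagRun)
        .filter(DagRun.dag_id == dag_id, DagRun.state == "running")
        .count()
    )
    slots = max_active_runs - running
    if slots <= 0:
        return  # already at max_active_runs; manual/cleared runs stay queued
    queued = (
        session.query(DagRun)
        .filter(DagRun.dag_id == dag_id, DagRun.state == "queued")
        .order_by(DagRun.execution_date)
        .limit(slots)
        .all()
    )
    for dag_run in queued:
        dag_run.state = "running"
```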
This PR introduces a creating_job_id column in the DagRun table that links a
DagRun to the job that created it. Part of #11302
Co-authored-by: Kaxil Naik <kaxilnaik@gmail.com>
Although these lists are short, there's no need to re-create them each
time, and also no need for them to be a method.
I have made them lowercase (`finished`, `running`) instead of uppercase
(`FINISHED`, `RUNNING`) to distinguish them from the actual states.
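Roughly, the change has this shape (a sketch -- the exact membership of each set here is illustrative, not copied from the real `State` class):

```python
class State:
    SUCCESS = "success"
    FAILED = "failed"
    UPSTREAM_FAILED = "upstream_failed"
    SKIPPED = "skipped"
    RUNNING = "running"
    QUEUED = "queued"

    # Previously these were methods that rebuilt a list on every call.
    # Now they are plain class-level frozensets, lowercase so they are
    # not mistaken for individual states such as State.RUNNING.
    finished = frozenset([SUCCESS, FAILED, UPSTREAM_FAILED, SKIPPED])
    running = frozenset([RUNNING, QUEUED])
```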
* Spend less time waiting for LocalTaskJob's subprocess to finish
This is about a 20% speed-up for short-running tasks!
This change doesn't affect the "duration" reported in the TI table, but
it does affect the time before the slot is freed up in the executor -
which does affect overall task/dag throughput.
(All these tests are with the same BashOperator tasks, just running `echo 1`.)
**Before**
```
Task airflow.executors.celery_executor.execute_command[5e0bb50c-de6b-4c78-980d-f8d535bbd2aa] succeeded in 6.597011625010055s: None
Task airflow.executors.celery_executor.execute_command[0a39ec21-2b69-414c-a11b-05466204bcb3] succeeded in 6.604327297012787s: None
```
**After**
```
Task airflow.executors.celery_executor.execute_command[57077539-e7ea-452c-af03-6393278a2c34] succeeded in 1.7728257849812508s: None
Task airflow.executors.celery_executor.execute_command[9aa4a0c5-e310-49ba-a1aa-b0760adfce08] succeeded in 1.7124666879535653s: None
```
**After, including change from #11372**
```
Task airflow.executors.celery_executor.execute_command[35822fc6-932d-4a8a-b1d5-43a8b35c52a5] succeeded in 0.5421732050017454s: None
Task airflow.executors.celery_executor.execute_command[2ba46c47-c868-4c3a-80f8-40adaf03b720] succeeded in 0.5469810889917426s: None
```
We seem to have a problem with running all tests at once - most
likely due to some resource problems in our CI - so it makes
sense to split the tests into more batches. This is not yet a full
implementation of selective tests, but it is a step in that direction,
splitting the tests into Core/Providers/API/CLI batches. The full selective
tests approach will be implemented as part of the #10507 issue.
This split is possible thanks to #10422, which moved building the image
to a separate workflow - this way each image is only built once
and uploaded to a shared registry, from which it is quickly
downloaded rather than being built by each job separately - this
way we can have many more jobs, as there is very little per-job
overhead before the tests start running.
* Fully support running more than one scheduler concurrently.
This PR implements scheduler HA as proposed in AIP-15. The high level
design is as follows:
- Move all scheduling decisions into SchedulerJob (requiring DAG
serialization in the scheduler)
- Use row-level locks to ensure schedulers don't stomp on each other
(`SELECT ... FOR UPDATE`)
- Use `SKIP LOCKED` for better performance when multiple schedulers are
running. (MySQL < 8 and MariaDB don't support this.) See the sketch after
this list.
- Scheduling decisions are not tied to the parsing speed, but can
operate just on the database
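As a rough illustration of that locking pattern, using SQLAlchemy's `with_for_update` (a sketch, not the exact query the scheduler runs):

```python
from airflow.models import DagRun
from airflow.utils.session import create_session

# Sketch: each scheduler locks a small batch of DagRun rows; rows already
# locked by another scheduler are skipped rather than waited on, so the
# schedulers never block (or stomp on) each other.
with create_session() as session:
    dag_runs = (
        session.query(DagRun)
        .filter(DagRun.state == "running")
        .with_for_update(skip_locked=True)
        .limit(10)
        .all()
    )
    for dag_run in dag_runs:
        ...  # make scheduling decisions while holding the row locks
```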
*DagFileProcessorProcess*:
Previously this component was responsible for more than its name might
imply: as well as parsing the DAG files, it was responsible for creating
DagRuns, and for making scheduling decisions about TIs, moving them from
"None" to "scheduled" state.
This commit changes it so that the DagFileProcessorProcess will now
update the SerializedDAG row for this DAG, and make no scheduling
decisions itself.
To make the scheduler's job easier (so that it can make as many
decisions as possible without having to load the possibly-large
SerializedDAG row) we store/update some columns on the DagModel table:
- `next_dagrun`: The execution_date of the next dag run that should be created (or
None)
- `next_dagrun_create_after`: The earliest point at which the next dag
run can be created
Pre-computing these values (and updating them every time the DAG is
parsed) reduces the overall load on the DB, as many decisions can be taken
by selecting just these two columns/the small DagModel row.
When a DAG is at max_active_runs, or for `@once` DAGs, these columns are
set to null, meaning "don't create any dag runs".
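A sketch of how the scheduler can then find DAGs that need a new run from these two small columns alone (the query and function name are illustrative; `session` is assumed to be an open SQLAlchemy session):

```python
from airflow.models import DagModel
from airflow.utils import timezone

# Sketch: a NULL next_dagrun_create_after means "create nothing", so a
# simple comparison against "now" is all that is needed per loop.
def dags_needing_new_dagruns(session, limit=10):
    return (
        session.query(DagModel)
        .filter(
            DagModel.is_paused.is_(False),
            DagModel.next_dagrun_create_after <= timezone.utcnow(),
        )
        .limit(limit)
        .all()
    )
```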
*SchedulerJob*
The SchedulerJob used to only queue/send tasks to the executor after
they were parsed, and returned from the DagFileProcessorProcess.
This PR breaks the link between parsing and enqueuing of tasks. Instead
of looking at DAGs as they are parsed, we now:
- store a new datetime column, `last_scheduling_decision` on DagRun
table, signifying when a scheduler last examined a DagRun
- Each time around the loop the scheduler will get (and lock) the next
_n_ DagRuns via `DagRun.next_dagruns_to_examine`, prioritising DagRuns
which haven't been touched by a scheduler in the longest period
- SimpleTaskInstance etc have been almost entirely removed now, as we
use the serialized versions
* Move callbacks execution from Scheduler loop to DagProcessorProcess
* Don’t run verify_integrity if the Serialized DAG hasn’t changed
dag_run.verify_integrity is slow, and we don't want to call it every time -- only when the dag structure changes (which we can now know thanks to DAG Serialization).
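A sketch of that check (assuming a `dag_run` and `session` in scope; `get_latest_version_hash` and the `dag_hash` column are used here as illustrative names for the stored serialized-DAG hash):

```python
from airflow.models.serialized_dag import SerializedDagModel

# Sketch: only pay the cost of verify_integrity when the DAG structure,
# as captured by the serialized DAG's hash, has actually changed.
latest_hash = SerializedDagModel.get_latest_version_hash(dag_run.dag_id, session=session)
if dag_run.dag_hash != latest_hash:
    dag_run.dag_hash = latest_hash
    dag_run.verify_integrity(session=session)
```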
* Add escape hatch to disable newly added "SELECT ... FOR UPDATE" queries
We are worried that these extra uses of row-level locking will cause
problems on MySQL 5.x (most likely deadlocks), so we are providing users
with an "escape hatch" to be able to make these queries non-locking -- this
means that only a single scheduler should be run, but being able to run
one is better than having the scheduler crash.
Co-authored-by: Kaxil Naik <kaxilnaik@gmail.com>
This PR allows for partial import error tracebacks to be exposed on the UI, if enabled. This extra context can be very helpful for users without access to the parsing logs to determine why their DAGs are failing to import properly.
This can happen when a task is enqueued by one executor, and then that
scheduler dies/exits.
The default fallback behaviour is unchanged -- queued tasks are
cleared and then later rescheduled.
But for Celery, we can do better -- if we record the Celery-generated
task_id, we can then re-create the AsyncResult objects for orphaned
tasks at a later date.
However, since Celery just reports all AsyncResult as "PENDING", even if
they aren't tasks currently in the broker queue, we need to apply a
timeout to "unblock" these tasks in case they never actually made it to
the Celery broker.
This all means that we can adopt tasks that have been enqueued by another
CeleryExecutor if it dies, without having to clear the task and slow
things down. This is especially useful as the task may have already started
running, and while clearing it would stop it, it's better if we don't
have to reset it!
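Roughly, adoption looks like the sketch below (`app` is assumed to be the executor's Celery app, `orphaned_task_instances` the queued TIs being handed over, and the recorded Celery task id is shown stored on the TaskInstance as `external_executor_id`; the timeout value is illustrative):

```python
import time

from celery.result import AsyncResult

# Sketch: re-create result handles for tasks a dead CeleryExecutor had
# already sent to the broker, keyed by the recorded Celery task id.
adopted = {}
for ti in orphaned_task_instances:
    adopted[ti.key] = (AsyncResult(ti.external_executor_id, app=app), time.monotonic())

# Celery reports unknown ids as PENDING too, so anything still PENDING
# after a grace period is assumed to have never reached the broker and
# is handed back to the scheduler to be rescheduled instead.
STALLED_TIMEOUT = 600  # seconds
for key, (result, adopted_at) in list(adopted.items()):
    if result.state == "PENDING" and time.monotonic() - adopted_at > STALLED_TIMEOUT:
        del adopted[key]
```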
Co-authored-by: Kaxil Naik <kaxilnaik@apache.org>
Once HA mode for the scheduler lands, we can no longer reset orphaned
tasks by looking at the tasks in (the memory of) the current executor.
This changes it to keep track of which (Scheduler)Job queued/scheduled a
TaskInstance (the new "queued_by_job_id" column stored on the
TaskInstance table), and then we can use the existing heartbeat
mechanism for jobs to notice when a TI should be reset.
As part of this, the existing implementation of
`reset_state_for_orphaned_tasks` has been moved out of BaseJob into
BackfillJob -- as only it and SchedulerJob had these methods, and the
SchedulerJob version now operates differently.
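A hedged sketch of the detection query (illustrative only -- the threshold and exact filters differ in the real code; `session` is assumed to be in scope):

```python
from datetime import timedelta

from airflow.jobs.base_job import BaseJob
from airflow.models import TaskInstance
from airflow.utils import timezone
from airflow.utils.state import State

# Sketch: a queued TI is considered orphaned if the job recorded in its
# queued_by_job_id has vanished or stopped heartbeating.
heartbeat_limit = timezone.utcnow() - timedelta(seconds=30)  # illustrative threshold
orphaned_tis = (
    session.query(TaskInstance)
    .outerjoin(BaseJob, TaskInstance.queued_by_job_id == BaseJob.id)
    .filter(TaskInstance.state == State.QUEUED)
    .filter((BaseJob.id.is_(None)) | (BaseJob.latest_heartbeat < heartbeat_limit))
    .all()
)
```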
We've observed the tests for the last couple of weeks and it seems
most of the tests marked with the "quarantine" marker are succeeding
in a stable way (https://github.com/apache/airflow/issues/10118).
The removed tests have a success ratio of > 95% (20 runs without
problems), and this was verified a week ago as well,
so it seems they are rather stable.
There are literally a few that are either failing or causing
the Quarantined builds to hang. I manually reviewed the
master tests that failed over the last few weeks and added the
tests that were causing the builds to hang.
It seems that stability has improved - which might have been caused
by some temporary problems around the time we marked the quarantined builds,
or by a too "generous" way of marking tests as quarantined. Or
maybe the improvement comes from #10368, as the Docker engine
and machines used to run the builds in GitHub experience far
less load (image builds are executed in separate builds), so
it might be that resource usage has decreased. Another reason
might be GitHub Actions stability improvements.
Or simply those tests are more stable when run in isolation.
We might still add failing tests back as soon as we see them behave
in a flaky way.
The remaining quarantined tests that need to be fixed:
* test_local_run (often hangs the build)
* test_retry_handling_job
* test_clear_multiple_external_task_marker
* test_should_force_kill_process
* test_change_state_for_tis_without_dagrun
* test_cli_webserver_background
We also move some of those tests to the "heisentests" category.
Those tests run fine in isolation but fail
the builds when run with all other tests:
* TestImpersonation tests
We might find that those heisentests can be fixed, but for
now we are going to run them in isolation.
Also - since those quarantined tests are failing more often,
the "num runs" to track for them has been decreased to 10,
so we keep track of the last 10 runs only.
The `@provide_session` wrapper will already commit the transaction when
the wrapped function returns, unless an explicit session is passed in --
removing this parameter changes the behaviour to be:
- If session explicitly passed in: don't commit (caller's
responsibility)
- If no session passed in, `@provide_session` will commit for us already.
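A simplified sketch of what the decorator does (not the exact Airflow implementation):

```python
import functools

from airflow.settings import Session  # Airflow-configured sessionmaker


def provide_session(func):
    """Simplified sketch of @provide_session's commit semantics."""

    @functools.wraps(func)
    def wrapper(*args, session=None, **kwargs):
        if session is not None:
            # Caller supplied a session: use it, and leave commit/rollback
            # to the caller.
            return func(*args, session=session, **kwargs)
        # No session supplied: create one, commit on success, always close.
        new_session = Session()
        try:
            result = func(*args, session=new_session, **kwargs)
            new_session.commit()
            return result
        except Exception:
            new_session.rollback()
            raise
        finally:
            new_session.close()

    return wrapper
```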
Perf_kit was a separate folder and it was a problem when we tried to
build it from Docker-embedded sources, because there was a hidden,
implicit dependency between tests (conftest) and perf.
Perf_kit is now moved into tests so that it is available in the CI image
also when we run tests without the sources mounted.
This is changing back in #10441, and we need to move perf_kit
for that to work.
* Query TaskReschedule only if task is UP_FOR_RESCHEDULE
* Query for single TaskReschedule when possible
* Apply suggestions from code review
Co-authored-by: Stefan Seelmann <mail@stefan-seelmann.de>
* Adjust mocking in tests
* fixup! Adjust mocking in tests
* fixup! fixup! Adjust mocking in tests
Co-authored-by: Stefan Seelmann <mail@stefan-seelmann.de>
As part of the scheduler HA work we are going to want to separate the
parsing from the scheduling, so this changes the tests to ensure that
the important methods of DagFileProcessor can do everything they need to
when given a SerializedDAG, not just a DAG -- i.e. that we have correctly
serialized all the necessary fields.
Both SchedulerJob and LocalTaskJob have their own timers and decide when
to call heartbeat based upon them. This makes those functions harder to
follow (and the logs more confusing), so I've moved the logic into BaseJob.
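The consolidated logic amounts to something like this (a simplified sketch of rate-limited heartbeating on BaseJob, with names slightly abstracted from the real code):

```python
from airflow.utils import timezone


def heartbeat(self, only_if_necessary: bool = False):
    """Sketch: callers just call heartbeat(); BaseJob decides if one is due."""
    if only_if_necessary and self.latest_heartbeat:
        elapsed = (timezone.utcnow() - self.latest_heartbeat).total_seconds()
        if elapsed < self.heartrate:
            return  # a recent heartbeat already covers this interval
    self.latest_heartbeat = timezone.utcnow()
    self.heartbeat_callback()  # subclass hook (SchedulerJob, LocalTaskJob, ...)
```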
In debugging another test I noticed that the scheduler was spending a
long time waiting for a "simple" dag to be parsed. But upon closer
inspection, the parsing process itself was done in a few milliseconds --
we just weren't harvesting the results in a timely fashion.
This change uses the `sentinel` attribute of multiprocessing.Connection
(added in Python 3.3) to be able to wait for all the processes, so that
as soon as one has finished we get woken up and can immediately harvest
and pass on the parsed dags.
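A hedged sketch of the waiting pattern (simplified; `waitables`, `poll_interval`, and `harvest_result` are illustrative names -- the real manager tracks per-file processors and their pipes):

```python
from multiprocessing.connection import wait

# Sketch: block until at least one parsing process has something for us,
# instead of polling on a fixed interval.
while waitables:
    ready = wait(list(waitables), timeout=poll_interval)
    for waitable in ready:
        processor = waitables.pop(waitable)
        harvest_result(processor)  # collect the freshly parsed DAGs
```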
This makes test_scheduler_job.py about twice as quick, and also reduces
the time the scheduler spends between tasks.
In real workloads, or where there are lots of dags, this likely won't
amount to such a huge speed-up, but it does for our (synthetic) elastic
performance test dag.
These were the timings for the dag to run all the tasks in a single dag
run to completion, with PERF_SCHEDULE_INTERVAL='1d' and PERF_DAGS_COUNT=1,
varying the shape and task count:
PERF_SHAPE=linear PERF_TASKS_COUNT=12:
**Before**: 45.4166s
**After**: 16.9499s
PERF_SHAPE=linear PERF_TASKS_COUNT=24:
**Before**: 82.6426s
**After**: 34.0672s
PERF_SHAPE=binary_tree PERF_TASKS_COUNT=24:
**Before**: 20.3802s
**After**: 9.1400s
PERF_SHAPE=grid PERF_TASKS_COUNT=24:
**Before**: 27.4735s
**After**: 11.5607s
If you have many more dag **files**, this likely won't be your bottleneck.
We now have a mechanism to keep the release notes for the
backport operators updated in an automated way.
It really nicely generates all the necessary information:
* summary of requirements for each backport package
* list of dependencies (including extras to install them) when package
depends on other providers packages
* table of new hooks/operators/sensors/protocols/secrets
* table of moved hooks/operators/sensors/protocols/secrets with
information where they were moved from
* changelog of all the changes to the provider package (this will be
automatically updated with an incremental changelog whenever we decide to
release separate packages)
The system is fully automated - we will be able to produce release notes
automatically (per-package) whenever we decide to release a new version of
a package in the future.
* Set conf vals as env vars so spawned process can access values.
* Create custom env_vars context manager to control simple environment variables (see the sketch after this list).
* Use env_vars instead of conf_vars when using .
* When creating temporary environment variables, remove them if they didn't exist.
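The env_vars helper amounts to something like the sketch below (illustrative; the real test helper may differ in details):

```python
import os
from contextlib import contextmanager


@contextmanager
def env_vars(overrides):
    """Temporarily set environment variables, restoring (or removing) them on exit."""
    original = {}
    for key, value in overrides.items():
        original[key] = os.environ.get(key)  # None means "was not set before"
        os.environ[key] = value
    try:
        yield
    finally:
        for key, old_value in original.items():
            if old_value is None:
                os.environ.pop(key, None)  # didn't exist before, so remove it
            else:
                os.environ[key] = old_value
```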
I would like to create and use a pytest fixture as a parameter, but
they cannot be used on unittest.TestCase functions:
> unittest.TestCase methods cannot directly receive fixture arguments as
> implementing that is likely to inflict on the ability to run general
> unittest.TestCase test suites.