* Spend less time waiting for LocalTaskJob's subprocess to finish
This is about a 20% speed-up for short-running tasks!
This change doesn't affect the "duration" reported in the TI table, but
it does affect the time before the slot is freed up in the executor -
which does affect overall task/DAG throughput.
(All these tests are with the same BashOperator tasks, just running `echo 1`.)
**Before**
```
Task airflow.executors.celery_executor.execute_command[5e0bb50c-de6b-4c78-980d-f8d535bbd2aa] succeeded in 6.597011625010055s: None
Task airflow.executors.celery_executor.execute_command[0a39ec21-2b69-414c-a11b-05466204bcb3] succeeded in 6.604327297012787s: None
```
**After**
```
Task airflow.executors.celery_executor.execute_command[57077539-e7ea-452c-af03-6393278a2c34] succeeded in 1.7728257849812508s: None
Task airflow.executors.celery_executor.execute_command[9aa4a0c5-e310-49ba-a1aa-b0760adfce08] succeeded in 1.7124666879535653s: None
```
**After, including change from #11372**
```
Task airflow.executors.celery_executor.execute_command[35822fc6-932d-4a8a-b1d5-43a8b35c52a5] succeeded in 0.5421732050017454s: None
Task airflow.executors.celery_executor.execute_command[2ba46c47-c868-4c3a-80f8-40adaf03b720] succeeded in 0.5469810889917426s: None
```
We seem to have a problem with running all tests at once - most
likely due to some resource problems in our CI - therefore it makes
sense to split the tests into more batches. This is not yet a full
implementation of selective tests, but it is going in this direction
by splitting into Core/Providers/API/CLI tests. The full selective
tests approach will be implemented as part of the #10507 issue.
This split is possible thanks to #10422, which moved building the image
to a separate workflow - this way each image is only built once
and is uploaded to a shared registry, from which it is quickly
downloaded rather than being built by each job separately - this
way we can have many more jobs, as there is very little per-job
overhead before the tests start running.
* KubernetesPodOperator can retry log tailing in case of interruption
* fix failing test
* change read_pod_logs method formatting
* KubernetesPodOperator retry log tailing based on last read log timestamp (see the sketch below)
* fix test_parse_log_line test formatting
* add docstring to parse_log_line method
* fix kubernetes integration test
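As a rough illustration of the idea (a sketch only - the exact parse_log_line signature and the kubernetes client arguments used by the operator may differ), each log line carries a timestamp that can be remembered so tailing can resume from the last read line after an interruption:

```python
from typing import Optional, Tuple


def parse_log_line(line: str) -> Tuple[Optional[str], str]:
    """Split a Kubernetes log line of the form '<timestamp> <message>'."""
    split_at = line.find(' ')
    if split_at == -1:
        # No timestamp found - return the raw line so nothing is lost.
        return None, line
    return line[:split_at], line[split_at + 1:].rstrip()


# On interruption, the operator can remember the last parsed timestamp and
# re-request logs from that point (e.g. via a since_time/since_seconds style
# parameter of the Kubernetes logs API - the exact parameter is an assumption here).
```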
The custom ClusterPolicyViolation was added in #10282.
This one adds a more comprehensive test for it.
Co-authored-by: Jacob Ferriero <jferriero@google.com>
* Fully support running more than one scheduler concurrently.
This PR implements scheduler HA as proposed in AIP-15. The high level
design is as follows:
- Move all scheduling decisions into SchedulerJob (requiring DAG
serialization in the scheduler)
- Use row-level locks to ensure schedulers don't stomp on each other
(`SELECT ... FOR UPDATE`)
- Use `SKIP LOCKED` for better performance when multiple schedulers are
running. (MySQL < 8 and MariaDB don't support this; see the sketch below.)
- Scheduling decisions are not tied to the parsing speed, but can
operate just on the database
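As a hedged sketch of the row-level locking approach (not the scheduler's actual query), SQLAlchemy exposes `SELECT ... FOR UPDATE SKIP LOCKED` via `with_for_update`:

```python
from airflow.models.dagrun import DagRun
from sqlalchemy.orm import Session


def lock_dagruns_for_scheduling(session: Session, limit: int = 10):
    # Each scheduler locks a small batch of DagRun rows; skip_locked=True makes
    # other schedulers skip rows that are already locked instead of blocking
    # on them (not available on MySQL < 8 or MariaDB).
    return (
        session.query(DagRun)
        .filter(DagRun.state == "running")
        .limit(limit)
        .with_for_update(skip_locked=True)
        .all()
    )
```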
*DagFileProcessorProcess*:
Previously this component was responsible for more than just parsing the
DAG files (as its name might imply). It was also responsible for creating
DagRuns and for making scheduling decisions for TIs, moving them from
the "None" state to "scheduled".
This commit changes it so that the DagFileProcessorProcess now will
update the SerializedDAG row for this DAG, and make no scheduling
decisions itself.
To make the scheduler's job easier (so that it can make as many
decisions as possible without having to load the possibly-large
SerializedDAG row) we store/update some columns on the DagModel table:
- `next_dagrun`: The execution_date of the next dag run that should be created (or
None)
- `next_dagrun_create_after`: The earliest point at which the next dag
run can be created
Pre-computing these values (and updating them every time the DAG is
parsed) reduces the overall load on the DB, as many decisions can be taken
by selecting just these two columns/the small DagModel row.
In the case of max_active_runs or `@once`, these columns will be set to
null, meaning "don't create any dag runs".
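A hedged sketch of the kind of cheap query this enables (not the scheduler's actual code): deciding which DAGs need a new DagRun only touches the small DagModel row, never the serialized DAG blob:

```python
from airflow.models.dag import DagModel
from airflow.utils import timezone


def dags_needing_dagruns(session, limit: int = 10):
    # Only the pre-computed columns are consulted; NULL values (max_active_runs
    # reached, or @once already run) drop out of the comparison automatically.
    now = timezone.utcnow()
    return (
        session.query(DagModel)
        .filter(
            DagModel.is_paused.is_(False),
            DagModel.next_dagrun_create_after <= now,
        )
        .limit(limit)
        .all()
    )
```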
*SchedulerJob*
The SchedulerJob used to only queue/send tasks to the executor after
they were parsed, and returned from the DagFileProcessorProcess.
This PR breaks the link between parsing and enqueuing of tasks. Instead
of looking at DAGs as they are parsed, we now:
- store a new datetime column, `last_scheduling_decision` on DagRun
table, signifying when a scheduler last examined a DagRun
- Each time around the loop the scheduler will get (and lock) the next
_n_ DagRuns via `DagRun.next_dagruns_to_examine`, prioritising DagRuns
which haven't been touched by a scheduler in the longest period
- SimpleTaskInstance etc have been almost entirely removed now, as we
use the serialized versions
* Move callbacks execution from Scheduler loop to DagProcessorProcess
* Don’t run verify_integrity if the Serialized DAG hasn’t changed
dag_run.verify_integrity is slow, and we don't want to call it every time - only when the dag structure changes (which we can now detect thanks to DAG Serialization).
* Add escape hatch to disable newly added "SELECT ... FOR UPDATE" queries
We are worried that these extra uses of row-level locking will cause
problems on MySQL 5.x (most likely deadlocks), so we are providing users
an "escape hatch" to be able to make these queries non-locking -- this
means that only a single scheduler should be run, but being able to run
one is better than having the scheduler crash.
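A hedged sketch of how such a switch can be read (the option name used here is an assumption; check the released configuration reference for the final spelling):

```python
from airflow.configuration import conf

# If row-level locking is switched off, the SELECT ... FOR UPDATE queries are
# skipped - which in turn means only a single scheduler should be running.
use_row_level_locking = conf.getboolean(
    "scheduler", "use_row_level_locking", fallback=True
)
```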
Co-authored-by: Kaxil Naik <kaxilnaik@gmail.com>
This is similar to #11327, but for Celery this time.
The impact is not quite as pronounced here (for simple dags at least),
but it takes the average queued-to-start delay from 1.5s to 0.4s.
Spawning a whole new python process and then re-loading all of Airflow
is expensive. Although this overhead fades to insignificance for long
running tasks, the delay gives a "bad" experience for new users when
they are just trying out Airflow for the first time.
For the LocalExecutor this cuts the "queued time" down from 1.5s to 0.1s
on average.
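A minimal sketch of the idea (names hypothetical, POSIX only): fork the worker process, which already has Airflow imported, instead of spawning a fresh interpreter per task:

```python
import os


def run_task_by_forking(run_task) -> int:
    """Run a task callable in a forked child and return its exit code.

    Sketch only: the real executor also re-initialises settings, logging and
    signal handlers in the child before running the task.
    """
    pid = os.fork()
    if pid == 0:
        # Child: run the task, then exit without unwinding parent state.
        ret = 1
        try:
            ret = int(run_task() or 0)
        finally:
            os._exit(ret)
    # Parent: wait for the child and report its exit status.
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status)
```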
* Add type annotations to ZendeskHook
__What__
* Add correct type annotations to ZendeskHook and each method
* Update one unit test to pass an empty dictionary rather than
None, since the argument should be a dictionary
__Why__
* Building out type annotations is good for the code base
* The query parameter is accessed with an index at one point, which
means that it cannot be None, but should rather be defaulted to
an empty dictionary if not provided
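A hedged sketch of that pattern (the function and parameter names here are illustrative, not the exact ZendeskHook API):

```python
from typing import Any, Dict, Optional


def call_api(path: str, query: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    # Default to an empty dict so indexed access below can never hit None.
    query = query or {}
    sort_order = query.get("sort_order", "desc")
    return {"path": path, "query": query, "sort_order": sort_order}
```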
* Remove useless return
We don't currently create TIs from serialized DAGs, but we are about to
start -- at which point some of these cases would have just shown
"SerializedBaseOperator", rather than the _real_ class name.
The other changes are just for "consistency" -- we should always get the
task type from this property, not via `__class__.__name__`.
I haven't set up a pre-commit rule for this, as this dunder accessor is
used elsewhere on things that are not BaseOperator instances, and
detecting that is hard to do in a pre-commit rule.
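A short hedged sketch of the difference: `task_type` is a property that still reports the original operator class for tasks re-created from a serialized DAG, while `__class__.__name__` would report the wrapper class:

```python
def describe(task) -> str:
    # For a deserialized task, task.__class__.__name__ would be
    # "SerializedBaseOperator"; task.task_type still returns e.g. "BashOperator".
    return f"{task.task_id} ({task.task_type})"
```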
Resolves #10953.
A refreshed UI for the 2.0 release. The existing "theming" is a bit long in the tooth, and this PR attempts to give it a modern look and some freshness to complement all of the new features under the hood.
The majority of the changes to UI have been done through updates to the Bootstrap theme contained in bootstrap-theme.css. These are simply overrides to the default stylings that are packaged with Bootstrap.
This PR allows for partial import error tracebacks to be exposed on the UI, if enabled. This extra context can be very helpful for users without access to the parsing logs to determine why their DAGs are failing to import properly.
* Fixes an issue where cycle detection uses recursion
and overflows the stack after about 1000 tasks (see the iterative sketch below)
(cherry picked from commit 63f1a180a17729aa937af642cfbf4ddfeccd1b9f)
* reduce test length
* slightly more efficient
* Update airflow/utils/dag_cycle_tester.py
Co-authored-by: Kaxil Naik <kaxilnaik@gmail.com>
* slightly more efficient
* actually works this time
Co-authored-by: Daniel Imberman <daniel@astronomer.io>
Co-authored-by: Kaxil Naik <kaxilnaik@gmail.com>
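A self-contained sketch of the approach (not the exact code in dag_cycle_tester.py): replace recursive DFS with an explicit stack so deep DAGs cannot exceed Python's recursion limit:

```python
def has_cycle(adjacency: dict) -> bool:
    """Iterative DFS cycle check over a {task_id: [downstream_task_ids]} mapping."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {node: WHITE for node in adjacency}
    for children in adjacency.values():
        for child in children:
            color.setdefault(child, WHITE)

    for start in adjacency:
        if color[start] != WHITE:
            continue
        stack = [start]
        while stack:
            node = stack[-1]
            if color[node] == WHITE:
                color[node] = GRAY
            for child in adjacency.get(node, []):
                if color[child] == GRAY:
                    return True      # back-edge: a cycle exists
                if color[child] == WHITE:
                    stack.append(child)
                    break
            else:
                color[node] = BLACK  # all children done
                stack.pop()
    return False


assert has_cycle({"a": ["b"], "b": ["c"], "c": ["a"]})
assert not has_cycle({"a": ["b", "c"], "b": [], "c": []})
```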
Example output (I forced one of the existing tests to fail)
```
E AssertionError: The expected number of db queries is 3. The current number is 2.
E
E Recorded query locations:
E scheduler_job.py:_run_scheduler_loop>scheduler_job.py:_emit_pool_metrics>pool.py:slots_stats:94: 1
E scheduler_job.py:_run_scheduler_loop>scheduler_job.py:_emit_pool_metrics>pool.py:slots_stats:101: 1
```
This makes it a bit easier to see what the queries are, without having
to re-run with full query tracing and then analyze the logs.
This can have *extremely* bad consequences. After this change, a jinja2
template like the one below will cause the task instance to fail, if the
DAG being executed is not a sub-DAG. This may also display an error on
the Rendered tab of the Task Instance page.
task_instance.xcom_pull('z', key='return_value', dag_id=dag.parent_dag.dag_id)
Prior to the change in this commit, the above template would pull the
latest value for task_id 'z', for the given execution_date, from *any DAG*.
If your task_ids between DAGs are all unique, or if DAGs using the same
task_id always have different execution_date values, this will appear to
act like dag_id=None.
Our current theory is SQLAlchemy/Python doesn't behave as expected when
comparing `jinja2.Undefined` to `None`.
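A tiny hedged sketch of that theory (illustrative only): the rendered value is a jinja2 Undefined sentinel, which neither `is None` nor `== None` treats as "no dag_id given":

```python
import jinja2

undefined = jinja2.Undefined(name="parent_dag")

print(undefined is None)   # False
print(undefined == None)   # also False - Undefined defines its own __eq__
```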
* Added support for encrypted private keys in SSHHook (see the sketch below)
* Fixed Styling issues and added unit testing
* fixed last pylint styling issue by adding newline to the end of the file
* re-fixed newline issue for pylint checks
* fixed pep8 styling issues and black formatted files to pass static checks
* added comma as per suggestion to fix static check
Co-authored-by: Nadim Younes <nyounes@kobo.com>
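A hedged sketch of the underlying paramiko call (not the exact SSHHook code): an encrypted private key needs its passphrase supplied when the key material is loaded:

```python
import io

import paramiko


def load_private_key(pem_text: str, passphrase: str) -> paramiko.PKey:
    # Without a password, loading an encrypted key raises PasswordRequiredException.
    return paramiko.RSAKey.from_private_key(io.StringIO(pem_text), password=passphrase)
```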
This PR adds the possibility to define template_fields_renderers for an
operator. In this way users will be able to specify which lexer should
be used for rendering a particular field. This is super useful for
custom operators and gives more flexibility than the predefined
keywords (see the example below).
Co-authored-by: Kamil Olszewski <34898234+olchas@users.noreply.github.com>
Co-authored-by: Felix Uellendall <feluelle@users.noreply.github.com>
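A minimal hedged example of what this enables on a custom operator (the operator and field names are illustrative):

```python
from airflow.models import BaseOperator


class MyTransformOperator(BaseOperator):
    template_fields = ("sql", "config")
    # Tell the UI which lexer to use when rendering each templated field.
    template_fields_renderers = {"sql": "sql", "config": "json"}

    def __init__(self, *, sql: str, config: dict, **kwargs):
        super().__init__(**kwargs)
        self.sql = sql
        self.config = config

    def execute(self, context):
        self.log.info("Running %s with config %s", self.sql, self.config)
```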
closes: #10725
Make sure SkipMixin.skip_all_except() handles empty branches like this properly. When "task1" is followed, "join" must not be skipped even though it is considered to be immediately downstream of "branch".
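A hedged sketch of that DAG shape (imports as in Airflow 2.0; task names illustrative):

```python
from airflow import DAG
from airflow.operators.dummy import DummyOperator
from airflow.operators.python import BranchPythonOperator
from airflow.utils.dates import days_ago

with DAG("branch_empty_path", start_date=days_ago(1), schedule_interval=None) as dag:
    branch = BranchPythonOperator(task_id="branch", python_callable=lambda: "task1")
    task1 = DummyOperator(task_id="task1")
    join = DummyOperator(task_id="join", trigger_rule="none_failed_or_skipped")

    branch >> task1 >> join
    branch >> join  # the "empty" branch: join is directly downstream of branch
```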
In Pandas version 1.1.2 the experimental NaN value started to be
returned instead of None in a number of places. That broke our tests.
Fixing the tests also requires Pandas to be updated to >= 1.1.2.
GitHub Actions allows using the `fromJson` method to read arrays
or even more complex JSON objects into the CI workflow yaml files.
This, combined with set-output commands, allows us to read the
list of allowed versions as well as the default ones from the
environment variables configured in
./scripts/ci/libraries/initialization.sh
This means that we can have one place in which versions are
configured. We also need to do it in "breeze-complete", as this is
a standalone script that should not source anything. We added
BATS tests to verify that the versions in breeze-complete
correspond with those defined in initialization.sh.
Also we no longer limit tests in regular PRs - we run
all combinations of available versions. Our tests run quite a
bit faster now, so we should be able to run more complete
matrices. We can still exclude individual values of the matrices
if this is too much.
MySQL 8 is disabled from Breeze for now. I plan a separate follow-up
PR in which we will run MySQL 8 tests (they were not run so far).
This commit introduces TaskGroup, which is a simple UI task grouping concept.
- TaskGroups can be collapsed/expanded in Graph View when clicked
- TaskGroups can be nested
- TaskGroups can be put upstream/downstream of tasks or other TaskGroups with >> and << operators (see the example below)
- Search box, hovering, focusing in Graph View treats TaskGroup properly. E.g. searching for tasks also highlights TaskGroup that contains matching task_id. When TaskGroup is expanded/collapsed, the affected TaskGroup is put in focus and moved to the centre of the graph.
What this commit does not do:
- This commit does not change or remove SubDagOperator. Although TaskGroup is intended as an alternative for SubDagOperator, deprecating SubDagOperator will need to be discussed/implemented in the future.
- This PR only implements TaskGroup handling in the Graph View. In places such as Tree View, it will look as if
TaskGroup does not exist and all tasks are in the same flat DAG.
GitHub Issue: #8078
AIP: https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-34+TaskGroup%3A+A+UI+task+grouping+concept+as+an+alternative+to+SubDagOperator
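A short hedged example of the API described above (import path as introduced by this commit):

```python
from airflow import DAG
from airflow.operators.dummy import DummyOperator
from airflow.utils.dates import days_ago
from airflow.utils.task_group import TaskGroup

with DAG("task_group_example", start_date=days_ago(1), schedule_interval=None) as dag:
    start = DummyOperator(task_id="start")
    end = DummyOperator(task_id="end")

    # The group is shown as a single collapsible node in Graph View.
    with TaskGroup(group_id="extract") as extract:
        DummyOperator(task_id="fetch") >> DummyOperator(task_id="clean")

    start >> extract >> end  # groups compose with >> / << just like tasks
```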
multiprocessing.Process is set up in a very unfortunate manner
that pretty much makes it impossible to test a class that inherits from
Process or to use any of its internal functions. For this reason we decided
to separate the actual process-based functionality into a class member (see the sketch below).
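A hedged sketch of that refactor pattern (names hypothetical): keep the logic in a plain class that tests can call directly, and hand only a bound method to multiprocessing.Process:

```python
import multiprocessing


class DagParsingRunner:
    """Plain class - its methods can be unit-tested without any process machinery."""

    def __init__(self, file_path: str):
        self.file_path = file_path

    def run(self) -> None:
        print(f"parsing {self.file_path}")  # the actual work would go here


def start_in_subprocess(runner: DagParsingRunner) -> multiprocessing.Process:
    # Composition instead of inheriting from Process keeps the logic testable.
    process = multiprocessing.Process(target=runner.run, name="dag-parser")
    process.start()
    return process
```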
* Fetching databricks host from connection if not supplied in extras.
* Fixing formatting issue in databricks test
Co-authored-by: joshi95 <shubham@playsimple.in>
* Simplify Airflow on Kubernetes Story
Removes thousands of lines of code that essentially amount to us
re-creating the Kubernetes API. Will offer a faster, simpler
KubernetesExecutor for 2.0
* Fix podgen tests
* fix documentation
* simplify validate function
* @mik-laj comments
* spellcheck
* spellcheck
* Update airflow/executors/kubernetes_executor.py
Co-authored-by: Kaxil Naik <kaxilnaik@gmail.com>
Co-authored-by: Kaxil Naik <kaxilnaik@gmail.com>
It seems that the test_find_not_should_ignore_path test has some
dependency on side-effects from other tests.
See #10988 - we are moving this test to heisentests until we
solve the issue.
Changed `Is` to `Passed`
Before:
```
ERROR: Allowed backend: [ sqlite mysql postgres ]. Is: 'dpostgres'.
Switch to supported value with --backend flag.
```
After:
```
ERROR: Allowed backend: [ sqlite mysql postgres ]. Passed: 'dpostgres'.
Switch to supported value with --backend flag.
```
This can happen when a task is enqueued by one executor, and then that
scheduler dies/exits.
The default fallback behaviour is unchanged -- queued tasks are
cleared and then later rescheduled.
But for Celery, we can do better -- if we record the Celery-generated
task_id, we can then re-create the AsyncResult objects for orphaned
tasks at a later date.
However, since Celery just reports all AsyncResult as "PENDING", even if
they aren't tasks currently in the broker queue, we need to apply a
timeout to "unblock" these tasks in case they never actually made it to
the Celery broker.
This all means that we can adopt tasks that have been enqueued by another
CeleryExecutor if it dies, without having to clear the task and slow
down. This is especially useful as the task may have already started
running, and while clearing it would stop it, it's better if we don't
have to reset it!
Co-authored-by: Kaxil Naik <kaxilnaik@apache.org>
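A hedged sketch of the adoption idea (not the executor's actual code): with the Celery task_id persisted, an AsyncResult can be rebuilt and polled later, with a timeout guarding against results that stay "PENDING" because they never reached the broker:

```python
import time

from celery.result import AsyncResult


def check_adopted_task(celery_task_id: str, adopted_at: float, pending_timeout: float = 600.0) -> str:
    result = AsyncResult(celery_task_id)  # re-created from the stored id
    state = result.state
    if state == "PENDING" and time.monotonic() - adopted_at > pending_timeout:
        # Celery reports unknown ids as PENDING too, so after the timeout we
        # give up and let the task instance be cleared and rescheduled.
        return "stalled"
    return state
```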
* Ensure that K8sPodOperator can pull namespace from pod_template_file
Fixes a bug where users running K8sPodOperator could not run it because
the operator was expecting a namespace parameter
* add test
* self.pod
* Update airflow/providers/cncf/kubernetes/operators/kubernetes_pod.py
Co-authored-by: Kamil Breguła <mik-laj@users.noreply.github.com>
* don't create pod until run
* spellcheck
Co-authored-by: Kamil Breguła <mik-laj@users.noreply.github.com>
TestApiKerberos::test_trigger_dag previously depended on the `example_bash_operator` DAG existing in the database.
If one of the other tests didn't write it to the DB, or if one of the other tests cleared it from the DB, this test failed.
It consists of CeleryExecutor and KubernetesExecutor, which allows users
to route their tasks to either Kubernetes or Celery based on the queue
defined on a task
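A hedged example of the routing behaviour (the queue name that selects Kubernetes is configurable; "kubernetes" is used here purely as an illustration):

```python
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.utils.dates import days_ago

with DAG("celery_k8s_routing", start_date=days_ago(1), schedule_interval=None) as dag:
    # Default queue -> handled by the Celery side of the executor.
    quick = BashOperator(task_id="quick", bash_command="echo 1")

    # Matches the configured Kubernetes queue -> launched as a pod instead.
    isolated = BashOperator(task_id="isolated", bash_command="echo heavy", queue="kubernetes")

    quick >> isolated
```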
- Instead of supporting only an Admin user in the base test class, you can also use a normal User or Viewer
- Only add users when they are being used so we can do a little less in the setup phase (minor speedup in TestDagACLView)
__lshift__ and __rshift__ methods should return `other`, not `self`.
This PR fixes the XComArg implementation to support chains like this one:
BaseOperator >> XComArg >> BaseOperator
Related to: #10153
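The gist of the fix, as a hedged stand-alone sketch (not the actual XComArg code):

```python
class ShiftableArg:
    """Minimal stand-in showing why __rshift__/__lshift__ must return ``other``."""

    def set_downstream(self, other):
        print(f"{self} -> {other}")

    def set_upstream(self, other):
        print(f"{other} -> {self}")

    def __rshift__(self, other):
        self.set_downstream(other)
        return other  # returning self here would break `a >> b >> c` chains

    def __lshift__(self, other):
        self.set_upstream(other)
        return other
```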
If a task failed hard on Celery, _before_ being able to execute the
Airflow code, the task would end up stuck in the queued state. This change
makes it get retried.
This was discovered in load testing the HA work (but unrelated to HA
changes), where I swamped the kube-dns pod, meaning the worker was
sometimes unable to resolve the db name via DNS, so the state in the DB
was never updated
"airflow.providers.amazon.aws.secrets.secrets_manager." "SecretsManagerBackend.get_conn_uri"
to
"airflow.providers.amazon.aws.secrets.secrets_manager.SecretsManagerBackend.get_conn_uri"
* Modify helm chart to use pod_template_file
Since we are deprecating most k8sexecutor arguments
we should use the pod_template_file when launching airflow
using the KubernetesExecutor
* fix tests
* one more nit
* fix dag command
* fix pylint
The SmartSensor PR introduced slightly different behaviour in
list_py_files when given a file path directly.
Prior to that PR, if given a file path it would not include examples.
After that PR was merged, it would return that path and the example DAGs
(assuming they were enabled).
Once HA mode for the scheduler lands, we can no longer reset orphaned
tasks by looking at the tasks in (the memory of) the current executor.
This changes it to keep track of which (Scheduler)Job queued/scheduled a
TaskInstance (the new "queued_by_job_id" column stored against
TaskInstance table), and then we can use the existing heartbeat
mechanism for jobs to notice when a TI should be reset.
As part of this, the existing implementation of
`reset_state_for_orphaned_tasks` has been moved out of BaseJob into
BackfillJob -- as only this and SchedulerJob had these methods, and the
SchedulerJob version now operates differently.
* Add podOverride setting for KubernetesExecutor
Users of the KubernetesExecutor will now have a "podOverride"
option in the executor_config. This option will allow users to
modify the pod launched by the KubernetesExecutor using a
`kubernetes.client.models.V1Pod` class. This is the first step
in deprecating the traditional executor_config.
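A hedged example of the new option (this PR's description calls it "podOverride"; the exact executor_config key spelling used below is an assumption, and the task would normally live inside a DAG definition):

```python
from kubernetes.client import models as k8s

from airflow.operators.python import PythonOperator


def do_work():
    print("running with an overridden pod spec")


heavy_task = PythonOperator(
    task_id="heavy_task",
    python_callable=do_work,
    executor_config={
        "pod_override": k8s.V1Pod(
            spec=k8s.V1PodSpec(
                containers=[
                    k8s.V1Container(
                        name="base",
                        resources=k8s.V1ResourceRequirements(requests={"memory": "2Gi"}),
                    )
                ]
            )
        )
    },
)
```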
* Fix k8s tests
* fix docs