While checking the test status of various CI tests we came to
the conclusion that the Presto integration took a lot of memory (~1GB)
and was the main source of failures during integration tests,
especially with MySQL8. The attempt to fine-tune the memory
used led to the discovery that Presto DB stopped
publishing their Docker image (prestosql/presto) - apparently
in the aftermath of splitting off Trino from Presto.
The split-off was already discussed in #14281 and it was planned
to add support for Trino (which is the more community-driven
fork of Presto - Presto remained under Facebook governance,
whereas Trino is an effort continued by the original creators).
You can read more about it in the announcement:
https://trino.io/blog/2020/12/27/announcing-trino.html. While
Presto continues its way under The Linux Foundation, Trino
lives its own life and keeps on maintaining all artifacts and
libraries (including the image). That allowed us to update
our tests and decrease the memory footprint by around 400MB.
This commit:
* adds the new Trino provider
* removes `presto` integration and replaces it with `trino`
* the `trino` integration image is built with a 400MB lower memory
requirement and published as `apache/airflow:trino-*`
* moves the integration tests from Presto to Trino
Fixes: #14281
(cherry picked from commit eae22cec9c)
This is by far the biggest improvement of the test execution time
we can get now that we are using self-hosted runners.
This change drives down the time of executing all tests on
self-hosted runners from ~50 minutes to ~13 minutes, thanks to the
heavy parallelisation we can implement for different test types and
the fact that our machines for self-hosted runners are far more
capable - they have more CPUs, more memory, and use tmpfs for
everything.
This change will also drive the cost of our self-hosted runners
down. Since we have auto-scaling infrastructure we will simply need
the machines to run tests for a far shorter time. Since the number
of test jobs we run on those self-hosted runners is substantial
(10 jobs), we are going to save ~6 build hours per PR/merged
commit - 10 jobs saving roughly 37 minutes each adds up to about
370 minutes!
This also allows the developers to use the power of their
development machines - when you use
`./scripts/ci/testing/ci_run_airflow_testing.sh` the script
detects how many CPU cores are available and runs as many
parallel test types as you have cores.
Integration tests are a special case - they require more memory to run
all the integrations, so when there is less than ~32 GB of RAM
available to Docker, the integration tests are run sequentially
at the end. This improves stability on machines with less memory.
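Below is a minimal Python sketch of that decision logic, purely for
illustration - the real implementation is the bash script mentioned
above and it checks the memory available to Docker rather than host
memory. The 32 GB threshold and the test type names follow this
description; everything else (function name, fallbacks) is made up
for the example.

```python
# Illustrative sketch only - not the actual implementation of
# scripts/ci/testing/ci_run_airflow_testing.sh.
import os

MEMORY_NEEDED_FOR_INTEGRATION_GB = 32  # threshold described above


def plan_test_run(test_types):
    # One parallel test type per available CPU core.
    parallelism = os.cpu_count() or 1
    # Approximate "RAM available to Docker" with host memory (Linux only).
    available_gb = (
        os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 1024**3
    )
    parallel = [t for t in test_types if t != "Integration"]
    sequential = []
    if "Integration" in test_types:
        if available_gb < MEMORY_NEEDED_FOR_INTEGRATION_GB:
            sequential.append("Integration")  # run on its own, at the end
        else:
            parallel.append("Integration")
    return parallelism, parallel, sequential


print(plan_test_run(["Core", "Providers", "API", "CLI", "Integration"]))
```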
On one personal PC (64GB RAM, 8 CPUs/16 cores, fast SSD) the full
test suite execution went down from 30 minutes to 5 minutes.
Continuous progress information is printed every 10 seconds while
either parallel or sequential tests are running, and the full output is
shown at the end - failed tests are marked in red groups, and successful
ones in green groups. This makes it easier to see and analyse errors.
(cherry picked from commit 01a5d36e6b)
We are removing support for Backport Providers now.
The last release of the Backport Providers was sent yesterday - as
planned, on 17 March 2021.
As agreed before, and documented here:
https://github.com/apache/airflow/blob/master/dev/PROJECT_GUIDELINES.md#support-for-backport-providers
> Backport providers within 1.10.x, will be supported for critical fixes
for three months (March 17, 2021) from Airflow 2.0.0 release date (Dec
17, 2020).
For future reference, if anyone would like to build backport
providers by cherry-picking any fixes, the branch to start from is
`legacy-backport-cutoff-point`. The documentation and tools to build the
backports are there, but there will be no more community releases for
backports.
Good Bye Backport Providers.
(cherry picked from commit 68e4c4dcb0)
The whole Backfill class was in Heisentest but only one of those tests
is problematic nowi: test_backfill_depends_on_past. Therfore it makes
sense to remove the class from heisentests and move the
depends_on_past to quarantine.
It turned out that this is the last "Heisentest" and with the
isolation we have now coming in parallel tests, it turns out that
Heisentests are not really good way thinking about the tests - running
them in isolation does not often help, it only makes it more difficult
to flag the tests as flaky.
The quarantine test_backfill_depends_on_past ihas been captured in
the #14755 issue - and hopefully we will make an effort to
de-quarantine some of those tests soon.
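For illustration, this is roughly how a flaky test gets flagged with
the "quarantined" pytest marker used in Airflow's test suite (the test
body here is just a placeholder, not the real backfill test):

```python
# Illustrative placeholder - the real test lives in the backfill job tests.
import pytest


@pytest.mark.quarantined  # custom marker for known-flaky tests
def test_backfill_depends_on_past():
    assert True  # placeholder body
```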
(cherry picked from commit 4ce952e7c2)
The provider.yaml contains more information than is required at
runtime (specifically about documentation building). Those
fields are not needed at runtime and their presence is optional.
Also the runtime check for provider information should be more
relaxed and allow for future compatibility (not rejecting
additional properties). This way we can add new,
optional fields to provider.yaml without worrying about breaking
future compatibility of providers with future airflow versions.
This change restores 'additionalProperties': false in the
main, development-focused provider.yaml schema and introduces a
new runtime schema that is used to verify the provider info when
providers are discovered by airflow.
This 'runtime' version should change very rarely, as a change that
adds a new required property to it breaks compatibility of
providers with already released versions of Airflow.
We also trim down the provider.yaml file when preparing provider
packages to only contain those fields that are required in the
runtime schema.
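A minimal sketch of the idea, using the jsonschema package - the field
names below are illustrative and not the full provider.yaml schema:

```python
# Sketch only: a strict development schema vs. a relaxed runtime schema.
from jsonschema import ValidationError, validate

RUNTIME_SCHEMA = {
    "type": "object",
    "properties": {
        "package-name": {"type": "string"},
        "versions": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["package-name", "versions"],
    "additionalProperties": True,  # relaxed: unknown fields are tolerated
}

DEV_SCHEMA = {**RUNTIME_SCHEMA, "additionalProperties": False}  # strict

provider_info = {
    "package-name": "apache-airflow-providers-trino",
    "versions": ["1.0.0"],
    "integrations": [],  # documentation-only field, trimmed at release time
}

validate(provider_info, RUNTIME_SCHEMA)  # passes: the extra field is ignored
try:
    validate(provider_info, DEV_SCHEMA)  # fails: unknown key is rejected
except ValidationError as err:
    print("development schema rejects:", err.message)
```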
(cherry picked from commit ad2a030b9e)
Since TESTING.rst is not published on the Apache Site, we don't run
spell check on it and hence some typos were introduced without being
noticed. Time to fix them.
(cherry picked from commit 5cf2fbf124)
So far, the production images of Airflow were installed from sources
when they were built on CI. This PR changes that: we now build the
airflow + providers packages first and install them,
rather than using sources as the installation mechanism.
Part of #12261
For Kubernetes tests, all tests can be executed with the same Python
version - the default one - no matter which PYTHON_MAJOR_MINOR is
used. This is because we are testing Airflow which is deployed
via the production image. Thanks to that we can fix the python version
to be the default and avoid any python version problems (this is
especially important for cherry-picking to 1.10 where we have
python 2.7 and 3.5).
K9s is a fantastic tool that helps to debug a running k8s
instance. It is a terminal-based windowed CLI that makes you
several times more productive compared to using kubectl
commands. We've integrated k9s (it is run as a docker container
and downloaded on demand). We've also separated out the KUBECONFIG
of the integrated kind cluster so that it does not mess with the
kubernetes configuration you might already have.
Also - together with that - the "surroundings" of the kubernetes
tests were simplified and improved so that the k9s integration
can be utilized well. Instead of kubectl port forwarding (which
caused a multitude of problems) we are now utilizing kind's
portMapping feature plus a custom NodePort resource that maps
port 8080 to the 30007 NodePort, which in turn maps it to port
8080 of the Webserver. This way we do not have to establish
an external kubectl port forward, which is error-prone and needs
management - everything is brought up when Airflow gets
deployed to the Kind Cluster and shuts down when the Kind
cluster is stopped.
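For illustration, the NodePort piece of that chain could look roughly
like the object below (expressed with the kubernetes Python client;
the resource in the repository is a plain manifest and the selector
label here is hypothetical):

```python
# Hypothetical sketch of the NodePort Service described above:
# node port 30007 (reached via kind's portMapping on 8080) forwards
# to port 8080 of the Webserver pods.
from kubernetes import client

webserver_node_port = client.V1Service(
    metadata=client.V1ObjectMeta(name="airflow-webserver-node-port"),
    spec=client.V1ServiceSpec(
        type="NodePort",
        selector={"component": "webserver"},  # hypothetical label
        ports=[
            client.V1ServicePort(port=8080, target_port=8080, node_port=30007)
        ],
    ),
)
```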
Yet another problem fixed was the killing of postgres by one of the
kubernetes tests ('test_integration_run_dag_with_scheduler_failure').
Instead of just killing the scheduler, it killed all pods - including
the Postgres one (it was named 'airflow-postgres.*'). That caused
various problems, as the database could be left in a strange state.
I changed the test to do what it claimed to be doing - killing only the
scheduler during the test. This seemed to improve the stability
of tests immensely in my local setup.
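A sketch of what "killing only the scheduler" could look like with the
kubernetes Python client (the namespace and label selector here are
hypothetical - the actual test selects the pods differently):

```python
# Hypothetical sketch - delete only scheduler pods, never the Postgres pod.
from kubernetes import client, config


def kill_scheduler_pods(namespace="airflow"):
    config.load_kube_config()  # uses the KUBECONFIG of the kind cluster
    core = client.CoreV1Api()
    pods = core.list_namespaced_pod(
        namespace, label_selector="component=scheduler"
    )
    for pod in pods.items:
        core.delete_namespaced_pod(pod.metadata.name, namespace)
```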
In preparation for adding provider packages to the 2.0 line we
are renaming backport packages to provider packages.
We want to implement this in stages - first rename the
packages, then split out backport/2.0 providers as part of
the #11421 issue.
We seem to have a problem with running all tests at once - most
likely due to some resource problems in our CI, therefore it makes
sense to split the tests into more batches. This is not yet the full
implementation of selective tests but it is going in this direction
by splitting into Core/Providers/API/CLI tests. The full selective
tests approach will be implemented as part of the #10507 issue.
This split is possible thanks to #10422 which moved building the image
to a separate workflow - this way each image is only built once
and it is uploaded to a shared registry, where it is quickly
downloaded from rather than built by all the jobs separately - this
way we can have many more jobs as there is very little per-job
overhead before the tests start running.
We've observed the tests for the last couple of weeks and it seems
most of the tests marked with the "quarantine" marker are succeeding
in a stable way (https://github.com/apache/airflow/issues/10118).
The removed tests have a success ratio of > 95% (20 runs without
problems) and this was verified a week ago as well,
so it seems they are rather stable.
There are literally a few that are either failing or causing
the Quarantined builds to hang. I manually reviewed the
master tests that failed over the last few weeks and added the
tests that are causing the builds to hang.
It seems that stability has improved - which might be caused
by some temporary problems when we marked the quarantined builds,
or a too "generous" way of marking tests as quarantined, or
maybe the improvement comes from #10368, as the docker engine
and machines used to run the builds in GitHub experience far
less load (image builds are executed in separate builds) so
it might be that resource usage has decreased. Another reason
might be GitHub Actions stability improvements.
Or simply those tests are more stable when run in isolation.
We might still add failing tests back as soon as we see them behave
in a flaky way.
The remaining quarantined tests that need to be fixed:
* test_local_run (often hangs the build)
* test_retry_handling_job
* test_clear_multiple_external_task_marker
* test_should_force_kill_process
* test_change_state_for_tis_without_dagrun
* test_cli_webserver_background
We also move some of those tests to the "heisentests" category.
Those tests run fine in isolation but fail
the builds when run with all other tests:
* TestImpersonation tests
We might find that those heisentests can be fixed but for
now we are going to run them in isolation.
Also - since those quarantined tests are failing more often -
the "num runs" to track for them has been decreased to 10,
to keep track of the last 10 runs only.
The Kubernetes tests are now run using the Helm Chart
rather than the custom templates we used to have.
The Helm Chart uses the locally built production image,
so the tests are testing not only Airflow but also the
Helm Chart and the Production image - all at the
same time. Later on we will add more tests
covering more functionalities of both the Helm Chart
and the Production Image. This is the first step to
get all of those bundled together and made
testable.
This change also introduces a 'shell' sub-command
for Breeze's kind-cluster command and an
EMBEDDED_DAGS build arg for the production image -
both of them useful to run the Kubernetes tests
more easily - without building two images
and with an easy-to-iterate-over-tests
shell command - which works without any
other development environment.
Co-authored-by: Jarek Potiuk <jarek@potiuk.com>
Co-authored-by: Daniel Imberman <daniel@astronomer.io>
For a long time the way the entrypoint worked in CI scripts
was wrong. The way it worked was convoluted and little short of
black magic. It did not allow passing multiple test targets and
required separate execute-command scripts in Breeze.
This is all now straightened out, and both the production and
CI image always use the right entrypoint by default,
and we can simply pass parameters to the image as usual without
escaping strings.
This also allowed us to remove some Breeze commands and
change the names of several flags in Breeze to make them more
meaningful.
Both the CI and PROD images now have embedded scripts for log
cleaning.
A history of image releases is added for the 1.10.10-*
alpha quality images.
Tests requiring a Kubernetes Cluster are now moved out of
the regular CI tests and into the "kubernetes_tests" folder,
so that they can be run entirely on the host without having
the CI image built at all. They use the production image
to run the tests on a KinD cluster, and we add tooling
to start/stop/deploy the application to the KinD cluster
automatically - for both CI testing and local development.
This is a pre-requisite to converting the
tests to use the official Helm Chart and Docker images of
Apache Airflow.
It closes #8782
We now have a mechanism to keep release notes updated for the
backport operators in an automated way.
It really nicely generates all the necessary information:
* summary of requirements for each backport package
* list of dependencies (including extras to install them) when package
depends on other providers packages
* table of new hooks/operators/sensors/protocols/secrets
* table of moved hooks/operators/sensors/protocols/secrets with
information where they were moved from
* changelog of all the changes to the provider package (this will be
automatically updated with an incremental changelog whenever we decide
to release separate packages)
The system is fully automated - we will be able to produce release notes
automatically (per-package) whenever we decide to release a new version
of the package in the future.
The CRON job from previous runs did not have everything working
after the emergency migration to GitHub Actions.
This change brings back the following improvements:
* rebuilding images from scratch in the CRON job
* automatically upgrading all requirements to test if they are new
* pushing production images to GitHub Packages as cache
* pushing the nightly tag to GitHub
Originally Breeze was used to run unit and integration tests, more
recently system tests, and finally we are making it a bit friendlier
for testing your DAGs there. You can now install any older airflow
version in Breeze via the --install-airflow-version switch, and the
"files/dags" folder is mounted to "/files/dags" and used to read the
dags from.
This change introduces sub-commands in the breeze tool.
It is much needed, as we have many commands now
and it was difficult to separate commands from flags.
Also, the --help output was very long and unreadable.
With this change it is much easier to discover
what breeze can do for you, as well as navigate it.
Co-authored-by: Jarek Potiuk <jarek@potiuk.com>
Co-authored-by: Kamil Breguła <mik-laj@users.noreply.github.com>
We will run system tests on back-ported operators for the 1.10.* series
of Airflow, and for that we need support for running system tests using
pytest markers and reading environment variables passed from the HOST
machine (to pass credentials).
This is the first step to automating system test execution.
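As a rough illustration of that mechanism (the "system" marker mirrors
the convention described above, while the provider name, environment
variable and test body are hypothetical):

```python
# Hypothetical sketch of a system test gated by a pytest marker and
# host-provided credentials.
import os

import pytest


@pytest.mark.system("example")  # selected only when system tests are enabled
@pytest.mark.skipif(
    "EXAMPLE_API_KEY" not in os.environ,
    reason="credentials not passed from the HOST environment",
)
def test_example_system():
    api_key = os.environ["EXAMPLE_API_KEY"]
    assert api_key  # a real system test would call the external service here
```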