It seems that port forwarding during Kubernetes tests started to behave
erratically - kubectl port-forward sometimes hangs indefinitely
rather than connecting or failing.
We change the strategy a bit and try to allocate
increasing port numbers in case something like that happens.
It seems that splitting the tests into many small jobs has a bad
effect - since we only have a queue size of 180 for the whole Apache
organisation, we are competing with other projects for the jobs,
and with the jobs being so short we got starved much more than if
we had long jobs. Therefore we are re-combining the test types into
single jobs per Python version/Database version and run all the
tests sequentially on those machines.
* Separate changes/readmes for backport and regular providers
We have now separate release notes for backport provider
packages and regular provider packages.
They have different versioning - backport provider
packages with CALVER, regular provider packages with
semver.
* Added support for provider packages for Airflow 2.0
This change consists of the following changes:
* adds provider package support for 2.0
* adds generation of package readme and change notes
* versions are for now hard-coded to 0.0.1 for first release
* adds automated tests for installation of the packages
* rename backport package readmes/changes to BACKPORT_*
* adds regular package readmes/changes
* updates documentation on generating the provider packages
* adds CI tests for the packages
* maintains backport packages generation with --backports flag
Fixes #11421
Fixes #11424
We disabled duplicate cancelling on push/schedule in #11397,
but it causes a lot of extra strain when several commits
are merged in quick succession. The master merges are always
full builds and take a lot of time, but if we merge PRs
quickly, a subsequent merge cancels the previous ones.
This has the negative consequence that we might not know who
broke the master build, but this happens rarely enough to be
worth the pain in exchange for a much less strained queue in
GitHub Actions.
In preparation for adding provider packages to 2.0 line we
are renaming backport packages to provider packages.
We want to implement this in stages - first rename the
packages, then split out backport/2.0 providers as part of
the #11421 issue.
Now that we have many more jobs to run, it might happen that
when a lot of PRs are submitted one after the other there is
a longer waiting time for building the image.
There is only one waiting job per image type, so it does not
cost much to wait a bit longer in order to avoid cancellation
after 50 minutes of waiting.
This is the final step of implementing #10507 - selective tests.
Depending on the files changed by the incoming commit, only a subset
of the tests is executed. The conditions below are evaluated in the
sequence in which they are listed (see the sketch after the list):
* In case of "push" and "schedule" type of events, all tests
are executed.
* If no important files and folders changed - no tests are executed.
This is a typical case for doc-only changes.
* If any of the environment files (Dockerfile/setup.py etc.) changed,
  all tests are executed.
* If no "core/other" files are changed, only the relevant types
of tests are executed:
* API - if any of the API files/tests changed
* CLI - if any of the CLI files/tests changed
* WWW - if any of the WWW files/tests changed
* Providers - if any of the Providers files/tests changed
  * Integration, Heisentests, Quarantined, Postgres and MySQL
    runs are always executed unless all tests are skipped, as in
    the case of doc-only changes.
* If "Kubernetes" related files/tests are changed, the
"Kubernetes" tests with Kind are run. Note that those tests
are run separately using Host environment and those tests
are stored in "kubernetes_tests" folder.
* If some of the core/other files change, all tests are run. This
  is calculated by subtracting the file counts calculated
  above from the total count of important files.
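A minimal sketch of how such a decision could be wired into the
workflow, assuming a hypothetical "build-info" job and a
selective-checks script that emits the flags as outputs (job names,
outputs and script paths here are illustrative, not the actual
Airflow CI definitions):

build-info:
  runs-on: ubuntu-latest
  outputs:
    run-tests: ${{ steps.selective-checks.outputs.run-tests }}
    run-kubernetes-tests: ${{ steps.selective-checks.outputs.run-kubernetes-tests }}
  steps:
    - uses: actions/checkout@v2
    - name: "Selective checks"
      id: selective-checks
      # inspects the changed files and emits ::set-output flags
      # according to the rules listed above
      run: ./scripts/ci/selective_ci_checks.sh
tests-postgres:
  needs: [build-info]
  # skipped entirely for doc-only changes
  if: needs.build-info.outputs.run-tests == 'true'
  runs-on: ubuntu-latest
  steps:
    - run: ./scripts/ci/testing/ci_run_airflow_testing.sh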
Fixes: #10507
We seem to have a problem with running all tests at once - most
likely due to some resource problems in our CI - therefore it makes
sense to split the tests into more batches. This is not yet the full
implementation of selective tests, but it goes in this direction
by splitting into Core/Providers/API/CLI tests. The full selective
tests approach will be implemented as part of issue #10507.
This split is possible thanks to #10422, which moved building the
image to a separate workflow. This way each image is only built once
and uploaded to a shared registry, from which it is quickly
downloaded rather than being built by each job separately. As a
result we can have many more jobs, because there is very little
per-job overhead before the tests start running.
The SHA of cancel-workflow-action in #11397 was pointing to the
previous (3.1) version of the action. This PR fixes it to point to
the right (3.2) version.
A problem was introduced in #11397 where too many "Build Image"
jobs are being cancelled by a subsequent Build Image run. For now it
cancels all the Build Image jobs that are running :(.
The push and schedule builds should not be cancelled even if
they are duplicates. By seeing which of the master merges
failed, we have better visibility on which merge caused
a problem and we can trace its origin faster, even if the builds
take longer overall.
Scheduled builds also serve their purpose and they should
always be run to completion.
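A hedged sketch of limiting cancellation to PR-triggered source runs;
the event check and the "duplicates" cancel mode are assumptions
about how this could be expressed with the cancel-workflow-runs
action, not the exact step from the workflow:

- name: "Cancel duplicated CI Build runs"
  # push and schedule source runs are never cancelled - only
  # pull_request-triggered duplicates are
  if: github.event.workflow_run.event == 'pull_request'
  uses: potiuk/cancel-workflow-runs@v2
  with:
    token: ${{ secrets.GITHUB_TOKEN }}
    cancelMode: duplicates
    sourceRunId: ${{ github.event.workflow_run.id }}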
Replaces the annoying comments containing "workflow_run" links
with Run Checks. Now we will be able to see the "Build Image"
checks in the "Checks" section, including their status and a direct
link to the steps running the image builds as the "Details" link.
Unfortunately GitHub Actions does not handle the links to
details well - even if you provide a details_url linking to the
other run, the "Build Image" checks appear in the original workflow.
That's why we had to introduce another link in the summary of
the Build Image check that links to the actual workflow.
The PR builds are now better handled with regard to both
running (using the merge commit) and canceling (with cancel notifications).
First of all, we are using the merge commit from the PR, not the
original commit from the PR.
Secondly, the workflow run notifies the original PR with a comment
stating that the image is being built in a separate workflow,
including a link to that workflow.
Thirdly, when canceling duplicate PRs or PRs with failed
jobs, the workflow will add a comment to the PR stating the
reason why the PR is being cancelled.
Last but not least, we also add a cancel job for the CodeQL
duplicates. They run for ~12 minutes, so it makes perfect sense to
also cancel those CodeQL jobs when someone pushes fixups in
quick succession.
Fixes: #10471
It has been raised quite a few times that the workflows added in
forked repositories might be pretty invasive for the forks -
especially when it comes to scheduled workflows, as they might eat
quota or at least jobs for the organisations/people who fork the
repositories.
This is not strictly necessary, because GitHub recently recognized
this as a problem and introduced new rules for scheduled workflows.
But for people who have already forked, it would be nice not to run
those actions. It is enough that the CodeQL check is done when a PR
is opened against the "apache/airflow" repository.
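A minimal sketch of the kind of guard this implies (the job layout
is illustrative; the condition uses the standard github.repository
context):

codeql:
  # run CodeQL only in the canonical repository - forks skip it and
  # the check still runs for PRs opened against apache/airflow
  if: github.repository == 'apache/airflow'
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v2
    - uses: github/codeql-action/init@v1
      with:
        languages: python
    - uses: github/codeql-action/analyze@v1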
Quote from the emails received from GitHub (no public URL explaining it yet):
> Scheduled workflows will be disabled by default in forks of public repos and in
public repos with no activity for 60 consecutive days. We’re making two
changes to the usage policy for GitHub Actions. These changes will enable
GitHub Actions to scale with the incredible adoption we’ve seen from the GitHub
community. Here’s a quick overview:
> * Starting today, scheduled workflows will be disabled by default in new forks of
public repositories.
> * Scheduled workflows will be disabled in public repos with
no activity for 60 consecutive days.
In very rare cases, the waiting job might not be cancelled when
the "Build Image" job fails or gets cancelled on its own.
In the "Build Image" workflow we have this step:
- name: "Canceling the CI Build source workflow in case of failure!"
if: cancelled() || failure()
uses: potiuk/cancel-workflow-runs@v2
with:
token: ${{ secrets.GITHUB_TOKEN }}
cancelMode: self
sourceRunId: ${{ github.event.workflow_run.id }}
But when this step fails or gets cancelled on its own before
cancel is triggered, the "wait for image" steps could
run for up to 6 hours.
This change sets a 50-minute timeout for those jobs.
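The fix boils down to an explicit job-level timeout on the waiting
jobs, roughly like this (job name and script path are illustrative):

wait-for-ci-images:
  name: "Wait for CI images"
  runs-on: ubuntu-latest
  # fail after 50 minutes instead of idling for the default
  # 360 minutes if the "Build Image" workflow never cancels this run
  timeout-minutes: 50
  steps:
    - name: "Wait for the image to appear in the registry"
      run: ./scripts/ci/images/ci_wait_for_ci_image.sh   # illustrative placeholder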
Fixes #11114
I noticed that when there are no setup.py changes, the constraints
are not upgraded automatically. This is because of the docker
caching strategy used - it simply does not even know that the
pip upgrade should happen.
I believe it is really good (from a security and incremental-updates
point of view) to attempt the upgrade at every successful merge.
Note that the upgrade will not be committed if any of the tests
fail, and this only happens on merges to master or scheduled runs.
This way we will have more frequent but smaller constraint changes.
Depends on #10828
We introduced deletion of the old artifacts as this was
the suspected culprit of Kubernetes Job failures. It turned out
eventually that those Kubernetes Job failures were caused by
the #11017 change, but it's good to do housekeeping of the
artifacts anyway.
The delete workflow action introduced in a hurry had a few problems:
* it runs for every fork if they sync master. This is a bit
  too invasive
* it fails continuously after 10 - 30 minutes every time,
  as we have too many old artifacts to delete (GitHub has a
  90 days retention policy, so we likely have tens of
  thousands of artifacts to delete)
* it runs every hour and causes occasional API rate limit
  exhaustion (because we have too many artifacts to loop through)
This PR introduces filtering by repo and changes the frequency
of deletion to 4 times a day. A back-of-the-envelope calculation
tops out at around 2500 artifacts to delete per run, so we have a
low risk of reaching the 5000 API calls/hr rate limit. It also adds
a script that we are running manually to delete the excessive
artifacts now. Eventually, when the number of artifacts goes down,
the regular job should delete maybe a few hundred artifacts
appearing within the 6-hour window in normal circumstances, and it
should stop failing then.
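A hedged sketch of the schedule and repo guard described above (the
cleanup action and the cron value are illustrative assumptions, not
the exact workflow):

name: "Delete old artifacts"
on:
  schedule:
    - cron: '27 */6 * * *'   # four times a day instead of hourly
jobs:
  delete-artifacts:
    # do not run in forks that merely sync master
    if: github.repository == 'apache/airflow'
    runs-on: ubuntu-latest
    steps:
      - uses: kolpav/purge-artifacts-action@v1   # example third-party cleanup action
        with:
          token: ${{ secrets.GITHUB_TOKEN }}
          expire-in: 7days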
GitHub Actions allows using the `fromJson` function to read arrays
or even more complex JSON objects into the CI workflow yaml files.
This, combined with set-output commands, allows us to read the
list of allowed versions, as well as the default ones, from the
environment variables configured in
./scripts/ci/libraries/initialization.sh
This means that we can have one place in which the versions are
configured. We also need to do it in "breeze-complete", as this is
a standalone script that should not source anything; we added
BATS tests to verify that the versions in breeze-complete
correspond with those defined in the initialization.sh
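A minimal sketch of the pattern (job names, outputs and the version
list are illustrative):

build-info:
  runs-on: ubuntu-latest
  outputs:
    python-versions: ${{ steps.versions.outputs.python-versions }}
  steps:
    - id: versions
      name: "Compute version matrix"
      # illustrative: the JSON array would come from the variables
      # defined in ./scripts/ci/libraries/initialization.sh
      run: echo '::set-output name=python-versions::["3.6", "3.7", "3.8"]'
tests:
  needs: [build-info]
  runs-on: ubuntu-latest
  strategy:
    matrix:
      python-version: ${{ fromJson(needs.build-info.outputs.python-versions) }}
  steps:
    - run: echo "Running tests on Python ${{ matrix.python-version }}"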
Also, we do not limit the tests any more in regular PRs - we run
all combinations of the available versions. Our tests run quite a
bit faster now, so we should be able to run more complete
matrices. We can still exclude individual values of the matrices
if this is too much.
MySQL 8 is disabled from Breeze for now. I plan a separate follow-up
PR where we will run MySQL 8 tests (they have not been run so far).
When we ported the new CI mechanism to v1-10-test it turned out
that we have to correct the retrieval of DEFAULT_BRANCH
and DEFAULT_CONSTRAINTS_BRANCH.
Since we are building the images using the "master" scripts, we need to
make sure the branches are retrieved from the _initialization.sh of the
incoming PR, not from the one in the master branch.
Additionally, the 2.7 and 3.5 version builds have to be merged to
master and excluded when the build is run targeting the master branch.
The cache in GitHub Actions is immutable - once you create it,
it cannot be modified. That's why cache keys should contain a
hash of all the files that are used to create the cache.
The Kubernetes cache key did not contain it, and as a side effect
the cache created from the master kubernetes setup.py was used in
v1-10-test after the Breeze changes were cherry-picked.
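A sketch of the kind of key this implies for the Kubernetes cache
(path and key prefix are illustrative):

- uses: actions/cache@v2
  with:
    path: .build/.kubernetes_venv
    # include a hash of the files the cache is built from, so that
    # v1-10-test does not silently reuse a cache created from
    # master's setup.py
    key: cache-kubernetes-venv-${{ hashFiles('setup.py') }}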
We've observed the tests for the last couple of weeks and it seems
most of the tests marked with the "quarantine" marker are succeeding
in a stable way (https://github.com/apache/airflow/issues/10118).
The removed tests have a success ratio of > 95% (20 runs without
problems), and this was also verified a week ago,
so it seems they are rather stable.
There are literally a few that are either failing or causing
the Quarantined builds to hang. I manually reviewed the
master tests that failed over the last few weeks and added the
tests that are causing the builds to hang.
It seems that stability has improved - which might be caused
by some temporary problems when we marked the quarantined builds,
or a too "generous" way of marking tests as quarantined, or
maybe the improvement comes from #10368, as the docker engine
and machines used to run the builds in GitHub experience far
less load (image builds are executed in separate builds), so
it might be that resource usage has decreased. Another reason
might be GitHub Actions stability improvements.
Or simply those tests are more stable when run in isolation.
We might still add failing tests back as soon as we see them behave
in a flaky way.
The remaining quarantined tests that need to be fixed:
* test_local_run (often hangs the build)
* test_retry_handling_job
* test_clear_multiple_external_task_marker
* test_should_force_kill_process
* test_change_state_for_tis_without_dagrun
* test_cli_webserver_background
We also move some of those tests to the "heisentests" category.
Those tests run fine in isolation but fail
the builds when run with all other tests:
* TestImpersonation tests
We might find that those heisentests can be fixed, but for
now we are going to run them in isolation.
Also - since those quarantined tests are failing more often,
the "num runs" to track for them has been decreased to 10,
to keep track of the last 10 runs only.
With recent refactors, the nightly tag was not pushed on the
scheduled event because it depended on pushing images
to the GitHub registry. Pushing images to the GitHub registry is
skipped on scheduled builds, so pushing the tag was also skipped.
You can now define a secret in your own fork:
AIRFLOW_GITHUB_REGISTRY_WAIT_FOR_IMAGE
If you set it to "false", building images in the separate
workflow_run is skipped - images will be built in the jobs of the
CI Build run and they won't be pushed to the registry.
Note - you can't have secrets starting with GITHUB_, that's why
the AIRFLOW_* prefix.
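A hedged sketch of how the secret could gate the waiting step - the
secret is passed through a job-level environment variable because
the secrets context cannot be used directly in "if" conditions (the
wiring and script path are assumptions, not the exact workflow code):

wait-for-ci-image:
  runs-on: ubuntu-latest
  env:
    WAIT_FOR_IMAGE: ${{ secrets.AIRFLOW_GITHUB_REGISTRY_WAIT_FOR_IMAGE }}
  steps:
    - name: "Wait for the CI image"
      # skipped entirely when the fork sets the secret to "false"
      if: env.WAIT_FOR_IMAGE != 'false'
      run: ./scripts/ci/images/ci_wait_for_ci_image.sh   # illustrative placeholder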
In normal circumstances those jobs will wait for a short time
(4-15 minutes depending on the state of the base image).
However, there might be cases when there are a lot of jobs
or some queueing problems in GitHub, and the "Build Images"
job will be queued and not start quickly.
This happened, for example, on the 24th of August 2020, when several
jobs failed because the "Build Image" run was queued and only
ran after the "CI Build" job timed out.
Usually those situations tend to be resolved by GitHub support,
or they resolve themselves as the jobs finish and
free the queue. However, in those cases we should give the
waiting job as much time as GitHub Actions allows by default
for a job to run (360 minutes). There is no harm in this - we can
always cancel those jobs manually, and there are just two such
jobs running, so it should not cause any problem.
Note that if the job runs for a long time, the contributor will
likely push an amended commit, which will also cancel the
waiting job, so long-running waiting jobs are even less likely.
We had to enable mounting from sources for a short while
because we had to find a way to add new scripts to the
"workflow_run" workflow we have. This also requires
#10470 to be merged - perf_kit has to be moved to tests.utils,
because it was in a separate directory and the image without
mounted sources could not run the tests.
It also partially addresses the #10445 problem, where
there was a difference between the sources in the image and those
coming from master. This comes from GitHub running a merge of the
non-conflicting changes in the PR, and it is something that will
be addressed shortly.
Issue #10471 discusses this in detail.
We do not have to rebuild PROD images now because we changed
the strategy of preparing the image for k8s tests: instead of
embedding dags during the build with the EMBEDDED_DAGS build arg,
we now extend the image with a FROM: clause and add the dags
on top of the PROD base image. We've been rebuilding the
image twice during each k8s run - once in "Prepare PROD image"
and once in "Deploy airflow to cluster". Neither is needed
and both lasted ~2m 30s, so we should save around 5m for every
K8S job (~30%, as the whole K8S test job is around 15m).
* Change Support Request template to a link to Slack
* Update .github/ISSUE_TEMPLATE/config.yml
Co-authored-by: Kaxil Naik <kaxilnaik@gmail.com>
There is still one build running regularly for forks. Even though
we disabled all scheduled runs in #10448, there is one remaining
case with the nightly build that we should disable.
The run is the "workflow_run"
executed for the nightly scheduled "CI Build" run that still gets
triggered.
This change skips those runs in forks in case the "source event"
is "schedule".
After #10368, we've changed the way we build the images
on CI. We are overriding the ci scripts that we use
to build the image with the scripts taken from master,
to not give rogue PR authors the possibility to run
something with the write credentials.
We should not override the in_container scripts, however,
because they become part of the image, so we should use
those that came with the PR. That's why we have to move
the "in_container" scripts out of the "ci" folder and
only override the "ci" folder with the one from
master. We've made sure that the scripts in ci
are self-contained and do not need to reach outside of
that folder.
Also the static checks are done with local files mounted
on CI because we want to check all the files - not only
those that are embedded in the container.
- refactor/change azure_container_instance to use AzureBaseHook
- add info to operators-and-hooks-ref.rst
- add howto docs for connecting to azure
- add auth mechanism via json config
- add azure conn type