After merging the constraints, the 'recursive' mode was not added
to checkout, resulting in the GitHub push action not being checked out.
This commit fixes it and adds color to the diff output in the commit step
to better show differences when pushing.
(cherry picked from commit 6b78394617)
They use the same python image as master (as already mentioned in the
comments in ci_prepare_prod_image_on_ci.sh) so we don't want to try
and push the python image when we aren't building the main branch.
(cherry picked from commit f94effeab1)
After stabilizing the builds on CI, the master builds finally started
to get green - except pushing to the prod image cache, which
continued to fail because the python image to push was missing.
This PR fixes it.
(cherry picked from commit 1dfbb8d203)
The PROD image of airflow is OpenShift compatible and it can be
run with either 'airflow' user (UID=50000) or with any other
user with (GID=0).
This change adds umask 0002 to make sure that whenever the image
is extended and new directories get created, the directories are
group-writeable for GID=0. This is added in the default
entrypoint.
The entrypoint will fail if it is run neither as the airflow user
nor as an arbitrary user with GID=0.
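A rough sketch of the checks described above (an illustration of the behaviour, not the verbatim entrypoint code):
```bash
# Illustration only - the real entrypoint is more elaborate.
if [[ "$(id -u)" != "50000" && "$(id -g)" != "0" ]]; then
    echo "ERROR: run the image as the 'airflow' user (UID=50000) or as any user with GID=0" >&2
    exit 1
fi
# Make directories created when extending the image group-writeable for GID=0.
umask 0002
```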
Fixes: #15107
(cherry picked from commit ce91872ecc)
While checking the test status of various CI tests we came to
conclusion that Presto integration took a lot of memory (~1GB)
and was the main source of failures during integration tests,
especially with MySQL8. The attempt to fine-tune the memory
used led to the discovery that Presto DB stopped
publishing their Docker image (prestosql/presto) - apparently
in the aftermath of splitting off Trino from Presto.
The split-off was already discussed in #14281 and it was planned
to add support for Trino (which is the more community-driven
fork of Presto - Presto remained under Facebook governance,
whereas Trino is an effort continued by the original creators).
You can read more about it in the announcement:
https://trino.io/blog/2020/12/27/announcing-trino.html. While
Presto continues their way under The Linux Foundation, Trino
lives its own life and keeps on maintaining all artifacts and
libraries (including the image). That allowed us to update
our tests and decrease the memory footprint by around 400MB.
This commit:
* adds the new Trino provider
* removes `presto` integration and replaces it with `trino`
* the `trino` integration image is built with 400MB lower memory
  requirements and published as `apache/airflow:trino-*`
* moves the integration tests from Presto to Trino
Fixes: #14281
(cherry picked from commit eae22cec9c)
Originally, the constraints were generated in separate jobs, uploaded as
artifacts and then joined by a separate push job. Thanks to parallel
processing, we can now do that all in a single job, with both cost and
time savings.
(cherry picked from commit aebacd7405)
K8S has a one-year support policy. This PR updates the
K8S versions we use for testing to the latest available in the three
currently supported K8S versions: 1.20, 1.19, 1.18.
The 1.16 and 1.17 versions are not supported any more as of today.
https://en.wikipedia.org/wiki/Kubernetes
This change also bumps kind to the latest version (we use kind for
K8S testing) and fixes the configuration to match this version.
(cherry picked from commit 36ab9dd7c4)
Not all docker commands are replaced with functions now.
Earlier we replaced all docker commands with a function to be able
to capture the docker commands used and display them with -v for Breeze.
This has proven to be harmful, as it is unexpected behaviour
for a docker command.
This change introduces a docker_v command which outputs the command
when needed.
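A minimal sketch of such a wrapper (the real function in the Breeze scripts may differ in details):
```bash
# Prints the docker command when verbose mode is enabled, then runs it.
function docker_v() {
    if [[ ${VERBOSE:="false"} == "true" ]]; then
        echo "docker $*"
    fi
    docker "$@"
}

docker_v build . --tag example-image   # with VERBOSE=true this also echoes the command
```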
(cherry picked from commit 535e1a8e69)
Some of the test jobs are hanging - either because of some
weird race conditions in docker or because the test hangs (this happens
for quarantined tests). This change adds a maximum timeout of 25 minutes
for the test suite execution.
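A rough sketch of how such a limit can be enforced (assuming GNU coreutils `timeout`; the actual test invocation in CI differs):
```bash
# Kill the test run if it has not finished within 25 minutes.
timeout --kill-after=1m 25m pytest tests/ || echo "Test suite failed or timed out"
```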
(cherry picked from commit a4aee3f1d0)
This is far more complex than it should be because of
autoapi problems with parallel execution. Unfortunately autoapi
does not cope well when several autoapis are run in parallel on
the same code - even if they are run in separate processes and
for different packages. Autoapi uses common _doctree and _api
directories generated in the source code and they override
each other if two or more of them run in parallel.
The solution in this PR is mostly applicable for CI environment.
In this case we have docker images that have been already built
using current sources so we can safely run separate docker
containers without mapping the sources and run generation
of documentation separately and independently in each container.
This seems to work really well, speeding up docs generation
2x in public GitHub runners and 8x in self-hosted runners.
Public runners:
* 27m -> 15m
Self-hosted runners:
* 27m -> < 8m
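A rough sketch of the approach (the image variable, paths and package list below are illustrative assumptions):
```bash
# Each container builds docs for one package independently, avoiding the
# shared _doctree/_api directories that break parallel autoapi runs.
for package in apache-airflow apache-airflow-providers-google docker-stack; do
    docker run --rm "${AIRFLOW_CI_IMAGE}" \
        python /opt/airflow/docs/build_docs.py --package-filter "${package}" &
done
wait
```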
(cherry picked from commit 741a54502f)
* Fixes problem with two different files md5sum-ed under the same name
When we check whether we should rebuild the image, we check if the
md5sum of some important files changed - which triggers the question
of whether to rebuild the image (because of changed dependencies
which need to be installed). This happens for example when
package.json or yarn.lock changes.
Previously, all the important files had distinct names, so
we stored the md5 hashes of those files with just their filenames
+ .md5sum, but they were flattened to a single directory. Unfortunately,
as of #14927 (merged with a failing build) we had two package.json
and two yarn.lock files, and the md5 hash of one was overwritten
by the other. This triggered unnecessary rebuilding of the image
in the CI part, which resulted in failure (because of an Apache Beam
dependency problem).
This PR fixes it by adding the parent directory to the name of
the md5sum file (so we now have www-package.json and ui-package.json).
Those important files change very rarely, so this incident
should not happen again, but we added some comments to prevent it.
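A rough illustration of the naming fix (the cache directory variable and the helper name are hypothetical):
```bash
# Store hashes as <parent-dir>-<filename>.md5sum so that files with the same
# name in different directories no longer overwrite each other.
save_md5sum() {
    local file="$1"
    local parent
    parent=$(basename "$(dirname "${file}")")
    md5sum "${file}" > "${MD5SUM_CACHE_DIR}/${parent}-$(basename "${file}").md5sum"
}

save_md5sum "airflow/www/package.json"   # -> www-package.json.md5sum
save_md5sum "airflow/ui/package.json"    # -> ui-package.json.md5sum
```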
* Update scripts/ci/libraries/_initialization.sh
Co-authored-by: Felix Uellendall <feluelle@users.noreply.github.com>
(cherry picked from commit 775ee51d0e)
The change #14911 had a bug - when PYTHON_MINOR_MAJOR_VERSION
was removed from the image args, the replacement `python -m site`
expression missed the `/airflow/` suffix. Unfortunately it was not
flagged as an error because the recompiling script silently
skipped the recompilation step in that case.
This change:
* fixes the error
* removes the silent-skipping if check (the recompilation will
fail in case it is wrongly set)
* adds a check at image verification that dist/manifest.json is
  present.
Fixes: #14991
This PR skips building Provider packages for branches other
than master. Provider packages are always released from master
and never from any other branch, so there is no point in running
the package building and tests there.
(cherry picked from commit df368f17df361af699dc868af9481ddc3abf0416)
Previously you had to specify AIRFLOW_VERSION_REFERENCE and
AIRFLOW_CONSTRAINTS_REFERENCE to point to the right version
of Airflow. Now those values are auto-detected if not specified
(but you can still override them).
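The detection can be imagined roughly like the sketch below (variable names match the ones mentioned above, but the exact rule in the Dockerfile is not reproduced here and the dev-version check is an assumption):
```bash
# Derive the constraints reference from the Airflow version unless overridden.
if [[ -z "${AIRFLOW_CONSTRAINTS_REFERENCE:-}" ]]; then
    if [[ ${AIRFLOW_VERSION} == *"dev"* ]]; then
        AIRFLOW_CONSTRAINTS_REFERENCE="constraints-master"
    else
        AIRFLOW_CONSTRAINTS_REFERENCE="constraints-${AIRFLOW_VERSION}"
    fi
fi
```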
This change allowed us to simplify and restructure the Dockerfile
documentation - following the recent change in separating out
the docker-stack, the production image building documentation has
been improved to reflect those simplifications. It should be
much easier to grasp for novice users now - a very clear
distinction and separation is made between the two types of
building your own images - customizing or extending - and it
is now much easier to follow the examples and find out how to
build your own image. The criteria for which approach to
choose were put first and foremost.
Examples have been reviewed, fixed and put in a logical
sequence, from the most basic ones to the most advanced,
with a clear indication where the basic approach ends and where
the "power-user" one starts. The examples were also separated
out to separate files and included from there - also the
example Docker images and build commands are executable
and tested automatically in CI, so they are guaranteed
to work.
Finally, the build arguments were split into sections - from the most
basic to the most advanced - and each section links to the appropriate
example section, showing how to use those parameters.
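For illustration, a "customizing" build driven purely by build args might look like this (the build-arg names are examples and may not match the current Dockerfile exactly):
```bash
# Build a customized production image via build args, without editing the Dockerfile.
docker build . \
    --build-arg AIRFLOW_VERSION="2.0.1" \
    --build-arg ADDITIONAL_PYTHON_DEPS="pandas" \
    --tag my-company/airflow:2.0.1-custom
```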
Fixes: #14848, #14255
The production image did not have root group set as default for
the airflow user. This was not a big problem unless you extended
the image - in which case you had to change the group manually
when copying the images in order to keep the image OpenShift
compatible (i.e. runnable with any user and root group).
This PR fixes it by changing the default group of the airflow user
to root, which also works when you extend the image.
```
Connected.
airflow@53f70b1e3675:/opt/airflow$ ls
dags logs
airflow@53f70b1e3675:/opt/airflow$ cd dags/
airflow@53f70b1e3675:/opt/airflow/dags$ ls -l
total 4
-rw-r--r-- 1 airflow root 1648 Mar 22 23:16 test_dag.py
airflow@53f70b1e3675:/opt/airflow/dags$
```
So far we had a matrix of builds that verified images - each
image was verified by a separate matrix-based job, and those
verifications were run after all images were already available.
This change optimizes it: those steps are run in the same job
as "waiting for image", and they run in parallel, which makes
them a bit faster.
This verification is fast and it can be run on any machine
in parallel without any problems.
(cherry picked from commit c59ab1ddcd)
This is by far the biggest improvement of the test execution time
we can get now that we are using self-hosted runners.
This change drives down the time of executing all tests on
self-hosted runners from ~ 50 minutes to ~ 13 minutes, due to the heavy
parallelisation we can implement for different test types and the
fact that our machines for self-hosted runners are far more
capable - they have more CPU, more memory, and we are using
tmpfs for everything.
This change will also drive the cost of our self-hosted runners
down. Since we have auto-scaling infrastructure, we will simply need
the machines to run tests for a far shorter time. Since the number
of test jobs we run on those self-hosted runners is substantial
(10 jobs), we are going to save ~ 6 build hours per PR/merged
commit!
This also allows the developers to use the power of their
development machines - when you use
`./scripts/ci/testing/ci_run_airflow_testing.sh` the script
detects how many CPU cores are available and runs as
many parallel test types as you have cores.
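A minimal sketch of the idea (the test-type list and the helper script name are illustrative; the real logic in `ci_run_airflow_testing.sh` is more elaborate):
```bash
# Run one test type per available CPU core in parallel.
available_cores=$(nproc)
test_types="Core API CLI Providers WWW Integration"
echo "${test_types}" | tr ' ' '\n' | \
    xargs -P "${available_cores}" -I {} ./scripts/ci/testing/run_single_test_type.sh {}
```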
Integration tests are a special case - they require more memory to run
all the integrations, so if there is less than ~ 32 GB of RAM
available to Docker, the integration tests are run sequentially
at the end. This drives stability up for machines with lower memory.
On one personal PC (64GB RAM, 8 CPUS/16 cores, fast SSD) the full
test suite execution went down from 30 minutes to 5 minutes.
Continuous progress information is printed every 10 seconds when
either parallel or sequential tests are run, and the full output is
shown at the end - failed tests are marked in red groups, and successful ones
are marked in green groups. This makes it easier to see and analyse errors.
(cherry picked from commit 01a5d36e6b)
When Breeze is run, it requires some resources in the docker
engine, otherwise it will produce strange errors.
This PR adds a resource check when running Breeze - it prints a
human-friendly summary of the CPU/Memory/Disk available to the docker
engine and a red error (while still allowing Breeze to run) when the
resources are not enough.
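A rough sketch of such a check (the thresholds and exact commands are assumptions, not the actual Breeze implementation):
```bash
# Query the docker engine for available CPUs and memory and warn when low.
cpus=$(docker info --format '{{.NCPU}}')
mem_bytes=$(docker info --format '{{.MemTotal}}')
if (( cpus < 2 )) || (( mem_bytes < 4 * 1024 * 1024 * 1024 )); then
    echo -e "\e[31mERROR: Docker has fewer than 2 CPUs or less than 4GB of memory available.\e[0m"
    echo "Breeze will continue, but you may experience problems."
fi
```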
Fixes: #14899
(cherry picked from commit 3dd42a5a3f)
We are removing support for Backport Providers now.
The last release of the Backport Providers was sent yesterday -
as planned, on 17 March 2021.
As agreed before, and documented here:
https://github.com/apache/airflow/blob/master/dev/PROJECT_GUIDELINES.md#support-for-backport-providers
> Backport providers within 1.10.x, will be supported for critical fixes
for three months (March 17, 2021) from Airflow 2.0.0 release date (Dec
17, 2020).
For future reference, if anyone would like to build backport
providers with cherry-picking any fixes, the branch to start from is
`legacy-backport-cutoff-point`. The documentation and tools to build the
backports are there, but there will be no more community releases for
backports.
Good Bye Backport Providers.
(cherry picked from commit 68e4c4dcb0)
This is by far the biggest improvement of the test execution time
we can get now that we are using self-hosted runners.
This change drives down the time of executing all tests on
self-hosted runners from ~ 50 minutes to ~ 13 minutes, due to the heavy
parallelisation we can implement for different test types and the
fact that our machines for self-hosted runners are far more
capable - they have more CPU, more memory, and we are using
tmpfs for everything.
This change will also drive the cost of our self-hosted runners
down. Since we have auto-scaling infrastructure, we will simply need
the machines to run tests for a far shorter time. Since the number
of test jobs we run on those self-hosted runners is substantial
(10 jobs), we are going to save ~ 6 build hours per PR/merged
commit!
This also allows the developers to use the power of their
development machines - when you use
`./scripts/ci/testing/ci_run_airflow_testing.sh` the script
detects how many CPU cores are available and runs as
many parallel test types as you have cores.
Integration tests are a special case - they require more memory to run
all the integrations, so if there is less than ~ 32 GB of RAM
available to Docker, the integration tests are run sequentially
at the end. This drives stability up for machines with lower memory.
On one personal PC (64GB RAM, 8 CPUS/16 cores, fast SSD) the full
test suite execution went down from 30 minutes to 5 minutes.
Continuous progress information is printed every 10 seconds when
either parallel or sequential tests are run, and the full output is
shown at the end - failed tests are marked in red groups, and successful ones
are marked in green groups. This makes it easier to see and analyse errors.
(cherry picked from commit 5539069ea5)
The Parallel tests from #14531 created a good opportunity to
reproduce some of the race conditions that cause some of the
scheduler job tests to be flaky.
This change is an attempt to fix three of the flaky tests
there by removing side effects between tests. The previous
implementation did not take into account that scheduler job
processes might still be running when the test finishes and
the tests could have unintended side effects - especially
when they were run on a busy machine.
This PR adds a mechanism that stops all running
SchedulerJob processes in tearDown before cleaning
the database.
Fixes: #14778, #14773, #14772, #14771, #11571, #12861, #11676, #11454, #11442, #11441
(cherry picked from commit 45cf89ce51)
There are many more references to "master" (even in our own repo) than
this, but this commit is the first step in that process.
It makes CI run on the main branch (once it exists) and re-words a few
cases where we can easily avoid referring to master.
This doesn't yet re-name the `constraints-master` or `master-*` images -
that will be done in a future PR.
(We won't be able to entirely eliminate "master" from our repo as we
refer to a lot of other GitHub repos that we can't change.)
(cherry picked from commit 0dea083fcb)
The base python image is only updated when manually triggered and
when checking for upgraded dependencies in the master build.
While automated upgrade to latest Python image is good for
security, it can cause a number of problems when run automatically
in the CI:
* cache invalidation - thus longer builds
* sudden test failures
This has already happened quite a number of times in the past, so it
is time to switch to a slightly different mode. Python images will only
be automatically upgraded in these cases:
1) When the master CI build is run as a scheduled nightly build - to check
that tests still pass for the latest version of the image
2) When manually refreshed with --force-pull-base-python-image
3) When DockerHub official images (from tags) are built.
The procedure to refresh the images manually in our CI has been
added to the documentation.
(cherry picked from commit 4762396b8b)
Sometimes (very rarely) some 'wait for image' pulling steps
loop forever (while other steps from parallel jobs pulling the
same image have no problems).
Example - failed step here:
* https://github.com/apache/airflow/runs/2106723280?check_suite_focus=true#step:5:349
Another similar step in parallel job had no problems with retrieving the
same image earlier:
* https://github.com/apache/airflow/runs/2106723269?check_suite_focus=true#step:5:119
Both jobs pulled the same image:
docker.pkg.github.com/apache/airflow/master-python3.6-ci-v2:651461418
This change adds diagnostics information that might provide more
information in case this happens again so that we can understand
what is going on and mitigate the issue.
(cherry picked from commit 4cde47b339)
Sometimes a base python image patchlevel change might cause tests to fail.
This happened for example with the test_views.py tests fixed in #14719,
where a CVE fix in all python versions caused our tests to fail.
There was an error in our scripts - when --force-pull-images
was used, the base python version was not updated to the
latest version even if there was a newer one, which caused our
images to bounce a few times between the two latest patchlevels
when they were manually refreshed.
This change fixes it so that the latest base python image is always
pulled when
a) FORCE_PULL_IMAGES is true or
b) UPGRADE_TO_NEWER_DEPENDENCIES is != false
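Expressed as a sketch (variable names follow the commit message; the image variable and the pull command are assumptions):
```bash
# Refresh the base python image only in the two cases above.
if [[ ${FORCE_PULL_IMAGES} == "true" || ${UPGRADE_TO_NEWER_DEPENDENCIES} != "false" ]]; then
    docker pull "${PYTHON_BASE_IMAGE}"
fi
```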
This will cause python upgrade in two cases:
- when images are rebuilt with --force-pull-images locally
- when images are upgraded with newer dependencies on master
(cherry picked from commit 61b448221d)
The whole Backfill class was in Heisentests but only one of those tests
is problematic now: test_backfill_depends_on_past. Therefore it makes
sense to remove the class from Heisentests and move the
depends_on_past test to quarantine.
It turned out that this was the last "Heisentest", and with the
isolation now coming with parallel tests, it turns out that
Heisentests are not really a good way of thinking about the tests - running
them in isolation often does not help; it only makes it more difficult
to flag the tests as flaky.
The quarantined test_backfill_depends_on_past has been captured in
the #14755 issue - and hopefully we will make an effort to
de-quarantine some of those tests soon.
(cherry picked from commit 4ce952e7c2)
AIRFLOW_COMMITERS was left over from a previous PR and should never
have been included.
EVENT_NAME and TARGET_COMMIT_SHA aren't needed as GitHub already provides
us with GITHUB_EVENT_NAME and GITHUB_SHA, which have the same values.
(cherry picked from commit c9a7314fce)
Now that we are using Py3.6+ we can rely on dictionary key order being
fixed (it was already fixed in 3.6, just not explicitly documented as
such until 3.7) -- as a result we can load the source, rather than try to
parse it with regexes
(cherry picked from commit 775f807005)
The change to redirect all breeze output to a file had the effect of
making stdin no longer a TTY, which meant that the docker container had
a fixed width of 80 columns.
The fix is to replace the use of `tee` with `script`, which does what
we want -- captures stdout+stderr, but makes the running program believe
that it is still attached to a terminal (if it ever was).
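A rough sketch of the difference (assuming the util-linux `script` utility; the placeholder command and log variable are hypothetical):
```bash
# Before: piping through tee detaches the program from the TTY,
# so docker falls back to an 80-column width.
breeze_command 2>&1 | tee "${OUTPUT_LOG}"

# After: script captures stdout+stderr while keeping a pseudo-terminal attached.
script --quiet --return --command "breeze_command" "${OUTPUT_LOG}"
```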
(cherry picked from commit 4c1e3c8a16)
* Use airflow db check in entrypoint_prod.sh
* fixup! Use airflow db check in entrypoint_prod.sh
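A minimal sketch of what using `airflow db check` in an entrypoint can look like (the retry loop is an assumption, not the verbatim entrypoint_prod.sh code):
```bash
# Wait until the metadata database is reachable before starting the component.
until airflow db check; do
    echo "Waiting for the metadata database to become reachable..."
    sleep 5
done
```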
Co-authored-by: Kamil Breguła <kamilbregula@apache.org>
(cherry picked from commit e273366ff8)
This makes loading local providers 1/3 quicker -- down to 2s from 3s
on my local SSD.
The `airflow.utils.yaml` module can be used in place of the normal yaml
module, with the bonus that `safe_load` will use libyaml where available
instead of always using the pure python version.
This shaves 3 minutes off the "WWW" tests - down to 8 minutes from
11 minutes.
I have not used this module in tests/docs code etc, as I don't want to
force importing `airflow` (and everything it currently brings in) into
those contexts.
(cherry picked from commit 7daebefd15)
Since https://github.com/apache/airflow/pull/12188 was merged I
don't think we need this step.
This step also caused the docker build step for 2.0.1rc2 to fail.
Co-authored-by: Jarek Potiuk <jarek@potiuk.com>
(cherry picked from commit 3ffd21745d)
This change implements per-provider versioning tools. The version of each
provider is retrieved from its provider.yaml file (top-level version).
Documentation is generated in the documentation folder rather than
in sources and embedded in the provider's index. Backport providers
remain as they were until we delete all the backport references in
April 2021, and then the code can be simplified and the
backport functionality removed.
When generating multiple providers, only those whose version has no
corresponding `providers-<PROVIDER>/<VERSION>` are generated.
Other providers are skipped with warnings.
Old documentation is removed and new CHANGELOG.rst files have been
prepared for all providers to accommodate the new process
(which is coming as a follow-up commit)
Fixes: #13272, #13271, #13274, #13276, #13277, #13275, #13273
(cherry picked from commit ac2f72c98d)
The asset recompilation message did not work well in the case of
tests - where we did not mount local sources. It was always
showing the instructions to recompile the assets.
This is now fixed and the "OK" message is green.
(cherry picked from commit 9f35cff16f)
* Production image can be run as root
* fixup! Production image can be run as root
* fixup! fixup! Production image can be run as root
Co-authored-by: Kamil Bregula <kamilbregula@Kamils-MacBook-Pro.local>
Co-authored-by: Kamil Breguła <kamilbregula@apache.org>
(cherry picked from commit 7979b7581c)