While cherry-picking the docker image changes to v2-0-test, the
value of the arg was wrongly renamed (similarly to other parameters) to
`constraints-2.0` where it should remain `constraints`.
This is the name of the constraint file to use, and its value might
be either `constraints-no-providers`, `constraints`, or
`constraints-source-providers`.
This change restores the proper default of the arg.
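For reference, a minimal sketch of the restored default (the exact ARG name is assumed from context; the valid values are the ones listed above):
```
ARG AIRFLOW_CONSTRAINTS="constraints"
```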
Fixes: #15493
The PROD image of airflow is OpenShift compatible and it can be
run either with the 'airflow' user (UID=50000) or with any other
user with GID=0.
This change adds umask 0002 to make sure that whenever the image
is extended and new directories get created, the directories are
group-writeable for GID=0. This is added in the default
entrypoint.
The entrypoint will now fail if the image is run neither as the
airflow user nor as an arbitrary user with GID=0.
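A hedged sketch of what the default entrypoint now does (messages and exact checks are illustrative; the UID/GID values come from the text above):
```
# sketch only - not the actual entrypoint code
if [[ "$(id -un)" != "airflow" && "$(id -g)" != "0" ]]; then
    echo "The image must be run as the 'airflow' user (UID=50000) or with GID=0" >&2
    exit 1
fi
umask 0002
```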
Fixes: #15107
(cherry picked from commit ce91872ecc)
* Upgrades moto to newer version (~=2.0)
According to https://github.com/spulec/moto/issues/3535#issuecomment-808706939
the 1.3.17 version of moto with a fix to be compatible with mock > 4.0.3 is
not going to be released because of breaking changes. Therefore we need
to migrate to a newer version of moto.
At the same time we can get rid of the old botocore limitation, which
was apparently added to handle some test errors. We now rely fully
on what boto3 depends on.
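In requirements terms the change is essentially a one-line pin (spec syntax illustrative; the old botocore upper bound is simply dropped, not replaced):
```
moto~=2.0
```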
Upgrading the dependencies also revealed that the MySQL tests needed
fixing, because the upgraded dependencies caused some test failures
(those turned out to be badly written tests).
* Adds dill exclusion to Dockerfiles to accommodate upcoming beam fix
With the upcoming apache-beam change where the mock library will be
removed from install dependencies, we will be able to remove the
`apache-beam` exclusion in our CI scripts. This will be the final
step of cleaning dependencies so that we have a truly
golden set of constraints that allows installing airflow
and all community-managed providers (we managed to fix all those
dependency issues for all packages but apache-beam).
Once the fix https://github.com/apache/beam/pull/14328 is merged
and Apache Beam is released, we will be able to migrate to the new
version and get rid of the CI exclusion for beam.
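A heavily hedged sketch of what such an exclusion can look like in the Dockerfile - both the ARG name and the pin are assumptions for illustration only:
```
ARG ADDITIONAL_PYTHON_DEPS="dill<0.3.2"
```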
Closes: #14994
(cherry picked from commit ec962b01b7)
Documentation update for the four previously excluded providers that
got extra fixes and were bumped to the latest versions of their libraries:
* apache.beam
* apache.druid
* microsoft.azure
* snowflake
(cherry picked from commit b753c7fa60)
Previously you had to specify AIRFLOW_VERSION_REFERENCE and
AIRFLOW_CONSTRAINTS_REFERENCE to point to the right version
of Airflow. Now those values are auto-detected if not specified
(but you can still override them).
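A hedged sketch of the auto-detection (variable derivation assumed for illustration):
```
if [[ -z "${AIRFLOW_CONSTRAINTS_REFERENCE}" ]]; then
    # e.g. installing Airflow 2.0.2 selects the constraints-2.0.2 reference
    AIRFLOW_CONSTRAINTS_REFERENCE="constraints-${AIRFLOW_VERSION}"
fi
```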
This change allowed us to simplify and restructure the Dockerfile
documentation. Following the recent change separating out
the docker-stack docs, the production image building documentation has
been improved to reflect those simplifications. It should be
much easier for novice users to grasp now - a very clear
distinction and separation is made between the two ways of
building your own images - customizing or extending - and it
is now much easier to follow the examples and find out how to
build your own image. The criteria for choosing an approach
were put first and foremost.
The examples have been reviewed, fixed and put in a logical
sequence, from the most basic ones to the most advanced,
with a clear indication where the basic approach ends and where
the "power-user" one starts. The examples were also moved
out to separate files and included from there - the
example Docker images and build commands are also executable
and tested automatically in CI, so they are guaranteed
to work.
Finally, the build arguments were split into sections - from the most
basic to the most advanced - and each section links to the appropriate
example section, showing how to use those parameters.
Fixes: #14848
Fixes: #14255
The production image did not have root group set as default for
the airflow user. This was not a big problem unless you extended
the image - in which case you had to change the group manually
when copying the images in order to keep the image OpenShift
compatible (i.e. runnable with any user and root group).
This PR fixes it by changing the default group of the airflow user
to root, which also works when you extend the image.
```
Connected.
airflow@53f70b1e3675:/opt/airflow$ ls
dags logs
airflow@53f70b1e3675:/opt/airflow$ cd dags/
airflow@53f70b1e3675:/opt/airflow/dags$ ls -l
total 4
-rw-r--r-- 1 airflow root 1648 Mar 22 23:16 test_dag.py
airflow@53f70b1e3675:/opt/airflow/dags$
```
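For illustration, a hedged sketch of the underlying change - creating the airflow user with root as its primary group (the exact command and flags are assumed):
```
RUN useradd --uid 50000 --gid 0 --create-home --home-dir /home/airflow airflow
```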
There are many more references to "master" (even in our own repo) than
these, but this commit is the first step in that process.
It makes CI run on the main branch (once it exists) and re-words a few
cases where we can easily avoid referring to master anymore.
This doesn't yet re-name the `constraints-master` or `master-*` images -
that will be done in a future PR.
(We won't be able to entirely eliminate "master" from our repo as we
refer to a lot of other GitHub repos that we can't change.)
(cherry picked from commit 0dea083fcb)
In order to optimize the Docker image, we use the ~/.local
folder copied from the build image (this gives huge optimisations
regarding the docker image size). So far we instructed the users
to add the --user flag manually when installing any packages when they
extend the images, however this has proven to be problematic as
users rarely read the whole documentation and simply try what they
know.
This PR attempts to fix it. `PIP_USER` variable is set to `true`
in the final image, which means that the installation by default
will use ~/.local folder as target. This can be disabled by
unsetting the variable or setting it to `false`.
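In Dockerfile terms the change boils down to (sketch):
```
ENV PIP_USER="true"
```
With that set, a plain `RUN pip install some-package` in an extending image lands in ~/.local without the explicit --user flag.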
Also, since pylint 2.7.0 has been released, this fixes a few pylint
issues so that we can update to the latest constraints.
(cherry picked from commit ca35bd7f7f)
* Update hadolint from v1.18.0 to v1.22.1
* fixup! Update hadolint from v1.18.0 to v1.22.1
* fixup! fixup! Update hadolint from v1.18.0 to v1.22.1
Co-authored-by: Kamil Breguła <kamilbregula@apache.org>
(cherry picked from commit cc7260a9e8)
There are two types of constraints now:
* default constraints that contain all dependencies of airflow,
all the provider packages released at the time of the release
of that version, as well as all transitive dependencies. Following
those constraints, you can be sure Airflow's installation is
repeatable
* no-providers constraints - containing only the dependencies needed
for the core airflow installation. This allows installing/upgrading
airflow without also forcing providers to be installed at the
specific versions tied to that version of Airflow.
This allows for flexible management of Airflow and Provider
packages separately. Documentation about it has been added.
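A hedged usage example (the URL pattern follows the published constraint files; version numbers are illustrative):
```
# full, repeatable installation:
pip install "apache-airflow==2.0.2" \
  --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.0.2/constraints-3.8.txt"

# core-only install/upgrade that does not force provider versions:
pip install --upgrade "apache-airflow==2.0.2" \
  --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.0.2/constraints-no-providers-3.8.txt"
```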
Also, the provider 'extras' for apache airflow no longer keep direct
dependencies on the packages needed by the provider. Those
dependencies are now transitive only - so a 'provider' extra only
depends on the 'apache-airflow-providers-EXTRA' package and all
the dependencies are transitive. This will help in the future
to avoid conflicts when installing newer providers using extras.
(cherry picked from commit d524cec99d)
When the image is prepared, the PIP installation produces progress
bars, which are annoying - especially in the CI environment.
This PR adds an argument to control the progress bar and sets it
to "off" for CI builds.
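A hedged sketch of the new argument (the ARG name is assumed; pip reads the matching PIP_PROGRESS_BAR environment variable):
```
ARG PIP_PROGRESS_BAR="on"
ENV PIP_PROGRESS_BAR=${PIP_PROGRESS_BAR}
# CI builds pass: --build-arg PIP_PROGRESS_BAR="off"
```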
(cherry picked from commit 9b7852e047)
Revert "Fix Commands to install Airflow in docker/install_airflow.sh (#14099)"
This reverts commit 68758b8260.
Also fixes the docker build script that was the reason for the
original attempt to fix it.
(cherry picked from commit 212d5cd315)
* Adds capability of switching to Github Container Registry
Currently we are using GitHub Packages to cache images for the
build. GitHub Packages is a "legacy" storage of binary artifacts
for GitHub, and as of September 2020 GitHub introduced GitHub
Container Registry as a more stable, easier to manage replacement
for container storage. It includes complete self-management
of the images, including permission management, public access,
retention management and much more.
More about it here:
https://github.blog/2020-09-01-introducing-github-container-registry/
Recently we started to experience unstable behaviour of
GitHub Packages ('unknown blob' errors and manifest v1 vs. v2 issues
when pushing images to it). So together with the ASF we proposed to
enable GitHub Container Registry, and it happened as of
January 2021.
More about it in https://issues.apache.org/jira/browse/INFRA-20959
We are currently in the testing phase, especially when it
comes to management of permissions - the permission management
model is not the same for the Container Registry as it was
for GitHub Packages (it was per-repository in GitHub Packages,
but it is organization-wide in the Container Registry).
This PR introduces an option to use GitHub Container Registry
rather than GitHub Packages. It is implemented at both the CI
level and the Breeze level, allowing to seamlessly switch between
those two solutions:
In Breeze (which we use to test pushing/pulling the images) the
--github-registry option was added, taking either `ghcr.io` (GitHub
Container Registry) or `docker.pkg.github.com` (GitHub Packages).
In CI the same can be achieved by setting the GITHUB_REGISTRY value
(same values possible as for the --github-registry Breeze parameter).
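Example invocations (values from the text above; exact Breeze flag placement assumed):
```
# Breeze:
./breeze build-image --github-registry ghcr.io

# CI:
export GITHUB_REGISTRY="ghcr.io"   # or docker.pkg.github.com
```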
* fixup! Adds capability of switching to Github Container Registry
(cherry picked from commit 2c6c7fdb23)
We've introduced chmod a+x for installation scripts in Dockerfiles,
but this turned out to be a bad idea. It was done to accommodate
building on Azure DevOps, which has a filesystem that does not
keep the executable bit. But the side-effect is that the
layer of the script is invalidated when the permission is changed
to +x on Linux. The problem is that the script locally (on
checkout) has different permissions depending on the umask setting.
Therefore changing permissions in the image to a+x is not best.
Instead we run the scripts with bash directly, which does
not require changing the executable bit.
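The change in Dockerfile terms (script path as it appears elsewhere in this log; the "before" line is illustrative):
```
# before - flipping the executable bit invalidates the layer:
# RUN chmod a+x ./scripts/docker/install_mysql.sh && ./scripts/docker/install_mysql.sh dev
# after - no permission change needed:
RUN bash ./scripts/docker/install_mysql.sh dev
```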
(cherry picked from commit 18d9320c26)
When the production image is built for development purposes, by default
it installs all providers from sources, but not all dependencies
are installed for all providers. Many providers require more
dependencies, and when you try to import those packages via the
providers manager, they fail to import and print warnings.
Those warnings are now turned into debug messages in case
AIRFLOW_INSTALLATION_METHOD=".", which is set when the
production image is built locally from sources. This is helpful
especially when you use a locally built production image to
run K8S tests - otherwise the logs are flooded with
warnings.
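For illustration, this is the kind of build that triggers the quieter behaviour (variable and value from the text; the rest of the command is omitted):
```
docker build . --build-arg AIRFLOW_INSTALLATION_METHOD="."
```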
This problem does not happen in CI, because there, by default, the
production image is built from locally prepared packages
and it does not contain sources from providers that are not
installed via packages.
(cherry picked from commit f74da5025d)
This should allow us to release a new version of snowflake
provider that is not interacting with other providers via
monkeypatching of SSL classes.
Fixes: #12881
(cherry picked from commit 6e90dfc38b)
This change moves the provider-imposed requirements from
airflow's setup.cfg to additional configuration in the
breeze/CI scripts. This does not change the constraint approach
when installing airflow - the constraints to those versions
remain as they were - but the airflow package does not have to
have the limits in the 'install_requires' section, which makes
it much more "standalone".
We can add more requirements there as needed or remove
them when provider's dependencies change.
Also, thanks to the --upgrade-to-newer-dependencies flag in
Breeze, the instructions on what to do when there is
a problem with conflicting dependencies are much simpler.
You no longer need to set a label on the PR
to test what an upgrade to newer dependencies will look like;
you can test it yourself locally.
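Testing it locally looks roughly like this (flag from the text; exact Breeze syntax assumed):
```
./breeze build-image --upgrade-to-newer-dependencies
```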
This is the final step of making the airflow package fully
independent from the providers' dependencies.
(cherry picked from commit f49f36b6a0)
Airflow and provider packages need to be installed together to
make sure that constraints are taken into account and that airflow
does not get reinstalled from PyPI when an eager upgrade runs.
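A hedged sketch of the single-resolver install (paths follow the docker-context-files examples later in this log):
```
# one pip invocation, so the resolver sees airflow and providers together:
pip install /docker-context-files/apache_airflow-*.whl \
            /docker-context-files/apache_airflow_providers_*.whl
```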
(cherry picked from commit bc6f5ea088)
In the latest change (#13422), a change in the way production images are
prepared removed extras from the installed airflow - thus causing a
failing production image verification check.
This change restores the extras when airflow is installed from packages.
(cherry picked from commit 3a731108f5)
This PR improves building production image from local packages,
in preparation for moving provider requirements out of setup.cfg.
Previously the `pip download` step was executed in the CI scripts
in order to download all the packages that were needed. However
this had two problems:
1) The PIP download was executed outside of the Dockerfile in CI scripts,
which means that any change to requirements there could not
be executed in the 'workflow_run' event - because the main branch version
of the CI scripts is used there. We want to add extra requirements
when installing airflow, so in order to be able to change
them, those requirements should be added in the Dockerfile.
This will be done in the follow-up #13409 PR.
2) Packages downloaded with pip download have a "file" version
rather than a regular == version when you run pip freeze/check.
This looks weird, and while you can figure out the version
from the file name, when you `pip install` them they look
much more normal. The airflow package and provider packages
will still get the "file" form, but this is OK because we are
building those packages from sources and they are not yet
available in PyPI.
Example:
adal==1.2.5
aiohttp==3.7.3
alembic==1.4.3
amqp==2.6.1
apache-airflow @ file:///docker-context-files/apache_airflow-2.1.0.dev0-py3-none-any.whl
apache-airflow-providers-amazon @ file:///docker-context-files/apache_airflow_providers_amazon-1.0.0-py3-none-any.whl
apache-airflow-providers-celery @ file:///docker-context-files/apache_airflow_providers_celery-1.0.0-py3-none-any.whl
...
With this PR, we do not `pip download` all packages; instead
we prepare airflow + providers packages as .whl files and
install them from there (all the other dependencies are installed
from PyPI).
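A hedged sketch of the new flow (commands illustrative; the real steps are driven by the CI scripts):
```
# build airflow + provider wheels into the Docker context ...
pip wheel --no-deps --wheel-dir docker-context-files/ .
# ... then install from there inside the image; everything else comes from PyPI
pip install /docker-context-files/*.whl
```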
(cherry picked from commit e436883583)
Previously the UPGRADE_TO_LATEST_CONSTRAINTS variable controlled
whether the CI image uses the latest dependencies rather than
fixed constraints. This PR brings it to the PROD image as well.
The name of the ARG is changed to UPGRADE_TO_NEWER_DEPENDENCIES,
as this corresponds better with the intention.
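The ARG as it now appears in both images (default value assumed):
```
ARG UPGRADE_TO_NEWER_DEPENDENCIES="false"
```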
(cherry picked from commit 82fa048c12)
It seems that for quite some time (since 1.10.4) the "ldap" extra
has been missing the python-ldap dependency.
https://issues.apache.org/jira/browse/AIRFLOW-5261
Also, LDAP seems to be popular enough to be added as a default
extra in the production image.
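With the fix, the extra behaves as expected (illustrative):
```
pip install "apache-airflow[ldap]"   # now pulls in python-ldap
```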
Fixes: #13306
(cherry picked from commit d23ac9b235)
Some older versions of PIP (including the one on DockerHub!) treat
all env variables starting with PIP_ as a way to pass
options. Setting PIP_VERSION to 20.2.4 and exporting it causes the
error "ValueError: invalid truth value '20.2.4'", because pip
does not have a --version option and treats it as --verbose
¯\_(ツ)_/¯
You can read more about it here:
https://github.com/pypa/pip/issues/4528
This PR renames the variable to avoid this side effect.
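A hedged reproduction and workaround (the new variable name is an assumption):
```
export PIP_VERSION=20.2.4          # old pip parses this as one of its options -> ValueError
export AIRFLOW_PIP_VERSION=20.2.4  # renamed: no PIP_ prefix, no side effect
```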
(cherry picked from commit 8fed541192)
* Install airflow and providers from dist and verify them
This check is there to prevent problems similar to those reported
in #13027 and fixed in #13031.
Previously we always built airflow from wheels, only providers were
installed from sdist packages and tested. In this version both
airflow and providers are installed using the same package format
(sdist or wheel).
* Update scripts/in_container/entrypoint_ci.sh
Co-authored-by: Kaxil Naik <kaxilnaik@gmail.com>
Co-authored-by: Kaxil Naik <kaxilnaik@gmail.com>
* Changes release image preparation to use PyPI packages
Since we have now released all the provider packages to PyPI in
RC versions, we can change the mechanism used to prepare the
production image to use released packages in the case of tagged builds.
The "branch" production images are still prepared using the
CI images and .whl packages built from sources, but the
release packages are built from officially released PyPI
packages.
Also some corrections and updates were made to the release process:
* the constraint tags when an RC candidate is sent should contain
the rcN suffix.
* there was a missing step about pushing the release tag once the
release is out
* pushing the tag to GitHub should be done after the PyPI packages
are uploaded, so that automated image building in DockerHub
can use those packages.
* added a note that in case we release some provider
packages that depend on the just-released airflow version,
they should be released after airflow is in PyPI but before
the tag is pushed to GitHub (also to allow the image to be
built automatically from the released packages)
Fixes: #12970
* Update dev/README_RELEASE_AIRFLOW.md
Co-authored-by: Ash Berlin-Taylor <ash_github@firemirror.com>
* Update dev/README_RELEASE_AIRFLOW.md
Co-authored-by: Ash Berlin-Taylor <ash_github@firemirror.com>
Co-authored-by: Ash Berlin-Taylor <ash_github@firemirror.com>
Fix a permission issue in Azure DevOps when running the script install_mysql.sh, which prevents the build from succeeding:
/bin/bash: ./scripts/docker/install_mysql.sh: Permission denied
The command '/bin/bash -o pipefail -e -u -x -c ./scripts/docker/install_mysql.sh dev' returned a non-zero code: 126
##[error]The command '/bin/bash -o pipefail -e -u -x -c ./scripts/docker/install_mysql.sh dev' returned a non-zero code: 126
##[error]The process '/usr/bin/docker' failed with exit code 126
* Apply labels to Docker images in a single instruction
While looking at the build logs for something else I noticed this
oddity at the end of the CI logs:
```
Tue, 08 Dec 2020 21:20:19 GMT Step 125/135 : LABEL org.apache.airflow.distro="debian"
...
Tue, 08 Dec 2020 21:21:14 GMT Step 133/135 : LABEL org.apache.airflow.commitSha=${COMMIT_SHA}
Tue, 08 Dec 2020 21:21:14 GMT ---> Running in 1241a5f6cdb7
Tue, 08 Dec 2020 21:21:21 GMT Removing intermediate container 1241a5f6cdb7
```
Applying all the labels took 1m2s! Hopefully applying these in a single
layer/command should speed things up.
A less extreme example still took 43s:
```
Tue, 08 Dec 2020 20:44:40 GMT Step 125/135 : LABEL org.apache.airflow.distro="debian"
...
Tue, 08 Dec 2020 20:45:18 GMT Step 133/135 : LABEL org.apache.airflow.commitSha=${COMMIT_SHA}
Tue, 08 Dec 2020 20:45:18 GMT ---> Running in dc601207dbcb
Tue, 08 Dec 2020 20:45:23 GMT Removing intermediate container dc601207dbcb
Tue, 08 Dec 2020 20:45:23 GMT ---> 5aae5dd0f702
```
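The single-instruction form looks like this (only the two label keys visible in the logs above are shown):
```
LABEL org.apache.airflow.distro="debian" \
      org.apache.airflow.commitSha=${COMMIT_SHA}
```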
* Update Dockerfile
So far, the production images of Airflow were using sources
when they were built on CI. This PR changes that to build
airflow + providers packages first and install them,
rather than using sources as the installation mechanism.
Part of #12261
We do not need to add docker-context-files in CI before we run the
first "cache" PIP installation. Adding it earlier might cause the
cache to always be invalidated in case someone has
a file added there before building and pushing the image.
This PR fixes the problem by adding docker-context-files later
in the Dockerfile and changing the constraints location
used in the "cache" step to always use the GitHub constraints in
this case.
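A hedged ordering sketch (names illustrative):
```
# "cache" installation pinned to the GitHub constraints - local files cannot invalidate it
RUN pip install apache-airflow --constraint "${GITHUB_CONSTRAINTS_URL}"
# only now bring in the local context files
COPY docker-context-files /docker-context-files
```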
Closes: #12509
If you used the context from the git repo, the .pypirc file was missing,
and COPY in the Dockerfile is not conditional.
This change copies the .pypirc conditionally from the
docker-context-files folder instead.
Also, it was previously needlessly copied into the main image, where it
is not needed and was even dangerous to include.
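A hedged sketch of the conditional copy (the folder is copied as a whole, and the file is used only if present):
```
COPY docker-context-files /docker-context-files
RUN if [ -f /docker-context-files/.pypirc ]; then \
        cp /docker-context-files/.pypirc ~/.pypirc; \
    fi
```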
There was a typo in the original file when the review was made in
#11529, but apparently this typo was still left in one place,
and as a result, providers were not installed in the
master Dockerfile.
Fixes: #11695
In Airflow 2.0 we decided to split Airflow into separate providers.
This means that when you prepare the core airflow package, providers
are not installed by default. This is not very convenient for
local development though, nor for docker images built from sources,
where you would like to install all providers by default.
A new INSTALL_ALL_AIRFLOW_PROVIDERS environment variable controls
this behaviour now. If it is set to "true", all packages including
provider packages are installed. If missing or set to "false", only
the core airflow package is installed.
For Breeze, the default is set to "true", as in those cases you
want to install all providers in your environment; similarly when you
build the production image from sources. However, when you build the
image using a GitHub tag or a pip package, you should specify the
appropriate extras to install the required provider packages.
Note that if you install Airflow via 'pip install .' from sources
in local virtualenv, provider packages are not going to be
installed unless you set INSTALL_ALL_AIRFLOW_PROVIDERS to "true".
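For example (variable and command from the text):
```
INSTALL_ALL_AIRFLOW_PROVIDERS="true" pip install .
```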
Fixes#11489
The production image had the capability of installing airflow from
wheels (for security teams/air-gapped systems). This capability
might also be useful when building the CI image, especially when
we are installing the core and provider packages separately and
we do not yet have provider packages available in PyPI.
This is an intermediate step to implement #11490
* Add capability of customising PyPI sources
This change adds the capability of customising the installation of PyPI
modules via a custom .pypirc file. This allows installing
dependencies from an in-house, vetted PyPI registry.
* Constraints and PIP packages can be installed from local sources
This is the final part of implementing #11171, based on feedback
from enterprise customers we worked with. They want to have
the capability of building the image using binary wheel packages
that are locally available and the official Dockerfile. This means
that besides the official APT sources, the Dockerfile build should
not need GitHub, nor any other external files pulled from outside,
including the PIP repository.
This change also includes documentation on how to prepare a set of
such binaries ready for inspection and review by security teams
in an Enterprise environment. Such sets of "known-working-binary-whl"
files can then be separately committed, tracked and scrutinized
in an artifact repository of such an Enterprise.
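A hedged example of such a build (ARG names assumed for illustration):
```
docker build . \
  --build-arg INSTALL_FROM_DOCKER_CONTEXT_FILES="true" \
  --build-arg AIRFLOW_CONSTRAINTS_LOCATION="/docker-context-files/constraints-3.8.txt"
```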
Fixes: #11171
* Update docs/production-deployment.rst
* Allows more customizations for image building.
This is the third (and not the last) part of making the Production
image more corporate-environment friendly. It's been prepared
at the request of one of the big Airflow users (a company) that
has rather strict security requirements when it comes to
preparing and building images. They are committed to
synchronizing with the progress of Apache Airflow 2.0 development,
and making the image customizable so that they can build it using
only sources controlled by them internally was one of their
important requirements.
This change adds the possibility of customizing various steps in
the build process:
* adding custom scripts to be run before installation in both the
build image and the runtime image. This allows, for example,
installing custom GPG keys and adding custom sources.
* customizing the way NodeJS and Yarn are installed in the
build image segment - as they might rely on their own way
of installation.
* adding extra packages to be installed during both the build and
dev segment build steps. This is crucial to achieve the same
size optimizations as the original image.
* defining additional environment variables (for example,
environment variables that indicate acceptance of EULAs
in case of installing proprietary packages that require
EULA acceptance) - both in the build image and the runtime image
(again, the goal is to keep the image optimized for size).
The image build process remains the same when no customization
options are specified, but having those options increases
flexibility of the image build process in corporate environments.
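A hedged example pulling several of those options together (ARG names assumed for illustration):
```
docker build . \
  --build-arg ADDITIONAL_DEV_APT_DEPS="gnupg" \
  --build-arg ADDITIONAL_RUNTIME_APT_DEPS="vim" \
  --build-arg ADDITIONAL_DEV_APT_ENV="ACCEPT_EULA=Y"
```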
This is part of #11171.
This change also fixes some of the issues opened and raised by
other users of the Dockerfile.
Fixes: #10730
Fixes: #10555
Fixes: #10856
Input from those issues has been taken into account when this
change was designed, so that the cases described in those issues
could be implemented. An example from one of the issues landed as
an example way of building a highly customized Airflow image
using those customization options.
Depends on #11174
* Update IMAGES.rst
Co-authored-by: Kamil Breguła <mik-laj@users.noreply.github.com>
This is the second step of making the Production Docker Image more
corporate-environment friendly, by making MySQL client installation
optional. Installing the MySQL client on Debian requires reaching out
to Oracle deb repositories, which might not be approved by security
teams when you build the images. Also, not everyone needs the MySQL
client, and some might want to install their own MySQL or MariaDB
client - from their own repositories.
This change separates the installation step out into a
script (with a prod/dev installation option). The prod/dev separation
is needed because MySQL needs to be installed with dev libraries
in the "Build" segment of the image (requiring build essentials
etc.), but in the "Final" segment of the image only runtime libraries
are needed.
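In Dockerfile terms (script path as shown in the build logs elsewhere in this log; sketch):
```
# "Build" segment - dev libraries needed for compilation:
RUN bash ./scripts/docker/install_mysql.sh dev
# "Final" segment - runtime libraries only:
RUN bash ./scripts/docker/install_mysql.sh prod
```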
Part of #11171
Depends on #11173.
This is the first step of implementing the corporate-environment-friendly
way of building images, where in a corporate environment it might
not be possible to install the packages using the GitHub cache initially.
Part of #11171
Airflow below 1.10.2 required the SLUGIFY_USES_TEXT_UNIDECODE env
variable to be set to yes.
Our production Dockerfile and Breeze support building images
for any version of airflow >= 1.10.1, but it failed on
1.10.2 and 1.10.1 because this variable was not set.
You can now set the variable when building the image manually,
and Breeze does it automatically if the image is for 1.10.1 or 1.10.2.
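Building such an image manually (variable and versions from the text; the AIRFLOW_VERSION ARG is assumed):
```
docker build . \
  --build-arg AIRFLOW_VERSION="1.10.2" \
  --build-arg SLUGIFY_USES_TEXT_UNIDECODE="yes"
```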
Fixes: #10974
Since we are running the airflow image as the airflow user, the
entrypoint and clear-logs scripts should also be owned by airflow.
This had no impact if you actually ran this as the root user or
when your group was root (which was recommended).
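A hedged sketch of the ownership change (script names from the text; file extensions, destination and group are assumptions):
```
COPY --chown=airflow:root entrypoint.sh clear-logs.sh /
```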
After #10368, we've changed the way we build the images
on CI. We are overriding the ci scripts that we use
to build the image with the scripts taken from master,
so as not to give rogue PR authors the possibility to run
something with the write credentials.
We should not override the in_container scripts, however,
because they become part of the image, so we should use
those that came with the PR. That's why we have to move
the "in_container" scripts out of the "ci" folder and
only override the "ci" folder with the one from
master. We've made sure that those scripts in ci
are self-contained and that they do not need to reach outside of
that folder.
Also the static checks are done with local files mounted
on CI because we want to check all the files - not only
those that are embedded in the container.
The EMBEDDED dags were only really useful for testing,
but they required customising the built production image
(running with an extra --build-arg flag). This is not needed,
as it is better to extend the image with FROM
and add dags afterwards. This way you do not have
to rebuild the image while iterating on them.
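Extending instead of rebuilding (tag illustrative; the dags path is the one shown in the session earlier in this log):
```
FROM apache/airflow:2.0.2
COPY --chown=airflow:root ./dags/ /opt/airflow/dags/
```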
* Constraint files are now maintained automatically
* No need to generate requirements when setup.py changes
* Requirements are kept in separate orphan branches, not in the main repo
* Merges to master verify if the latest requirements are working and
push tested requirements to the orphan branches
* We keep a history of requirement changes and can label them
individually for each version (by the constraints-1.10.n tag name)
* Consistently changed all references to be 'constraints', not
'requirements'