closes https://github.com/apache/airflow/issues/14327
When using `KubernetesExecutor` with a task defined as follows:
```python
PythonOperator(
    task_id=f"sync_{table_name}",
    python_callable=sync_table,
    provide_context=True,
    op_kwargs={"table_name": table_name},
    executor_config={"KubernetesExecutor": {"request_cpu": "1"}},
    retries=5,
    dag=dag,
)
```
it breaks the UI, because setting resources this way is only kept
for backwards compatibility.
This commit fixes it.
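For reference, a minimal sketch of the non-deprecated way to request resources on Airflow 2.x via `pod_override` (it mirrors the illustrative task above and is not part of this change):
```python
from airflow.operators.python import PythonOperator
from kubernetes.client import models as k8s

PythonOperator(
    task_id=f"sync_{table_name}",
    python_callable=sync_table,
    op_kwargs={"table_name": table_name},
    executor_config={
        "pod_override": k8s.V1Pod(
            spec=k8s.V1PodSpec(
                containers=[
                    k8s.V1Container(
                        name="base",
                        # request 1 CPU for the task's pod
                        resources=k8s.V1ResourceRequirements(requests={"cpu": "1"}),
                    )
                ]
            )
        )
    },
    retries=5,
    dag=dag,
)
```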
(cherry picked from commit 7b577c35e2)
The PROD image of airflow is OpenShift compatible and it can be
run either with the 'airflow' user (UID=50000) or with any other
user with GID=0.
This change adds umask 0002 to make sure that whenever the image
is extended and new directories get created, the directories are
group-writeable for GID=0. This is added in the default
entrypoint.
The entrypoint will fail if it is not run as the airflow user or if
another, arbitrary user with GID != 0 is used.
Fixes: #15107
(cherry picked from commit ce91872ecc)
While checking the test status of various CI tests we came to the
conclusion that the Presto integration took a lot of memory (~1GB)
and was the main source of failures during integration tests,
especially with MySQL8. The attempt to fine-tune the memory
used led to the discovery that Presto stopped
publishing their Docker image (prestosql/presto), apparently
in the aftermath of splitting off Trino from Presto.
The split-off was already discussed in #14281 and it was planned
to add support for Trino (which is the more community-driven
fork of Presto; Presto remained under Facebook governance,
whereas Trino is an effort continued by the original creators).
You can read more about it in the announcement:
https://trino.io/blog/2020/12/27/announcing-trino.html. While
Presto continues their way under The Linux Foundation, Trino
lives its own life and keeps on maintaining all artifacts and
libraries (including the image). That allowed us to update
our tests and decrease the memory footprint by around 400MB.
This commit:
* adds the new Trino provider (see the usage sketch after this list)
* removes the `presto` integration and replaces it with `trino`
* the `trino` integration image is built with ~400MB lower memory
requirements and published as `apache/airflow:trino-*`
* moves the integration tests from Presto to Trino
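A minimal usage sketch of the new provider's hook (the connection id and query below are illustrative assumptions, not part of this change):
```python
from airflow.providers.trino.hooks.trino import TrinoHook

# connection id and query are placeholders for illustration
hook = TrinoHook(trino_conn_id="trino_default")
rows = hook.get_records("SELECT node_id, state FROM system.runtime.nodes")
print(rows)
```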
Fixes: #14281
(cherry picked from commit eae22cec9c)
Originally, the constraints were generated in separate jobs and uploaded as
artifacts and then joined by a separate push job. Thanks to parallel
processing, we can now do that all in a single job, with both cost and
time savings.
(cherry picked from commit aebacd7405)
This PR sets Python-3.6-specific limits for some of the packages
that recently dropped support for Python 3.6 binary packages
released via PyPI. Even if those packages did not drop
Python 3.6 support entirely, it gets more and more difficult to
get those packages installed (both locally and in the Docker image)
because they require the packages to be compiled and they often
require a number of external dependencies to do so.
This makes it difficult to automatically upgrade dependencies,
because such an upgrade fails for Python 3.6 images if we attempt
to do so.
This PR limits several of those dependencies (dask/pandas/numpy)
so that on Python 3.6 they do not use the latest major releases but
stay on the latest versions that still support Python 3.6.
Also, a comment/clarification was added to the recently (#15114) added
limit for `pandas-gbq`. That limit was added because of a broken
import in the BigQuery provider, but the comment explaining it was
missing, so the comment is added now.
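A hypothetical illustration of how such Python-3.6-specific limits can be expressed with environment markers (the bounds below are assumptions for illustration, not the actual pins used):
```python
# hypothetical setup.py fragment -- version bounds are illustrative only
install_requires = [
    # newer majors dropped Python 3.6 wheels, so cap them only for 3.6
    'numpy; python_version >= "3.7"',
    'numpy<1.20; python_version < "3.7"',
    'pandas; python_version >= "3.7"',
    'pandas<1.2; python_version < "3.7"',
]
```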
(cherry picked from commit e49722859b)
K8S has a one-year support policy. This PR updates the
K8S versions we use for testing to the latest available in the three
currently supported versions of K8S: 1.20, 1.19, 1.18.
The 1.16 and 1.17 versions are not supported any more as of today.
https://en.wikipedia.org/wiki/Kubernetes
This change also bumps kind (which we use for K8S testing) to the
latest version and fixes the configuration to match this version.
(cherry picked from commit 36ab9dd7c4)
This PR fixes a problem introduced by #14144
This is a very weird and unforeseen issue. The change introduced a
new import from flask `before_render_template` and this caused
flask to require `blinker` dependency, even if it was not
specified before as 'required' by flask. We have not seen it
before, because changes to this part of the code do not trigger
K8S tests; however, subsequent PRs started to fail because
setup.py did not have `blinker` as a dependency.
In the CI image `blinker` was installed anyway, because it is
needed by sentry, so the problem was only detectable in the
production image.
This is ultimate proof that our test harness is really good at
catching this kind of error.
The root cause for it is described in
https://stackoverflow.com/questions/38491075/flask-testing-signals-not-supported-error
Flask support for signals is optional and flask does not declare blinker
as a dependency, but importing some parts of flask triggers the need
for signals.
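A rough sketch of the failure mode, assuming blinker is not installed (the handler and exact error text are illustrative and may differ between Flask versions):
```python
# assumes blinker is NOT installed in the environment
from flask import before_render_template


def log_rendering(sender, template, context, **extra):
    print("rendering", template.name)


# without blinker, Flask falls back to a fake signal object whose
# connect() raises a RuntimeError about missing signalling support
before_render_template.connect(log_rendering)
```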
(cherry picked from commit 437850bd16)
The 'wheel' package installation tests all options
comprehensively - including preparing documentation
and installing on Airflow 2.0.
The 'sdist' package installation takes longer (because
the packages are converted to wheels on-the-fly by
pip), so only basic installation is tested (the rest
is the same as in the case of wheel packages).
(cherry picked from commit 2e8aa0d109)
Sometimes when docs are building in parallel, it takes longer
than 4 minutes to build a big package and the job fails with
timeout.
This change increases the individual package build timeout
from 4 minutes to 8.
(cherry picked from commit 95ae24a953)
Not all docker commands are replaced with functions now.
Earlier we replaced all docker commands with a function to be able
to capture the docker commands used and display them with -v in breeze.
This has proven to be harmful, as it is unexpected behaviour
for a docker command.
This change introduces a docker_v command which outputs the command
when needed.
(cherry picked from commit 535e1a8e69)
Some of the test jobs are hanging - either because of some
weird race conditions in docker or because the test hangs (this happens
for quarantined tests). This change sets the maximum time we let
the test suite execute to 25 minutes.
(cherry picked from commit a4aee3f1d0)
This is far more complex than it should be because of
autoapi problems with parallel execution. Unfortunately autoapi
does not cope well when several autoapis are run in parallel on
the same code - even if they are run in separate processes and
for different packages. Autoapi uses common _doctree and _api
directories generated in the source code and they overwrite
each other if two or more instances run in parallel.
The solution in this PR is mostly applicable to the CI environment.
In this case we have docker images that have already been built
using the current sources, so we can safely run separate docker
containers without mapping the sources and run the generation
of documentation separately and independently in each container.
This seems to work really well, speeding up docs generation
2x on public GitHub runners and 8x on self-hosted runners.
Public runners:
* 27m -> 15m
Self-hosted runners:
* 27m -> < 8m
(cherry picked from commit 741a54502f)
The fix for this was very easy -- just a `timer` -> `timed` typo.
However it turns out that the tests for airflow.stats were insufficient
and didn't catch this, so I have extended the tests in two ways:
1. Test all the other stat methods, not just incr (gauge, timer, timing,
decr)
2. Use the "auto-speccing" feature of Mock to ensure that we can't make up
methods to call on a mock object (see the sketch after the quote below).
> Autospeccing is based on the existing spec feature of mock.
> It limits the api of mocks to the api of an original object (the
> spec), but it is recursive (implemented lazily) so that attributes of
> mocks only have the same api as the attributes of the spec. In
> addition mocked functions / methods have the same call signature as
> the original so they raise a TypeError if they are called
> incorrectly.
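An illustrative sketch of what auto-speccing catches (the stats client class below is a stand-in for illustration, not the real statsd client):
```python
from unittest import mock


class StatsClient:
    # stand-in for a statsd-style client, for illustration only
    def incr(self, stat, count=1):
        ...

    def timing(self, stat, delta):
        ...


client = mock.create_autospec(StatsClient)
client.incr("scheduler.heartbeat")     # OK: incr exists on the spec
client.timer("scheduler.heartbeat")    # AttributeError: made-up method is rejected
```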
(cherry picked from commit b7cd2df056)
There have been long standing issues where the scheduler would "stop
responding" that we haven't been able to track down.
Someone was able to catch the scheduler in this state in 2.0.1 and
inspect it with py-spy (thanks, MatthewRBruce!)
The stack traces (slightly shortened) were:
```
Process 6: /usr/local/bin/python /usr/local/bin/airflow scheduler
Python v3.8.7 (/usr/local/bin/python3.8)
Thread 0x7FF5C09C8740 (active): "MainThread"
_send (multiprocessing/connection.py:368)
_send_bytes (multiprocessing/connection.py:411)
send (multiprocessing/connection.py:206)
send_callback_to_execute (airflow/utils/dag_processing.py:283)
_send_dag_callbacks_to_processor (airflow/jobs/scheduler_job.py:1795)
_schedule_dag_run (airflow/jobs/scheduler_job.py:1762)
Process 77: airflow scheduler -- DagFileProcessorManager
Python v3.8.7 (/usr/local/bin/python3.8)
Thread 0x7FF5C09C8740 (active): "MainThread"
_send (multiprocessing/connection.py:368)
_send_bytes (multiprocessing/connection.py:405)
send (multiprocessing/connection.py:206)
_run_parsing_loop (airflow/utils/dag_processing.py:698)
start (airflow/utils/dag_processing.py:596)
```
What this shows is that both processes are stuck trying to send data to
each other, and neither can proceed: both send buffers are full, and since
both sides are trying to send, neither side is going to read and make more
space in the buffer. A classic deadlock!
The fix for this is twofold:
1) Enable non-blocking IO on the DagFileProcessorManager side.
The only thing the Manager sends back up the pipe is (now, as of 2.0)
the DagParsingStat object, and the scheduler will happily continue
without receiving these, so in the case of a blocking error it is
simply better to ignore the error, continue the loop and try sending
again later (see the sketch after this list).
2) Reduce the size of DagParsingStat.
In the case of a large number of dag files we included the path of
each and every one (in full) in _each_ parsing stat. Not only did the
scheduler do nothing with this field (meaning the object was larger than
it needed to be), but making it such a large object increased the
likelihood of hitting this send-buffer-full deadlock!
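A minimal sketch of the non-blocking send pattern described in point 1 (names and structure are illustrative, not the actual Airflow code):
```python
import multiprocessing
import os

# pipe between the manager (child) and the scheduler (parent)
parent_conn, child_conn = multiprocessing.Pipe()

# make writes on the manager side non-blocking
os.set_blocking(child_conn.fileno(), False)


def try_send_stat(conn, stat):
    """Send a parsing stat, skipping it if the send buffer is full."""
    try:
        conn.send(stat)
    except BlockingIOError:
        # buffer full: drop this stat and try again on the next loop
        pass
```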
(cherry picked from commit b0e68ebcb8)
We do a lot of path manipulation in this test file, and it's easier to
understand by using pathlib without all the nested `os.path.*` calls.
This change adds "support" for passing Path objects to DagBag and
util functions.
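A minimal sketch of what this enables (the DAG folder path is illustrative):
```python
from pathlib import Path

from airflow.models import DagBag

# a Path object can now be passed where a string path was previously expected
dag_bag = DagBag(dag_folder=Path("dags") / "examples")
print(dag_bag.dag_ids)
```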
(cherry picked from commit 6e99ae0564)
This was mistakenly removed in the HA scheduler refactor work.
It is now added back, and has tests this time so we will notice if it
breaks in future.
By using freezegun we can assert the _exact_ value of the metric emitted
to make sure it also has the correct value, without introducing
timing-based flakiness.
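An illustrative sketch of the freezegun technique (the metric name and mock stats object are stand-ins, not the actual Airflow test):
```python
import datetime
from unittest import mock

from freezegun import freeze_time

stats = mock.Mock()

with freeze_time("2021-01-01 00:00:00") as frozen_time:
    start = datetime.datetime.utcnow()
    # advance the frozen clock by exactly 5 seconds
    frozen_time.tick(delta=datetime.timedelta(seconds=5))
    stats.timing("example_timer_metric", datetime.datetime.utcnow() - start)

# the duration is deterministic, so we can assert the exact value
stats.timing.assert_called_once_with(
    "example_timer_metric", datetime.timedelta(seconds=5)
)
```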
(cherry picked from commit ca4c4f3d34)
i.e. to not support filtering by the 'conf' column in the DagRun view.
This cannot be supported because FAB uses ILIKE under the hood,
which is not supported by 'bytea' type in Postgres or 'BLOB' in SQLite.
Closes issue #14374
(cherry picked from commit 3585b3c54c)
This document used the now-deprecated import for "task";
this updates it to come from `airflow.decorators`
so it won't raise a DeprecationWarning if copied and used.
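For reference, a minimal sketch of the non-deprecated import:
```python
from airflow.decorators import task


@task
def extract():
    ...
```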
(cherry picked from commit 1521b96657)
Currently the default value for the namespace is always 'default'.
However, `conf.get('kubernetes', 'namespace')` may be a more appropriate
default value for the namespace in this case.
(cherry picked from commit b8cf46a12f)
Clarify that the `delete_worker_pods_on_failure` flag only applies to worker failures, not task failures as well.
(cherry picked from commit 7c2ed5394e)
I found this when investigating why the delete_worker_pods_on_failure flag wasn't working. The feature has sufficient test coverage, but doesn't fail simply because the strings have the same id when running in the test suite, which is exactly what happens in practice.
flake8/pylint also don't seem to raise their respective failures unless one side is literally a literal string, even though typing is applied 🤷‍♂️.
I fixed 2 other occurrences I found while I was at it.
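An illustrative sketch of the pitfall (not the actual Airflow code): `is` checks object identity, which can happen to pass in a test suite even though the values are merely equal at runtime:
```python
# illustration only -- comparing strings by identity vs equality
failure_state = "failed"
state_from_elsewhere = "".join(["fail", "ed"])  # equal value, separate object

print(state_from_elsewhere == failure_state)  # True: value equality
print(state_from_elsewhere is failure_state)  # False: different objects
```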
(cherry picked from commit 6d30464319)
Any schedulers depending on the queue functionality that haven't overridden
the `trigger_tasks` method will see the queue functionality break when upgrading to 2.0.
(cherry picked from commit 375d26d880)
This makes a handful of bigger queries instead of many queries when
syncing the default Airflow roles. On my machine with 5k DAGs, this led
to a reduction of 1 second in startup time (bonus, makes tests faster
too).
(cherry picked from commit 1627323a19)
This fixes a short circuit in `create_dag_specific_permissions` to
avoid needlessly querying permissions for every single DAG, and changes
`get_all_permissions` to run 1 query instead of many.
With ~5k DAGs, these changes speed up `create_dag_specific_permissions`
by more than 65 seconds each call (on my machine), and since that method
is called twice before the webserver actually responds to requests, this
effectively speeds up the webserver startup by over 2 minutes.
(cherry picked from commit 35fbb72649)
If you have `from airflow.contrib.operators.emr_add_steps_operator
import EmrAddStepsOperator` line in your DAG file, you get three
warnings for this one line
```
/home/ash/airflow/dags/foo.py:3 DeprecationWarning: This module is deprecated.
/home/ash/airflow/dags/foo.py:3 DeprecationWarning: This package is deprecated. Please use `airflow.operators` or `airflow.providers.*.operators`.
/home/ash/airflow/dags/foo.py:3 DeprecationWarning: This module is deprecated. Please use `airflow.providers.amazon.aws.operators.emr_add_steps`.
```
All but the last are unhelpful.
* Upgrades moto to newer version (~=2.0)
According to https://github.com/spulec/moto/issues/3535#issuecomment-808706939
version 1.3.17 of moto with a fix to be compatible with mock > 4.0.3 is
not going to be released because of breaking changes. Therefore we need
to migrate to a newer version of moto.
At the same time we can get rid of the old botocore limitation, which
was apparently added to handle some test errors. We now rely fully
on what boto3 depends on.
Upgrading the dependencies also revealed that the MySQL tests needed to
be fixed, because the upgraded dependencies caused some test
failures (those turned out to be badly written tests).
* Adds dill exclusion to Dockerfiles to accommodate the upcoming beam fix
With the upcoming apache-beam change where the mock library will be
removed from the install dependencies, we will be able to remove
the `apache-beam` exclusion in our CI scripts. This will be the final
step of cleaning dependencies so that we have a truly
golden set of constraints that allows installing airflow
and all community-managed providers (we managed to fix those
dependency issues for all packages but apache-beam).
The fix https://github.com/apache/beam/pull/14328, once merged
and released in Apache Beam, will allow us to migrate to the new
version and get rid of the CI exclusion for beam.
Closes: #14994
(cherry picked from commit ec962b01b7)
According to https://github.com/spulec/moto/issues/3535#issuecomment-808706939
version 1.3.17 of moto with a fix to be compatible with mock > 4.0.3 is
not going to be released because of breaking changes. Therefore we need
to migrate to a newer version of moto.
At the same time we can get rid of the old botocore limitation, which
was apparently added to handle some test errors. We now rely fully
on what boto3 depends on.
Upgrading the dependencies also revealed that the MySQL tests needed to
be fixed, because the upgraded dependencies caused some test
failures (those turned out to be badly written tests).
(cherry picked from commit e8aa3de4bb)
Documentation update for the four previously excluded providers that
got extra fixes and version bumps to the latest versions of the libraries:
* apache.beam
* apache.druid
* microsoft.azure
* snowflake
(cherry picked from commit b753c7fa60)
The key used to remove a task from executor.running is reconstituted
from pod annotations, so make sure the full dag_id and task_id are in
the annotations.
(cherry picked from commit b5e7ada345)
* Fixes a problem with two different files md5summed under the same name
When we check whether we should rebuild the image, we check if the
md5sum of some important files changed - which triggers the
question whether to rebuild the image or not (because of
changed dependencies which need to be installed). This
happens for example when package.json or yarn.lock changes.
Previously, all the important files had distinct names, so
we stored the md5 hashes of those files as just filename + .md5sum,
flattened into a single directory. Unfortunately,
as of #14927 (merged with a failing build) we had two package.json
and two yarn.lock files, and that caused the md5 hash of one to be
overwritten by the other. This triggered unnecessary rebuilding of the
image in the CI part, which resulted in failure (because of the Apache
Beam dependency problem).
This PR fixes it by adding the parent directory to the name of
the md5sum file (so we now have www-package.json and ui-package.json).
Those important files change very rarely, so this incident
should not happen again, but we added some comments to prevent
it.
* Update scripts/ci/libraries/_initialization.sh
Co-authored-by: Felix Uellendall <feluelle@users.noreply.github.com>
(cherry picked from commit 775ee51d0e)
The change #14911 had a bug - when PYTHON_MINOR_MAJOR_VERSION
was removed from the image args, the replacement `python -m site`
expression missed the `/airflow/` suffix. Unfortunately it was not
flagged as an error, because the recompiling script silently
skipped the recompilation step in that case.
This change:
* fixes the error
* removes the silent-skipping if-check (the recompilation will
fail in case it is wrongly set)
* adds a check at image verification that dist/manifest.json is
present.
Fixes: #14991
This PR skips building Provider packages for branches other
than master. Provider packages are always released from master
and never from any other branch, so there is no point in running
the package building and tests there.
(cherry picked from commit df368f17df361af699dc868af9481ddc3abf0416)