The region parameter is required for some of the Google Dataproc operators
and it should be provided by users explicitly, to avoid creating
data-intensive tasks in an unintended default location.
We've observed the tests for the last couple of weeks and it seems
most of the tests marked with the "quarantine" marker are succeeding
in a stable way (https://github.com/apache/airflow/issues/10118).
The removed tests have a success ratio of > 95% (20 runs without
problems), and this was verified a week ago as well,
so it seems they are rather stable.
There are literally a few that are either failing or causing
the Quarantined builds to hang. I manually reviewed the
master tests that failed over the last few weeks and added the
tests that are causing the build to hang.
It seems that stability has improved - which might be caused
by some temporary problems present when we marked the quarantined builds,
or by a too "generous" way of marking tests as quarantined, or
maybe the improvement comes from #10368, as the docker engine
and machines used to run the builds in GitHub experience far
less load (image builds are executed in separate builds), so
resource usage might be decreased. Another reason
might be GitHub Actions stability improvements.
Or simply those tests are more stable when run in isolation.
We might still add failing tests back as soon as we see them behaving
in a flaky way.
The remaining quarantined tests that need to be fixed:
* test_local_run (often hangs the build)
* test_retry_handling_job
* test_clear_multiple_external_task_marker
* test_should_force_kill_process
* test_change_state_for_tis_without_dagrun
* test_cli_webserver_background
We also move some of those tests to the "heisentests" category.
Those tests run fine in isolation but fail
the builds when run with all other tests:
* TestImpersonation tests
We might find that those heisentests can be fixed, but for
now we are going to run them in isolation.
Also - since those quarantined tests are failing more often,
the "num runs" to track for them has been decreased to 10,
to keep track of the last 10 runs only.
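For context, a minimal sketch of how a test ends up in this category,
assuming the pytest marker is literally named `quarantined` (the CI jobs
then select or deselect it with `-m`):

```python
import pytest


@pytest.mark.quarantined
def test_sometimes_flaky_scheduler_behaviour():
    # Runs only in the dedicated "Quarantined" CI job, selected with:
    #   pytest -m quarantined
    # and excluded from the normal test suites with:
    #   pytest -m "not quarantined"
    assert True
```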
DataprocCreateCluster now requires:
- cluster config
- cluster name
- project id
This way users don't have to pass project_id twice
(in the cluster definition and as a parameter). The cluster object
is built in the hook's create_cluster method.
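A hedged sketch of the resulting call, with placeholder values; the
Cluster object is assembled from these pieces inside the hook, so
project_id is no longer repeated in the cluster definition:

```python
from airflow.providers.google.cloud.operators.dataproc import (
    DataprocCreateClusterOperator,
)

create_cluster = DataprocCreateClusterOperator(
    task_id="create_cluster",
    project_id="my-project",        # placeholder
    cluster_name="etl-cluster",     # placeholder
    region="europe-west1",          # explicit region, no implicit default location
    cluster_config={
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
    },
)
```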
BATS has additional libraries of asserts that make it much more
straightforward and pleasant to write tests for bash scripts.
There is no Dockerfile from BATS that contains those, so we
had to build our own (but it follows the same structure
as #9652 - where we keep our dev docker image
sources inside our repository and the generated docker images
in the "apache/airflow:<tool>-CALVER-TOOLVER" format).
We have more BATS unit tests to add - following #10576 -
and this change will be of great help.
If we run this test
(TestTriggerRuleDep::test_get_states_count_upstream_ti specifically)
more than once without clearing the DB in between, it fails due to a
unique constraint violation.
The `@provide_session` wrapper will already commit the transaction when
returned, unless an explicit session is passed in -- removing this
parameter changes the behaviour to be:
- If session explicitly passed in: don't commit (caller's
responsibility)
- If no session passed in, `@provide_session` will commit for us already.
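A hedged sketch of the two code paths described above (function and
model names are illustrative, not the actual test code):

```python
from airflow.utils.session import provide_session


@provide_session
def set_state(task_instance, state, session=None):
    # Illustrative body - updates a row via the supplied session.
    task_instance.state = state
    session.merge(task_instance)


# Path 1: caller owns the session, so committing is the caller's responsibility:
#   with create_session() as session:
#       set_state(ti, "success", session=session)
#
# Path 2: no session passed in - @provide_session creates one and commits on return:
#   set_state(ti, "success")
```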
Add jupytercmd in the Qubole Operator, which fires a JupyterNotebookCommand to the Jupyter notebooks running on the user's QDS account. Along with this, we have fixed a minor bug that caused tasks to fail when --notify is set in the Qubole Operator.
Co-authored-by: Aaditya Sharma <asharma@qubole.com>
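A hypothetical usage sketch, assuming the new command type is selected
via command_type="jupytercmd"; the notebook-related keyword arguments
below are placeholders, not confirmed parameter names:

```python
from airflow.providers.qubole.operators.qubole import QuboleOperator

run_notebook = QuboleOperator(
    task_id="run_notebook",
    command_type="jupytercmd",      # new command type added by this change
    qubole_conn_id="qubole_default",
    # The kwargs below are illustrative placeholders for the notebook
    # location and target cluster; check the QDS SDK for the exact names.
    path="notebooks/daily_report.ipynb",
    cluster_label="default",
    notify=True,                    # the --notify flag whose bug was fixed here
)
```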
Inspired by the Google Shell Guide, which mentions
separating package names with ::, I realized that this was
one of the missing pieces in our bash scripts.
While we already had packages (in the libraries folders),
it's been difficult to tell which function lives where.
By introducing packages - equal to the library file name -
we are *almost* at the level of a structured language, and
it's easier to find the functions you are looking for.
Way easier, in fact.
Part of #10576
(cherry picked from commit cc551ba793)
(cherry picked from commit 2bba276f0f06a5981bdd7e4f0e7e5ca2fe84f063)
* Implement Google Shell Conventions for breeze script … (#10651)
Part of #10576
First (and the biggest) of the series of commits to introduce
Google Shell Conventions in our bash scripts.
This one covers breeze - about the biggest and most complex
script we have - so it is rather huge, but it is difficult to split it
into smaller pieces.
The rules implemented (from the conventions):
* constants and exported variables are CAPITALIZED, whereas
local/temporary variables are lowercase
* following the shell guide, once all the variables are set to their
final values (either from exported variables, calculation or --switches),
there is a single function that makes all the variables read-only. That
helped to clean up a lot of places where the same functions were called
several times, or where variables were defined in a few places. Now the
behavior should be rather consistent and we should easily catch such
duplications
* function headers (following the guide) explaining the arguments,
variables expected, and variables modified by the functions
* setting the variables as read-only also helped to clean up the "ifs"
where we often had ":=}" in variables and != "" or == "" comparisons. Those
are replaced with `=}`, and the tests are replaced with `-n` and `-z` - also
following the shell guide (readonly helped to detect and clean up all
such cases). This should also be much more robust in the future.
* reorganized initialization of those constants and variables - simplified
a few places where initialization was overlapping. It should be much more
straightforward and clean now
* a number of internal function breeze variables are "local" - this
helps avoid accidental variable overwriting and keeps things localized
* trap_add function is separated out to help in cases where we had
several traps handling the same signals.
(cherry picked from commit 46c8d6714c)
(cherry picked from commit c822fd7b4bf2a9c5a9bb3c6e783cbea9dac37246)
* fixup! Implement Google Shell Conventions for breeze script … (#10651)
* Revert "Add packages to function names in bash (#10670)"
This reverts commit cc551ba793.
* Revert "Implement Google Shell Conventions for breeze script … (#10651)"
This reverts commit 46c8d6714c.
Inspired by the Google Shell Guide, which mentions
separating package names with ::, I realized that this was
one of the missing pieces in our bash scripts.
While we already had packages (in the libraries folders),
it's been difficult to tell which function lives where.
By introducing packages - equal to the library file name -
we are *almost* at the level of a structured language, and
it's easier to find the functions you are looking for.
Way easier, in fact.
Part of #10576
Part of #10576
First (and the biggest) of the series of commits to introduce
Google Shell Conventions in our bash scripts.
This one covers breeze - about the biggest and most complex
script we have - so it is rather huge, but it is difficult to split it
into smaller pieces.
The rules implemented (from the conventions):
* constants and exported variables are CAPITALIZED, whereas
local/temporary variables are lowercase
* following the shell guide, once all the variables are set to their
final values (either from exported variables, calculation or --switches),
there is a single function that makes all the variables read-only. That
helped to clean up a lot of places where the same functions were called
several times, or where variables were defined in a few places. Now the
behavior should be rather consistent and we should easily catch such
duplications
* function headers (following the guide) explaining the arguments,
variables expected, and variables modified by the functions
* setting the variables as read-only also helped to clean up the "ifs"
where we often had ":=}" in variables and != "" or == "" comparisons. Those
are replaced with `=}`, and the tests are replaced with `-n` and `-z` - also
following the shell guide (readonly helped to detect and clean up all
such cases). This should also be much more robust in the future.
* reorganized initialization of those constants and variables - simplified
a few places where initialization was overlapping. It should be much more
straightforward and clean now
* a number of internal function breeze variables are "local" - this
helps avoid accidental variable overwriting and keeps things localized
* trap_add function is separated out to help in cases where we had
several traps handling the same signals.
* Updated REST API call so GET requests pass payload in query string instead of request body
* Updated comparisons to use `in` to follow better standards
* Added whitespace for pylint failure
* Update Databricks hooks tests to reflect new payload
* Fixed trailing whitespace in unit test
Co-authored-by: Steven Yu <steven@databricks.com>
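A minimal sketch of the difference, using the generic `requests` library
rather than the exact hook code (endpoint and payload are placeholders):

```python
import requests

endpoint = "https://example.cloud.databricks.com/api/2.0/jobs/runs/get"
payload = {"run_id": 42}

# Before: GET with the payload in the request body - some servers/proxies ignore it.
#   requests.get(endpoint, json=payload)

# After: GET with the payload encoded in the query string.
response = requests.get(endpoint, params=payload)
```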
We have already fixed a lot of problems that were marked
with those; IntelliJ has also gotten a bit smarter about not
detecting false positives, as well as understanding more
pylint annotations. Wherever the problem remained
we replaced it with # noqa comments - as these are
also well understood by IntelliJ.
Perf_kit was a separate folder and it was a problem when we tried to
build it from Docker-embedded sources, because there was a hidden,
implicit dependency between tests (conftest) and perf.
Perf_kit is now moved to tests, to be available in the CI image
also when we run tests without the sources mounted.
This is changing back in #10441 and we need to move perf_kit
for it to work.
* fix: 🐛 Wrong S3 URI on COPY query
The S3 URI on COPY query was appending the target Redshift table to the
S3 object key.
* test: 💍 Fixed typo on test query
The COPY query that the operator used is the same query the test uses.
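A hedged sketch of the corrected query construction (names and the
credentials clause are placeholders); the point is that the S3 URI must
contain only the object key, not the target table:

```python
schema, table = "public", "events"
s3_bucket, s3_key = "my-bucket", "exports/events.csv"

# Buggy behaviour appended the table to the key, roughly:
#   FROM 's3://my-bucket/exports/events.csv/public.events'
copy_query = f"""
    COPY {schema}.{table}
    FROM 's3://{s3_bucket}/{s3_key}'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy'
    CSV;
"""
```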
- refactor/change azure_container_instance to use AzureBaseHook
- add info to operators-and-hooks-ref.rst
- add howto docs for connecting to azure
- add auth mechanism via json config
- add azure conn type
* Add Amazon SES hook
* Add SES Hook to operators-and-hooks documentation.
* Fix arguments for parent class constructor call (PR feedback)
* Fix indentation in operators-and-hooks documentation
* Fix mypy error for argument on call to parent class constructor
* Simplify logic on constructor (PR feedback)
* Add custom headers and other relevant options to hook
* Change pylint exception rule to apply it only to function instead of module (PR feedback)
* Fix spellcheck error
* Vendorize airflow.utils.email
* fixup! Vendorize airflow.utils.email
Co-authored-by: Kamil Breguła <kamil.bregula@polidea.com>
* Make Kubernetes tests pass locally
Currently the Kubernetes tests all pass only within Breeze.
This PR makes them read the local path so they can pass on any
system.
* static tests
This allows for all the kinds of verbosity we want, including
writing outputs to output files, and it also works out-of-the-box
in non-interactive git-commit shell scripts. As a side effect
we also have mocked tools in the bats tests, which will allow us to write
more comprehensive unit tests for our bash scripts
(this is a long overdue task).
Part of #10368
Using the parameterized library, add unit test coverage
for JenkinsJobTriggerOperator parameters, covering parameters
as strings or as a list of strings.
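A hedged sketch of the pattern (test name and parameter values are
illustrative, not the shipped test):

```python
import unittest

from parameterized import parameterized

from airflow.providers.jenkins.operators.jenkins_job_trigger import (
    JenkinsJobTriggerOperator,
)


class TestJenkinsJobTriggerOperator(unittest.TestCase):
    @parameterized.expand([
        ("string_param", "--verbose"),
        ("list_param", ["--verbose", "--debug"]),
    ])
    def test_operator_accepts_parameters(self, _, parameters):
        # The first tuple element only names the generated test case;
        # the second is passed in as the `parameters` argument.
        operator = JenkinsJobTriggerOperator(
            task_id="trigger_job",
            job_name="a_job_on_jenkins",
            jenkins_connection_id="jenkins_default",
            parameters=parameters,
        )
        assert operator.parameters == parameters
```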
* Extract get_job_state and fix poke of AwsGlueJobSensor
* Save hook and reuse in GlueJobSensor
* Add descriptions for some functions
* Fix tests according to changed function definition
* Fix too long line
* Add type hints and apply review
* Fix type error
Co-authored-by: JB Lee <jb.lee@sendbird.com>
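A hedged sketch of the shape of the fix - cache the hook and reuse the
extracted get_job_state in poke (names follow the commit description;
treat the module path, class names and signatures as assumptions):

```python
# Illustrative-only sketch, not the actual provider code.
class GlueJobSensorSketch:
    def __init__(self, job_name, run_id, aws_conn_id="aws_default"):
        self.job_name = job_name
        self.run_id = run_id
        self.aws_conn_id = aws_conn_id
        self._hook = None  # hook is created once and reused across pokes

    def get_hook(self):
        if self._hook is None:
            from airflow.providers.amazon.aws.hooks.glue import AwsGlueJobHook
            self._hook = AwsGlueJobHook(aws_conn_id=self.aws_conn_id)
        return self._hook

    def poke(self, context):
        # get_job_state is the helper extracted into the hook by this change.
        state = self.get_hook().get_job_state(self.job_name, self.run_id)
        return state == "SUCCEEDED"
```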
We run this on Webserver startup, and when DAG Serialization is enabled we expect that no files are required - but because of this bug the files were still looked for.
This change will allow users to throw other exceptions (namely `AirflowClusterPolicyViolation`) than `DagCycleException` as part of Cluster Policies.
This can be helpful for running checks on tasks / DAGs (e.g. asserting that a task has a non-airflow owner) and refusing to run tasks that aren't compliant with these checks.
This is meant more as a tool for airflow admins to prevent user mistakes (especially in shared Airflow infrastructure with newbies) than as a strong technical control for security/compliance posture.
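A hedged sketch of what such a policy might look like in
airflow_local_settings.py; the owner check is an illustrative example,
and the hook name (policy vs task_policy) depends on the Airflow version:

```python
from airflow.exceptions import AirflowClusterPolicyViolation


def task_policy(task):
    """Reject tasks that keep the default 'airflow' owner."""
    if task.owner == "airflow":
        raise AirflowClusterPolicyViolation(
            f"Task {task.task_id} in DAG {task.dag_id} must set a non-default owner."
        )
```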
While doing a trigger_dag from the UI, the DagRun gets created first and then the WebServer starts creating TIs. Meanwhile, the Scheduler also picks up the DagRun and starts creating the TIs, which results in an IntegrityError as the primary key constraint gets violated. This happens when a DAG has a large number of tasks.
Also, replace the TIs array with a set for faster lookups for DAGs with many tasks.
* Pylint checks should be way faster now
Instead of running separate pylint checks for tests and the main sources,
we now run a single check. This is possible thanks to a
nice hack - we have a pylint plugin that injects the right
"# pylint: disable=" comment for all test files while the file
content is read by astroid (just before tokenization).
Thanks to that we can also separate out the pylint checks
into a separate job in CI - this way all pylint checks will
run in parallel with all other checks, effectively halving
the time needed to get static check feedback and potentially
cancelling other jobs much faster.
* fixup! Pylint checks should be way faster now
The DbApiHook allows for a conn_name_attr to be changed in subclasses,
however SqliteHook's `get_conn` method always calls the main class
attribute. Find the correct attribute and use it to establish the
connection.
Allow attr setting outside init for the test case.
Closes #10147
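A hedged sketch of the corrected lookup - resolve the connection id
through the class's conn_name_attr instead of hard-coding the attribute
(import path as of the time of this change; the class body is illustrative,
not the shipped hook):

```python
import sqlite3

from airflow.hooks.dbapi_hook import DbApiHook


class SqliteHookSketch(DbApiHook):
    """Illustrative only - mirrors the shape of the fix."""

    conn_name_attr = "sqlite_conn_id"
    default_conn_name = "sqlite_default"

    def get_conn(self):
        # Look up whichever attribute the (sub)class declares as conn_name_attr,
        # rather than always reading a fixed attribute on the base class.
        conn_id = getattr(self, self.conn_name_attr)
        airflow_conn = self.get_connection(conn_id)
        return sqlite3.connect(airflow_conn.host)
```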
Understanding that it is an attribute name, which could have downstream
consequences, correct the spelling of max_retries and reword some of the
docstring.
In Spark 3 the exit code is logged with a lowercase
'e', while Spark 2 used an uppercase 'E'.
Also made the exception a bit clearer when running
on Kubernetes.
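A minimal sketch of a match that accepts both spellings (the exact
pattern used in the hook may differ):

```python
import re

EXIT_CODE_PATTERN = re.compile(r"[eE]xit code: (\d+)")

for line in ["Exit code: 1", "exit code: 1"]:  # Spark 2 vs Spark 3 style
    match = EXIT_CODE_PATTERN.search(line)
    if match and int(match.group(1)) != 0:
        print(f"Spark job failed with exit code {match.group(1)}")
```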
Support for getting current context at any code location that runs
under the scope of BaseOperator.execute function. This functionality
is part of AIP-31.
Co-authored-by: Jonathan Shir <jonathan.shir@databand.ai>
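A hedged usage sketch (the import path shown is the one that eventually
shipped; at the time of the change it may have lived elsewhere):

```python
from airflow.operators.python import get_current_context


def business_logic():
    # Works anywhere in the call stack beneath BaseOperator.execute,
    # without threading the context through every function signature.
    context = get_current_context()
    print(f"Running for execution date {context['ds']}")
```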
Before this change, if DAG Serialization was enabled the Webserver would not update the DAGs once they were fetched from the DB. The default worker_refresh_interval was `30`, so whenever the gunicorn workers were restarted they pulled the updated DAGs as needed.
This change will allow us to have a larger worker_refresh_interval (e.g. 30 mins or even 1 day).
We should not update the "last_updated" column unnecessarily. This is the first of a few optimizations to DAG Serialization that will also aid DAG Versioning.
Documentation for S3FileTransformOperator states that users
can skip the transformation script if an S3 Select expression is
specified, but in this case the created file is always
zero bytes long.
This fix changes the behaviour so that, when no transformation
is given, the source file (the result of S3 Select) is uploaded.
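A hedged example of the now-working combination - an S3 Select
expression with no transform script (bucket/key values and the
expression are placeholders):

```python
from airflow.providers.amazon.aws.operators.s3_file_transform import (
    S3FileTransformOperator,
)

select_only = S3FileTransformOperator(
    task_id="filter_rows",
    source_s3_key="s3://source-bucket/input.csv",
    dest_s3_key="s3://dest-bucket/filtered.csv",
    select_expression="SELECT s.* FROM s3object s WHERE s._1 = 'active'",
    # No transform_script: the S3 Select result itself is now uploaded,
    # instead of a zero-byte file.
    replace=True,
)
```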
* Allow `replace` flag in gcs_to_gcs operator.
If we are not replacing, list all files in the destination GCS bucket and copy only those files which are present in the source GCS bucket but not in the destination GCS bucket.
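A hedged example (class and argument names as in the Google provider;
bucket and object values are placeholders):

```python
from airflow.providers.google.cloud.transfers.gcs_to_gcs import GCSToGCSOperator

sync_new_files_only = GCSToGCSOperator(
    task_id="sync_new_files_only",
    source_bucket="source-bucket",
    source_object="data/*.csv",
    destination_bucket="dest-bucket",
    destination_object="data/",
    replace=False,  # skip objects that already exist in the destination bucket
)
```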
Otherwise at large scale this can end up with some tasks failing as they
try to create the result table at the same time.
This was always possible before, just exceedingly rare, but in large
scale performance testing where I create a lot of tasks quickly
(especially in my HA testing) I hit this a few times.
This is also only a problem for fresh installs/clean DBs, as once these
tables exist the possible race goes away.
This is the same fix from #8909, just for runtime, not test time.