* Refactor Kubernetes operator with git-sync
Currently the implementation of git-sync is broken because:
- git-sync clones the repository in /tmp and not in the airflow-dags volume
- git-sync adds a symlink pointing to the required revision, but it is not
taken into account in AIRFLOW__CORE__DAGS_FOLDER
A DAGs/logs hostPath volume has been added (needed if Airflow runs on
Kubernetes in a local environment)
To avoid false positives in CI, `load_examples` is set to `False`,
otherwise DAGs from `airflow/example_dags` are always loaded. This
makes it possible to test `import` errors in DAGs
Remove `worker_dags_folder` config:
`worker_dags_folder` is redundant and can lead to confusion.
In WorkerConfiguration, `self.kube_config.dags_folder` defines the path of
the DAGs and can be set in the worker using `airflow_configmap`
Refactor worker_configuration.py
Use a docker container to run setup.py
Compile web assets
Fix codecov application path
* Fix kube_config.dags_in_image
Add an operator to create a Docker container in Azure Container
Instances. Azure Container Instances hosts a container and abstracts
away the infrastructure around orchestration of a container service.
The operator supports creating an ACI container and pulling an image from
Azure Container Registry or public Docker registries.
- Fixed Documentation in integration.rst
- Fixed Incorrect type in docstring of `AzureCosmosInsertDocumentOperator`
- Added the Hook, Sensor and Operator in code.rst
- Updated the name of example DAG and its filename to follow the convention
Add an operator and hook to manipulate and use Azure
CosmosDB documents, including creating, deleting, and
updating documents and collections.
Includes a sensor to detect documents being added to a
collection.
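A hedged usage sketch for the CosmosDB operator mentioned above; the module path and the parameter names (`database_name`, `collection_name`, `document`, `azure_cosmos_conn_id`) are assumptions for illustration, not confirmed by this changelog:
```
# Hypothetical usage sketch; module path and parameter names are assumptions.
from datetime import datetime
from airflow import DAG
from airflow.contrib.operators.azure_cosmos_operator import AzureCosmosInsertDocumentOperator

dag = DAG('example_azure_cosmos', start_date=datetime(2018, 1, 1), schedule_interval=None)

insert_doc = AzureCosmosInsertDocumentOperator(
    task_id='insert_cosmos_document',
    database_name='airflow_example_db',      # assumed parameter name
    collection_name='airflow_example_coll',  # assumed parameter name
    document={'id': 'example-1', 'temperature': 21},
    azure_cosmos_conn_id='azure_cosmos_default',
    dag=dag,
)
```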
* Better instructions for airflow flower
It is not clear in the documentation that you need to have flower installed to successfully run `airflow flower`. If you don't have flower installed, running `airflow flower` shows the following error, which is not of much help:
```
airflow flower
[2018-11-20 17:01:14,836] {__init__.py:51} INFO - Using executor SequentialExecutor
Traceback (most recent call last):
  File "/mnt/secondary/workspace/f4/typo-backend/pipelines/model-pipeline/airflow/bin/airflow", line 32, in <module>
    args.func(args)
  File "/mnt/secondary/workspace/f4/typo-backend/pipelines/model-pipeline/airflow/lib/python3.6/site-packages/airflow/utils/cli.py", line 74, in wrapper
    return f(*args, **kwargs)
  File "/mnt/secondary/workspace/f4/typo-backend/pipelines/model-pipeline/airflow/lib/python3.6/site-packages/airflow/bin/cli.py", line 1221, in flower
    broka, address, port, api, flower_conf, url_prefix])
  File "/mnt/secondary/workspace/f4/typo-backend/pipelines/model-pipeline/airflow/lib/python3.6/os.py", line 559, in execvp
    _execvpe(file, args)
  File "/mnt/secondary/workspace/f4/typo-backend/pipelines/model-pipeline/airflow/lib/python3.6/os.py", line 604, in _execvpe
    raise last_exc.with_traceback(tb)
  File "/mnt/secondary/workspace/f4/typo-backend/pipelines/model-pipeline/airflow/lib/python3.6/os.py", line 594, in _execvpe
    exec_func(fullname, *argrest)
FileNotFoundError: [Errno 2] No such file or directory
```
* Update use-celery.rst
The current `airflow flower` doesn't come with any authentication.
This may expose essential information in an untrusted environment.
This commit adds support for HTTP basic authentication to Airflow Flower.
Ref:
https://flower.readthedocs.io/en/latest/auth.html
This adds ASF license headers to all the .rst and .md files with the
exception of the Pull Request template (as that is included verbatim
when opening a Pull Request on Github which would be messy)
[AIRFLOW-2780] Add IMAP Hook to retrieve email attachments
- Add has_mail_attachments to check if there are mail attachments in the given mailbox with the given attachment name
- Add retrieve_mail_attachments to download the attachments to a local directory
- Add some test cases but more are coming
- Add license header
- Change retrieve_mail_attachments to download_mail_attachments (see the usage sketch after this list)
- Add retrieve_mail_attachments that returns a list of tuples containing the attachments found
- Change IMAP4_SSL close() method to be called after retrieving the attachments and not before logging out
- Change test_connect to not check for close method because no mail folder will be opened when only connecting
- Add some test cases that are still in WIP
- Fix a bug causing multiple attachments in a single mail not being correctly added to the list of all mail attachments
- Fix a bug where MailPart is_attachment always returns None
- Add logging when an attachment has been found that matches the name
- Add more test cases with sample mail
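A hedged usage sketch for the hook described above: the methods `has_mail_attachments` and `download_mail_attachments` come from this changelog, while the module path, the connection id, and the argument names are assumptions; connection handling is simplified.
```
# Hedged sketch; module path, conn id, and argument names are assumptions.
from airflow.contrib.hooks.imap_hook import ImapHook

imap_hook = ImapHook(imap_conn_id='imap_default')

# Check whether any attachment named 'report.csv' exists in INBOX,
# then download the matching attachments to a local directory.
if imap_hook.has_mail_attachments(name='report.csv', mail_folder='INBOX'):
    imap_hook.download_mail_attachments(
        name='report.csv',
        mail_folder='INBOX',
        local_output_directory='/tmp/attachments',
    )
```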
This re-works the SageMaker functionality in Airflow to be more complete, and more useful for the kinds of operations that SageMaker supports.
We removed some files and operators here, but these were only added after the last release so we don't need to worry about any sort of back-compat.
Add CloudSqlInstanceInsertOperator, CloudSqlInstancePatchOperator and CloudSqlInstanceDeleteOperator.
Each operator includes:
- core logic
- input params validation
- unit tests
- presence in the example DAG
- docstrings
- How-to and Integration documentation
Additionally, small improvements to GcpBodyFieldValidator were made:
- added simple list validation capability (type="list")
- introduced the parameter allow_empty, which can be set to False
to test for non-emptiness of a string instead of specifying
a regexp.
Co-authored-by: sprzedwojski <szymon.przedwojski@polidea.com>
Co-authored-by: potiuk <jarek.potiuk@polidea.com>
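A hedged sketch of wiring the Cloud SQL operators above into a DAG; the import path, the parameter names (`project_id`, `instance`, `body`), and the request body are assumptions for illustration only:
```
# Hypothetical sketch; import path, parameter names, and body are assumptions.
from datetime import datetime
from airflow import DAG
from airflow.contrib.operators.gcp_sql_operator import (
    CloudSqlInstanceInsertOperator,
    CloudSqlInstanceDeleteOperator,
)

dag = DAG('example_gcp_sql', start_date=datetime(2018, 1, 1), schedule_interval=None)

create_instance = CloudSqlInstanceInsertOperator(
    task_id='create_cloudsql_instance',
    project_id='my-gcp-project',    # assumed parameter name
    instance='example-instance',    # assumed parameter name
    body={                          # Cloud SQL "insert" request body (illustrative)
        'name': 'example-instance',
        'settings': {'tier': 'db-n1-standard-1'},
    },
    dag=dag,
)

delete_instance = CloudSqlInstanceDeleteOperator(
    task_id='delete_cloudsql_instance',
    project_id='my-gcp-project',
    instance='example-instance',
    dag=dag,
)

create_instance >> delete_instance
```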
Once the user has installed the Fernet package, the application enforces setting a valid Fernet key.
This change alters that behavior: an empty Fernet key or the special `no encryption` phrase is now accepted and interpreted as no encryption being desired.
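For reference, a valid Fernet key can be generated with the `cryptography` package; this is standard `cryptography` usage, not something introduced by this change:
```
from cryptography.fernet import Fernet

# Generates a URL-safe base64-encoded 32-byte key suitable for the
# fernet_key setting in airflow.cfg.
print(Fernet.generate_key().decode())
```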
Airflow users that wish to create plugins for the new www_rbac UI
cannot add plugin views or links. This PR fixes that by letting
users specify their plugins for www_rbac while maintaining backwards
compatibility with the existing plugins system.
* [AIRFLOW-3178] Don't mask defaults() function from ConfigParser
ConfigParser (the base class for AirflowConfigParser) expects defaults()
to be a function - so when we re-assign it to be a property some of the
methods from ConfigParser no longer work.
* [AIRFLOW-3178] Correctly escape percent signs when creating temp config
Otherwise we have a problem when we come to use those values.
* [AIRFLOW-3178] Use os.chmod instead of shelling out
There's no need to run another process for a built-in Python function.
This also removes a possible race condition that could make the temporary
config file readable by more than the airflow or run-as user.
The exact behaviour would depend on the umask we run under and the
primary group of our user; likely this would mean the file was readable
by members of the airflow group (which in most cases would be just the
airflow user). To remove any such possibility we chmod the file
before we write to it.
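A minimal sketch of the pattern described above (restrict permissions with the built-in os.chmod before writing, instead of shelling out to chmod); the file name and contents are illustrative, not Airflow's actual code:
```
import os
import stat
from tempfile import mkstemp

# Create the temp config file, restrict it to the owner (0o600) using the
# built-in os.chmod, and only then write the potentially sensitive contents.
fd, path = mkstemp(suffix='.cfg')
os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)
with os.fdopen(fd, 'w') as f:
    f.write('[core]\nsql_alchemy_conn = ...\n')
```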
Add GceInstanceStartOperator, GceInstanceStopOperator and GceSetMachineTypeOperator.
Each operator includes:
- core logic
- input params validation
- unit tests
- presence in the example DAG
- docstrings
- How-to and Integration documentation
Additionally, error checking was added to GceHook for responses that are 200 OK:
some types of errors are only visible in the response's "error" field
while the overall HTTP response is 200 OK.
That is why, apart from checking whether the status is "done", we also check
whether "error" is empty; if it is not, an exception is raised with the error
message extracted from the "error" field of the response.
In this commit we also separated out the Body Field Validator into a
separate module in tools - this way it can be reused across
various GCP operators; it has proven to be usable in at least
two of them now.
Co-authored-by: sprzedwojski <szymon.przedwojski@polidea.com>
Co-authored-by: potiuk <jarek.potiuk@polidea.com>
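A minimal sketch of the check described above (treat an operation as failed when its "error" field is non-empty, even though the HTTP response is 200 OK); the helper `get_operation_status` and the exception type are illustrative, not GceHook's real API:
```
import time

# Illustrative only; the helper name and exception type are not GceHook's real API.
def wait_for_operation_to_complete(get_operation_status, poll_interval=3):
    """Poll a GCE operation; fail if its "error" field is non-empty."""
    while True:
        operation = get_operation_status()  # dict parsed from the API response
        if operation.get('status') == 'DONE':
            error = operation.get('error')
            if error:
                # The HTTP response was 200 OK, but the operation itself failed.
                raise RuntimeError('GCE operation failed: {}'.format(error))
            return operation
        time.sleep(poll_interval)
```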
There were a few more "password" config options added over the last few
months that didn't have _cmd options. Any config option that is a
password should be able to be provided via a _cmd version.
To clarify installation instructions for the google auth backend, add an
install group to `setup.py` that installs the google auth dependencies via
`pip install apache-airflow[google_auth]`.
The ProxyFix middleware should only be used when airflow is running
behind a trusted proxy. This patch adds a `USE_PROXY_FIX` flag that
defaults to `False`.
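For context, a hedged sketch of what such a flag typically gates in a Flask/WSGI app, using Werkzeug's ProxyFix middleware; the wiring and variable names here are illustrative, not Airflow's actual code:
```
# Illustrative wiring only; not Airflow's actual code.
from flask import Flask
from werkzeug.middleware.proxy_fix import ProxyFix

USE_PROXY_FIX = False  # only enable when running behind a trusted reverse proxy

app = Flask(__name__)
if USE_PROXY_FIX:
    # Trust the X-Forwarded-* headers set by the reverse proxy.
    app.wsgi_app = ProxyFix(app.wsgi_app, x_for=1, x_proto=1, x_host=1)
```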
Both Deploy and Delete operators interact with Google
Cloud Functions to manage functions. Both are idempotent
and make use of GcfHook - a hook that encapsulates
communication with GCP over the GCP API.
The installation instructions failed to mention how to proceed with the GPL dependency. For those who are not concerned by GPL, it is useful to know how to proceed with it.
The Adhoc Queries and Charts features are no longer supported in the new
FAB-based webserver and UI, but this is not mentioned at all in the doc
"Data Profiling" (https://airflow.incubator.apache.org/profiling.html).
This commit adds a note to remind users of this.
1. Copying:
Under the hood, it's `boto3.client.copy_object()`.
It can only handle the situation in which the
S3 connection used can access both source and
destination bucket/key.
2. Deleting:
2.1 Under the hood, it's `boto3.client.delete_objects()`.
It supports deleting either a single object or
multiple objects.
2.2 If users try to delete a non-existent object, the
request will still succeed, but there will be an
'Errors' entry in the response. Other reasons may
cause similar 'Errors' (the request itself succeeds
without an explicit exception), so an argument
`silent_on_errors` is added to let users decide
whether this sort of 'Errors' should fail the operator.
The corresponding methods are added into S3Hook, and
these two operators are 'wrappers' of these methods.
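A minimal boto3 sketch of the two underlying calls described above; this is plain boto3 usage, not the new operators' API, and the bucket/key names are illustrative:
```
import boto3

s3 = boto3.client('s3')

# 1. Copying: source and destination must both be reachable with the
#    credentials behind this client.
s3.copy_object(
    Bucket='dest-bucket',
    Key='dest/key.csv',
    CopySource={'Bucket': 'src-bucket', 'Key': 'src/key.csv'},
)

# 2. Deleting: the request as a whole succeeds, but per-key failures
#    are reported under the 'Errors' entry of the response.
response = s3.delete_objects(
    Bucket='dest-bucket',
    Delete={'Objects': [{'Key': 'dest/key.csv'}, {'Key': 'old/key.csv'}]},
)
if response.get('Errors'):
    raise RuntimeError('Some keys failed to delete: {}'.format(response['Errors']))
```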
Tasks can have start_dates or end_dates separately
from the DAG. These need to be converted to UTC otherwise
we cannot use them for calculating the next execution
date.
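A minimal illustration of the conversion involved, using only the standard library (not the actual scheduler code):
```
from datetime import datetime, timedelta, timezone

# A task-level start_date given in a local timezone (UTC+2 here, illustrative).
local_tz = timezone(timedelta(hours=2))
task_start = datetime(2018, 10, 1, 8, 0, tzinfo=local_tz)

# Convert to UTC before using it to calculate the next execution date.
task_start_utc = task_start.astimezone(timezone.utc)
print(task_start_utc)  # 2018-10-01 06:00:00+00:00
```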
- Replace airflow PyPI package name with apache-airflow
- Remove unnecessary quotes on pip install commands
- Make install extras commands consistent with pip docs [1] (e.g.,
alpha sort order without spaces or quotes)
[1]: https://pip.pypa.io/en/stable/reference/pip_install/#examples
Executes a task in a Kubernetes pod in the specified Google Kubernetes
Engine cluster. This makes it easier to interact with the GCP Kubernetes
Engine service because it encapsulates acquiring credentials.
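A hedged sketch of what using such an operator might look like; the class name `GKEPodOperator`, its module path, and the parameter names are assumptions for illustration only:
```
# Hypothetical sketch; class name, import path, and parameter names are assumptions.
from datetime import datetime
from airflow import DAG
from airflow.contrib.operators.gcp_container_operator import GKEPodOperator

dag = DAG('example_gke_pod', start_date=datetime(2018, 1, 1), schedule_interval=None)

run_in_gke = GKEPodOperator(
    task_id='run_in_gke_pod',
    project_id='my-gcp-project',     # assumed parameter name
    location='europe-west1-b',       # assumed parameter name
    cluster_name='example-cluster',  # assumed parameter name
    namespace='default',
    image='ubuntu:16.04',
    name='airflow-example-pod',
    dag=dag,
)
```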
The documentation page "Scheduling & Triggers"
only mentioned the CLI method to
manually trigger a DAG run.
However, the manual trigger feature in the Web UI
should be mentioned as well
(it may be even more frequently used by users).
Just like the partition sensor for Hive,
this PR adds a sensor that waits for
a table to be created in a Cassandra cluster.
Closes #3518 from sekikn/AIRFLOW-2640
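A hedged usage sketch for the sensor described above; the class name `CassandraTableSensor`, its module path, and the parameter names are assumptions for illustration only:
```
# Hypothetical sketch; class name, import path, and parameter names are assumptions.
from datetime import datetime
from airflow import DAG
from airflow.contrib.sensors.cassandra_sensor import CassandraTableSensor

dag = DAG('example_cassandra_sensor', start_date=datetime(2018, 1, 1), schedule_interval=None)

wait_for_table = CassandraTableSensor(
    task_id='wait_for_cassandra_table',
    table='my_keyspace.my_table',           # assumed "keyspace.table" format
    cassandra_conn_id='cassandra_default',  # assumed parameter name
    dag=dag,
)
```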
When Airflow was populating a DagBag from a .zip
file, if a single
file in the root directory did not contain the
strings 'airflow' and
'DAG' it would ignore the entire .zip file.
Also added a small amount of logging so as not to
bombard the user with info
about skipping their .py files.
Closes #3505 from Noremac201/dag_name
Add Google Kubernetes Engine create_cluster,
delete_cluster operators
This allows users to use Airflow to create or
delete clusters in
Google Cloud Platform.
Closes #3477 from Noremac201/gke_create
* Updates the GCP hooks to use the google-auth
library and removes
dependencies on the deprecated oauth2client
package.
* Removes inconsistent handling of the scope
parameter for different
auth methods.
Note: using google-auth for credentials requires a
newer version of the
google-api-python-client package, so this commit
also updates the
minimum version for that.
To avoid some annoying warnings about the
discovery cache not being
supported, disable the discovery cache
explicitly, as recommended here:
https://stackoverflow.com/a/44518587/101923
Tested by running:
nosetests
tests/contrib/operators/test_dataflow_operator.py
\
tests/contrib/operators/test_gcs*.py \
tests/contrib/operators/test_mlengine_*.py \
tests/contrib/operators/test_pubsub_operator.py \
tests/contrib/hooks/test_gcp*.py \
tests/contrib/hooks/test_gcs_hook.py \
tests/contrib/hooks/test_bigquery_hook.py
and also tested by running some GCP-related DAGs
locally, such as the
Dataproc DAG example at
https://cloud.google.com/composer/docs/quickstart
Closes #3488 from tswast/google-auth
[AIRFLOW-2526] params can be overridden by the dictionary passed
through `airflow backfill -c`:
```
templated_command = """
echo "text = {{ params.text }}"
"""
bash_operator = BashOperator(
    task_id='bash_task',
    bash_command=templated_command,
    dag=dag,
    params={"text": "normal processing"},
)
```
In daily processing it prints:
```
normal processing
```
In backfill processing with `airflow trigger_dag -c
'{"text": "override success"}'`, it prints
```
override success
```
Closes #3422 from milton0825/params-overridden-through-cli
[AIRFLOW-2538] Update the FAQ doc on how to reduce Airflow
scheduler latency. This comes from our internal
production setting, which also aligns with Maxime's
email (https://lists.apache.org/thread.html/%3CCAHEEp7WFAivyMJZ0N+0Zd1T3nvfyCJRudL3XSRLM4utSigR3dQmail.gmail.com%3E).
Closes #3434 from feng-tao/update_faq
Add docs to faq.rst about how to deal with the exception
"Global variable explicit_defaults_for_timestamp needs
to be on (1) for mysql".
Closes #3429 from milton0825/fix-docs
I'd like to have how-to guides for all connection
types, or at least the
different categories of connection types. I found
it difficult to figure
out how to manage a GCP connection, so this commit
adds a how-to guide for
this.
Also, since creating and editing connections
really aren't all that
different, the PR renames the "creating
connections" how-to to "managing
connections".
Closes #3419 from tswast/howto
KubernetesPodOperator now accepts a dict-type
parameter called "affinity", which represents a
group of affinity scheduling rules (nodeAffinity,
podAffinity, podAntiAffinity).
API reference:
https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.10/#affinity-v1-core
Closes #3369 from imroc/AIRFLOW-2397
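A minimal sketch of passing such an affinity dict; the affinity structure follows the Kubernetes API reference above, while the surrounding operator arguments are the usual KubernetesPodOperator basics (treat the exact parameter names as assumptions):
```
# Hedged sketch; operator parameter names are assumptions.
from datetime import datetime
from airflow import DAG
from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator

dag = DAG('example_pod_affinity', start_date=datetime(2018, 1, 1), schedule_interval=None)

# Only schedule the pod on nodes labelled disktype=ssd (nodeAffinity).
affinity = {
    'nodeAffinity': {
        'requiredDuringSchedulingIgnoredDuringExecution': {
            'nodeSelectorTerms': [{
                'matchExpressions': [{
                    'key': 'disktype',
                    'operator': 'In',
                    'values': ['ssd'],
                }],
            }],
        },
    },
}

task = KubernetesPodOperator(
    task_id='pod_with_affinity',
    name='pod-with-affinity',
    namespace='default',
    image='ubuntu:16.04',
    cmds=['bash', '-c', 'echo hello'],
    affinity=affinity,
    dag=dag,
)
```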