Executes a task in a Kubernetes pod in the specified Google Kubernetes
Engine cluster. This makes it easier to interact with the GCP Kubernetes
Engine service because it encapsulates acquiring credentials.
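For illustration only (not part of the original commit), a minimal usage sketch might look like the following; the class name GKEPodOperator, its import path, and the project/cluster values are assumptions, and a `dag` object is presumed to exist:
```
# Hypothetical sketch: class name, module path, and all argument values are
# assumptions; a `dag` object is assumed to be defined elsewhere.
from airflow.contrib.operators.gcp_container_operator import GKEPodOperator

run_in_gke = GKEPodOperator(
    task_id='run_in_gke_pod',
    project_id='my-gcp-project',      # hypothetical GCP project
    location='us-central1-a',
    cluster_name='my-gke-cluster',    # hypothetical GKE cluster
    name='example-pod',
    namespace='default',
    image='ubuntu:16.04',
    cmds=['echo'],
    arguments=['hello from GKE'],
    dag=dag,
)
```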
The documentation page "Scheduling & Triggers"
only mentions the CLI method to
manually trigger a DAG run.
However, the manual-trigger feature in the Web UI
should be mentioned as well
(it may be used even more frequently).
Just like the partition sensor for Hive,
this PR adds a sensor that waits for
a table to be created in a Cassandra cluster.
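For illustration (not taken from the PR itself), usage might look roughly like this; the class name CassandraTableSensor, its import path, the connection id, and the keyspace/table are all assumptions:
```
# Hedged sketch: import path, class name, and argument names are assumptions;
# a `dag` object is assumed to be defined elsewhere.
from airflow.contrib.sensors.cassandra_sensor import CassandraTableSensor

wait_for_table = CassandraTableSensor(
    task_id='wait_for_cassandra_table',
    cassandra_conn_id='cassandra_default',   # hypothetical connection id
    table='my_keyspace.my_table',            # hypothetical keyspace.table
    poke_interval=60,
    dag=dag,
)
```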
Closes#3518 from sekikn/AIRFLOW-2640
When Airflow was populating a DagBag from a .zip
file, if a single
file in the root directory did not contain the
strings 'airflow' and
'DAG', it would ignore the entire .zip file.
Also added a small amount of logging so as not to
bombard the user with info
about skipping their .py files.
Closes#3505 from Noremac201/dag_name
Add Google Kubernetes Engine create_cluster,
delete_cluster operators
This allows users to use Airflow to create or
delete clusters on
Google Cloud Platform.
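As a hedged sketch of how such operators might be wired into a DAG (the class names, parameter names, and values below are assumptions, not taken from this commit):
```
# Hypothetical sketch only: class and parameter names are assumptions;
# a `dag` object is assumed to be defined elsewhere.
from airflow.contrib.operators.gcp_container_operator import (
    GKEClusterCreateOperator,
    GKEClusterDeleteOperator,
)

create_cluster = GKEClusterCreateOperator(
    task_id='create_gke_cluster',
    project_id='my-gcp-project',
    location='us-central1-a',
    body={'name': 'ephemeral-cluster', 'initial_node_count': 1},
    dag=dag,
)

delete_cluster = GKEClusterDeleteOperator(
    task_id='delete_gke_cluster',
    project_id='my-gcp-project',
    location='us-central1-a',
    name='ephemeral-cluster',
    dag=dag,
)

create_cluster >> delete_cluster
```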
Closes#3477 from Noremac201/gke_create
* Updates the GCP hooks to use the google-auth
library and removes
dependencies on the deprecated oauth2client
package.
* Removes inconsistent handling of the scope
parameter for different
auth methods.
Note: using google-auth for credentials requires a
newer version of the
google-api-python-client package, so this commit
also updates the
minimum version for that.
To avoid some annoying warnings about the
discovery cache not being
supported, the discovery cache is disabled
explicitly, as recommended here:
https://stackoverflow.com/a/44518587/101923
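For reference, the pattern suggested in that answer is to pass cache_discovery=False when building the API client; a minimal sketch (the service name and the authorized `http` object are placeholders, not this commit's code):
```
# Illustrative only: `http` stands in for an already-authorized Http object
# obtained from the hook; the service name/version is a placeholder.
from googleapiclient.discovery import build

service = build('dataflow', 'v1b3', http=http, cache_discovery=False)
```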
Tested by running:
nosetests
tests/contrib/operators/test_dataflow_operator.py
\
tests/contrib/operators/test_gcs*.py \
tests/contrib/operators/test_mlengine_*.py \
tests/contrib/operators/test_pubsub_operator.py \
tests/contrib/hooks/test_gcp*.py \
tests/contrib/hooks/test_gcs_hook.py \
tests/contrib/hooks/test_bigquery_hook.py
and also tested by running some GCP-related DAGs
locally, such as the
Dataproc DAG example at
https://cloud.google.com/composer/docs/quickstart
Closes#3488 from tswast/google-auth
Make sure you have checked _all_ steps below.
### JIRA
- [x] My PR addresses the following [Airflow JIRA]
(https://issues.apache.org/jira/browse/AIRFLOW/)
issues and references them in the PR title. For
example, "\[AIRFLOW-XXX\] My Airflow PR"
-
https://issues.apache.org/jira/browse/AIRFLOW-2526
- In case you are fixing a typo in the
documentation you can prepend your commit with
\[AIRFLOW-XXX\], code changes always need a JIRA
issue.
### Description
- [x] Here are some details about my PR, including
screenshots of any UI changes:
params can be overridden by the dictionary passed
through `airflow backfill -c`
```
from airflow.operators.bash_operator import BashOperator

templated_command = """
    echo "text = {{ params.text }}"
"""
bash_operator = BashOperator(
    task_id='bash_task',
    bash_command=templated_command,
    dag=dag,
    # default value; can be overridden from the CLI
    params={"text": "normal processing"},
)
```
In daily processing it prints:
```
normal processing
```
When backfilling with `airflow backfill -c
'{"text": "override success"}'`, it prints
```
override success
```
### Tests
- [ ] My PR adds the following unit tests __OR__
does not need testing for this extremely good
reason:
### Commits
- [x] My commits all reference JIRA issues in
their subject lines, and I have squashed multiple
commits if they address the same issue. In
addition, my commits follow the guidelines from
"[How to write a good git commit
message](http://chris.beams.io/posts/git-
commit/)":
1. Subject is separated from body by a blank line
2. Subject is limited to 50 characters
3. Subject does not end with a period
4. Subject uses the imperative mood ("add", not
"adding")
5. Body wraps at 72 characters
6. Body explains "what" and "why", not "how"
### Documentation
- [x] In case of new functionality, my PR adds
documentation that describes how to use it.
- When adding new operators/hooks/sensors, the
autoclass documentation generation needs to be
added.
### Code Quality
- [x] Passes `git diff upstream/master -u --
"*.py" | flake8 --diff`
Closes#3422 from milton0825/params-overridden-
through-cli
Make sure you have checked _all_ steps below.
### JIRA
- [x] My PR addresses the following [Airflow JIRA]
(https://issues.apache.org/jira/browse/AIRFLOW/)
issues and references them in the PR title. For
example, "\[AIRFLOW-XXX\] My Airflow PR"
-
https://issues.apache.org/jira/browse/AIRFLOW-2538
- In case you are fixing a typo in the
documentation you can prepend your commit with
\[AIRFLOW-XXX\], code changes always need a JIRA
issue.
### Description
- [x] Here are some details about my PR, including
screenshots of any UI changes:
Update the FAQ doc on how to reduce Airflow
scheduler latency. This comes from our internal
production settings, which also align with Maxime's
email (https://lists.apache.org/thread.html/%3CCAHEEp7WFAivyMJZ0N+0Zd1T3nvfyCJRudL3XSRLM4utSigR3dQmail.gmail.com%3E).
### Tests
- [ ] My PR adds the following unit tests __OR__
does not need testing for this extremely good
reason:
### Commits
- [ ] My commits all reference JIRA issues in
their subject lines, and I have squashed multiple
commits if they address the same issue. In
addition, my commits follow the guidelines from
"[How to write a good git commit
message](http://chris.beams.io/posts/git-
commit/)":
1. Subject is separated from body by a blank line
2. Subject is limited to 50 characters
3. Subject does not end with a period
4. Subject uses the imperative mood ("add", not
"adding")
5. Body wraps at 72 characters
6. Body explains "what" and "why", not "how"
### Documentation
- [ ] In case of new functionality, my PR adds
documentation that describes how to use it.
- When adding new operators/hooks/sensors, the
autoclass documentation generation needs to be
added.
### Code Quality
- [ ] Passes `git diff upstream/master -u --
"*.py" | flake8 --diff`
Closes#3434 from feng-tao/update_faq
Add docs to faq.rst explaining how to deal with
the MySQL exception: "Global variable
explicit_defaults_for_timestamp needs to be on (1)".
Closes#3429 from milton0825/fix-docs
I'd like to have how-to guides for all connection
types, or at least the
different categories of connection types. I found
it difficult to figure
out how to manage a GCP connection, so this commit
adds a how-to guide for
that.
Also, since creating and editing connections
really aren't all that
different, the PR renames the "creating
connections" how-to to "managing
connections".
Closes#3419 from tswast/howto
KubernetesPodOperator now accepts a dict-type
parameter called "affinity", which represents a
group of affinity scheduling rules (nodeAffinity,
podAffinity, podAntiAffinity).
API reference: https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.10/#affinity-v1-core
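For illustration (not taken from the PR), passing the new parameter might look like this; the affinity dict follows the Kubernetes Affinity v1 schema linked above, and the node label, image, and names are placeholders:
```
# Hedged sketch: node label, image, and other values are placeholders;
# a `dag` object is assumed to be defined elsewhere.
from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator

affinity = {
    'nodeAffinity': {
        'requiredDuringSchedulingIgnoredDuringExecution': {
            'nodeSelectorTerms': [{
                'matchExpressions': [{
                    'key': 'disktype',
                    'operator': 'In',
                    'values': ['ssd'],
                }],
            }],
        },
    },
}

pod_task = KubernetesPodOperator(
    task_id='pod_with_affinity',
    name='pod-with-affinity',
    namespace='default',
    image='ubuntu:16.04',
    cmds=['echo', 'hello'],
    affinity=affinity,
    dag=dag,
)
```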
Closes#3369 from imroc/AIRFLOW-2397
Add AzureDataLakeHook as a first step to enable
Airflow to connect to
Azure Data Lake.
The hook has a simple interface to upload and
download files, exposing all
parameters available in the Azure Data Lake SDK, and
also a check_for_file
to query whether a file exists in the data lake.
[AIRFLOW-2420] Add functionality for Azure Data
Lake
Make sure you have checked _all_ steps below.
### JIRA
- [x] My PR addresses the following [Airflow JIRA]
(https://issues.apache.org/jira/browse/AIRFLOW-2420)
issues and references them in the PR title.
-
https://issues.apache.org/jira/browse/AIRFLOW-2420
### Description
- [x] Here are some details about my PR, including
screenshots of any UI changes:
This PR creates Azure Data Lake hook
(adl_hook.AdlHook) and all the setup required to
create a new Azure Data Lake connection.
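For illustration only, a rough usage sketch based on the description above; the module path, connection id argument, method signatures, and file paths are assumptions, not verified API:
```
# Hedged sketch: module path, constructor argument, method signatures, and
# file paths are assumptions based on the PR description.
from airflow.hooks.adl_hook import AdlHook

hook = AdlHook(azure_data_lake_conn_id='azure_data_lake_default')

if not hook.check_for_file('raw/2018-06-01/events.csv'):
    hook.upload_file(local_path='/tmp/events.csv',
                     remote_path='raw/2018-06-01/events.csv')
```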
### Tests
- [x] My PR adds the following unit tests __OR__
does not need testing for this extremely good
reason:
Adds tests for airflow.hooks.adl_hook.py in
tests.hooks.test_adl_hook.py
### Commits
- [x] My commits all reference JIRA issues in
their subject lines, and I have squashed multiple
commits if they address the same issue. In
addition, my commits follow the guidelines from
"[How to write a good git commit
message](http://chris.beams.io/posts/git-
commit/)":
1. Subject is separated from body by a blank line
2. Subject is limited to 50 characters
3. Subject does not end with a period
4. Subject uses the imperative mood ("add", not
"adding")
5. Body wraps at 72 characters
6. Body explains "what" and "why", not "how"
### Documentation
- [x] In case of new functionality, my PR adds
documentation that describes how to use it.
- When adding new operators/hooks/sensors, the
autoclass documentation generation needs to be
added.
### Code Quality
- [x] Passes `git diff upstream/master -u --
"*.py" | flake8 --diff`
Closes#3333 from marcusrehm/master
Add lineage support by having inlets and outlets
that
are made available to dependent upstream or
downstream
tasks.
If configured to do so, Airflow can send lineage data to a
backend. Apache Atlas is supported out of the box.
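For illustration (not part of this commit message), declaring inlets and outlets on a task might look roughly like this; the File dataset class and the dict-shaped inlets/outlets arguments are assumptions about the lineage API, and the paths are placeholders:
```
# Hedged sketch: the File dataset class and the inlets/outlets argument shape
# are assumptions; a `dag` object is assumed to be defined elsewhere.
from airflow.lineage.datasets import File
from airflow.operators.bash_operator import BashOperator

f_in = File('/tmp/input_data/')
f_out = File('/tmp/output/{{ execution_date }}')

run_this = BashOperator(
    task_id='transform_data',
    bash_command='echo 1',
    inlets={'datasets': [f_in]},
    outlets={'datasets': [f_out]},
    dag=dag,
)
```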
Closes#3321 from bolkedebruin/lineage_exp
This PR adds a previously undocumented AWS-related
operator to the "Integration" section and fixes some
obsolete descriptions.
Closes#3340 from sekikn/AIRFLOW-2446
Searching through all the documentation, I couldn't
find anywhere
that explained what file format is expected for
uploading settings.
Closes#2802 from bovard/variable_files_are_json
I parsed it with the ol' eyeball compiler. Someone
could flake8 it better, perhaps.
Changes:
- correct `def` syntax on line 50
- use literal dict on line 67
Closes#2479 from 0atman/patch-1
In its current form, MesosExecutor schedules tasks
on Mesos slaves as
plain Airflow commands, assuming that the Mesos slaves
already
have Airflow installed and configured on them.
This assumption goes
against the Mesos philosophy of having a
heterogeneous cluster.
Since Mesos provides an option to pull a Docker
image before
running the actual task/command, this
improvement changes
mesos_executor.py to accept an optional Docker
image containing
Airflow, which can be pulled on slaves before
running the actual
Airflow command. This also opens the door for an
optimization of
resources in a future PR, by allowing the
specification of the CPU and
memory needed for each Airflow task.
Closes#3008 from agrajm/AIRFLOW-2068
The logs are kept inside the worker pod. By
attaching a persistent
disk we keep the logs and make them available to
the webserver.
- Remove the requirements.txt since we don't want
to maintain another
dependency file
- Fix some small casing issues
- Remove some unused code
- Add missing shebang lines
- Start on some docs
- Fix the logging
Closes#3252 from Fokko/airflow-2357-pd-for-logs