Add AzureDataLakeHook as a first step to enable Airflow to connect to
Azure Data Lake. The hook has a simple interface to upload and download
files, exposing all parameters available in the Azure Data Lake SDK, as
well as a check_for_file method to query whether a file exists in the
data lake.
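A rough usage sketch of the interface described above; the contrib import path and the default connection id are assumptions, not confirmed by this entry:

```python
from airflow.contrib.hooks.azure_data_lake_hook import AzureDataLakeHook

hook = AzureDataLakeHook(azure_data_lake_conn_id='azure_data_lake_default')

# check_for_file queries whether a path exists in the lake.
if hook.check_for_file('raw/events/2018-05-01.json'):
    hook.download_file(
        local_path='/tmp/events.json',
        remote_path='raw/events/2018-05-01.json',
    )
else:
    hook.upload_file(
        local_path='/tmp/events.json',
        remote_path='raw/events/2018-05-01.json',
    )
```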
[AIRFLOW-2420] Add functionality for Azure Data Lake
Make sure you have checked _all_ steps below.
### JIRA
- [x] My PR addresses the following [Airflow JIRA](https://issues.apache.org/jira/browse/AIRFLOW-2420) issues and references them in the PR title.
- https://issues.apache.org/jira/browse/AIRFLOW-2420
### Description
- [x] Here are some details about my PR, including screenshots of any UI changes:
This PR creates an Azure Data Lake hook (adl_hook.AdlHook) and all the setup required to create a new Azure Data Lake connection.
### Tests
- [x] My PR adds the following unit tests __OR__ does not need testing for this extremely good reason:
Adds tests for airflow.hooks.adl_hook.py in tests.hooks.test_adl_hook.py
### Commits
- [x] My commits all reference JIRA issues in
their subject lines, and I have squashed multiple
commits if they address the same issue. In
addition, my commits follow the guidelines from
"[How to write a good git commit
message](http://chris.beams.io/posts/git-
commit/)":
1. Subject is separated from body by a blank line
2. Subject is limited to 50 characters
3. Subject does not end with a period
4. Subject uses the imperative mood ("add", not
"adding")
5. Body wraps at 72 characters
6. Body explains "what" and "why", not "how"
### Documentation
- [x] In case of new functionality, my PR adds
documentation that describes how to use it.
- When adding new operators/hooks/sensors, the
autoclass documentation generation needs to be
added.
### Code Quality
- [x] Passes `git diff upstream/master -u -- "*.py" | flake8 --diff`
Closes#3333 from marcusrehm/master
Add lineage support by having inlets and outlets that are made available
to dependent upstream or downstream tasks.
If configured to do so, it can send lineage data to a backend. Apache
Atlas is supported out of the box.
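A hedged sketch of how a task might declare lineage; the File dataset class and the shape of the inlets/outlets dicts are assumptions based on this description:

```python
from datetime import datetime
from airflow import DAG
from airflow.lineage.datasets import File
from airflow.operators.bash_operator import BashOperator

dag = DAG('lineage_example', start_date=datetime(2018, 1, 1))

f_in = File('/data/raw/events.csv')     # inlet: consumed by the task
f_out = File('/data/clean/events.csv')  # outlet: offered to downstream tasks

transform = BashOperator(
    task_id='transform',
    bash_command='echo transform',
    inlets={'datasets': [f_in]},
    outlets={'datasets': [f_out]},
    dag=dag,
)
```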
Closes#3321 from bolkedebruin/lineage_exp
This is because the latest release (2.4) of apache-beam[gcp] is not
available for Python 3.x. Also, as we are using Google's discovery-based
API for all Google Cloud related commands, we do not need to import the
google-cloud-dataflow package.
Closes#3273 from kaxil/patch-3
The package hmsclient is Python 2/3 compatible and offers a handy
context manager to handle opening and closing connections.
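The context manager looks roughly like this; the host/port values are placeholders and the method name is drawn from the hmsclient README as an assumption:

```python
from hmsclient import hmsclient

# Placeholder Hive metastore thrift endpoint.
client = hmsclient.HMSClient(host='metastore.example.com', port=9083)
with client as c:
    # The context manager opens the connection and closes it on exit.
    print(c.check_for_named_partition('default', 'events', 'ds=2018-01-01'))
```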
Closes#3239 from gglanzani/AIRFLOW-2336
As of 0.17.0, dask distributed has support for TLS/SSL.
[dask] Added TLS/SSL support for the dask-distributed scheduler.
Add a test for TLS under dask distributed.
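Under the hood this amounts to handing dask distributed's Security object to the client connecting to a tls:// scheduler; a hedged sketch (keyword names follow distributed 0.17's Security API, file paths are placeholders):

```python
from distributed import Client
from distributed.security import Security

security = Security(
    tls_ca_file='/etc/ssl/dask/ca.pem',
    tls_client_cert='/etc/ssl/dask/cert.pem',
    tls_client_key='/etc/ssl/dask/key.pem',
    require_encryption=True,
)
client = Client('tls://dask-scheduler:8786', security=security)
```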
Closes#2683 from mariusvniekerk/dask-ssl
Currently, S3FileTransformOperator downloads the whole file from S3
before transforming and uploading it. Adding an extraction feature using
S3 Select to this operator improves its efficiency and usability.
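A hedged sketch of what using the extraction feature could look like; the select_expression parameter name is an assumption based on the S3 Select feature described here:

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.s3_file_transform_operator import S3FileTransformOperator

dag = DAG('s3_select_example', start_date=datetime(2018, 1, 1))

transform = S3FileTransformOperator(
    task_id='filter_then_transform',
    source_s3_key='s3://my-bucket/in/data.csv',
    dest_s3_key='s3://my-bucket/out/data.csv',
    transform_script='/usr/local/bin/transform.py',
    # With S3 Select, only the matching records are downloaded from S3.
    select_expression='SELECT * FROM s3object s LIMIT 100',
    dag=dag,
)
```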
Closes#3227 from sekikn/AIRFLOW-2299
S3FileTransformOperator doesn't work at the moment since it uses a
function which is no longer supported by boto3. This PR replaces it
with a valid function and adds a unit test for this operator.
Closes#3200 from sekikn/AIRFLOW-2293
One of the dependencies was pulling in
a GPL library by default. With the new
release of python-nvd3 this is now solved.
Closes#3160 from bolkedebruin/legal
TaskInstances are sometimes instantiated outside core Airflow with
naive datetimes. When this happens, we now default to using the time
zone of the DAG if that is available, or the default system time zone
otherwise.
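An illustrative sketch of the fallback (not the actual internal code); airflow.utils.timezone provides make_aware and the configured default TIMEZONE:

```python
from datetime import datetime

import pendulum
from airflow.utils import timezone

naive = datetime(2018, 1, 1)  # naive datetime handed to a TaskInstance
dag_tz = pendulum.timezone('Europe/Amsterdam')  # stands in for dag.timezone
aware = timezone.make_aware(naive, dag_tz or timezone.TIMEZONE)
```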
Closes#2946 from bolkedebruin/AIRFLOW-1927
This enables Airflow and Celery Flower to live below root. Draws on the
work of Gaetan Semet (@Stibbons).
This closes #2723 and closes #2818.
Closes#2952 from bolkedebruin/AIRFLOW-1755
Update Celery to 4.0.2 to fix the error
TypeError: '<=' not supported between instances of 'NoneType' and 'int'
Hi all,
I'd like to update Celery to version 4.0.2. While
updating my Docker container to version 1.9, I
caught this error:
```
worker_1 | [2018-01-03 10:34:29,934: CRITICAL/MainProcess] Unrecoverable error: TypeError("'<=' not supported between instances of 'NoneType' and 'int'",)
worker_1 | Traceback (most recent call last):
worker_1 |   File "/usr/local/lib/python3.6/site-packages/celery/worker/worker.py", line 203, in start
worker_1 |     self.blueprint.start(self)
worker_1 |   File "/usr/local/lib/python3.6/site-packages/celery/bootsteps.py", line 115, in start
worker_1 |     self.on_start()
worker_1 |   File "/usr/local/lib/python3.6/site-packages/celery/apps/worker.py", line 143, in on_start
worker_1 |     self.emit_banner()
worker_1 |   File "/usr/local/lib/python3.6/site-packages/celery/apps/worker.py", line 159, in emit_banner
worker_1 |     string(self.colored.reset(self.extra_info() or '')),
worker_1 |   File "/usr/local/lib/python3.6/site-packages/celery/apps/worker.py", line 188, in extra_info
worker_1 |     if self.loglevel <= logging.INFO:
worker_1 | TypeError: '<=' not supported between instances of 'NoneType' and 'int'
```
This is because I've been running Python 2 in my local environments, and the Docker image is Python 3:
https://github.com/puckel/docker-airflow/pull/143/files
This is the issue in Celery:
https://github.com/celery/celery/blob/0dde9df9d8dd5dbbb97ef75a81757bc2d9a4b33e/Changelog#L145
Make sure you have checked _all_ steps below.
### JIRA
- [x] My PR addresses the following [Airflow JIRA](https://issues.apache.org/jira/browse/AIRFLOW/) issues and references them in the PR title. For example, "[AIRFLOW-XXX] My Airflow PR"
- https://issues.apache.org/jira/browse/AIRFLOW-XXX
### Description
- [x] Here are some details about my PR, including
screenshots of any UI changes:
### Tests
- [x] My PR adds the following unit tests __OR__
does not need testing for this extremely good
reason:
### Commits
- [x] My commits all reference JIRA issues in
their subject lines, and I have squashed multiple
commits if they address the same issue. In
addition, my commits follow the guidelines from
"[How to write a good git commit
message](http://chris.beams.io/posts/git-
commit/)":
1. Subject is separated from body by a blank line
2. Subject is limited to 50 characters
3. Subject does not end with a period
4. Subject uses the imperative mood ("add", not
"adding")
5. Body wraps at 72 characters
6. Body explains "what" and "why", not "how"
- [x] Passes `git diff upstream/master -u -- "*.py" | flake8 --diff`
Closes#2914 from Fokko/AIRFLOW-1967-update-celery
Since S3Hook is reimplemented based on the AwsHook
using boto3, its package dependencies need to be
updated as well.
Closes#2790 from m1racoli/fix-setup-s3
python-daemon declares its docutils dependency in a setup_requires
clause, and 'python setup.py install' fails since it misses
that dependency.
Closes#2765 from wrp/docutils
https://github.com/spulec/moto/pull/1048 introduced `docker` as a
dependency in Moto, causing a conflict because Airflow uses `docker-py`.
Since the two packages don't work together, Moto is pinned to the
version prior to that change.
JayDeBeApi made a backwards-incompatible change.
This updates the JDBC Hook's implementation
and changes the required JayDeBeApi to >= 1.1.1.
Closes#2651 from r-richmond/AIRFLOW-926
The celery config is currently part of the celery executor definition.
This is really inflexible for users wanting to change it. In addition,
Celery 4 is moving to lowercase setting names.
Closes#2542 from bolkedebruin/upgrade_celery
By default `find_packages()` will find _any_ valid python package,
including things under tests. We don't want to install the test
packages into the python path, so exclude those.
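The change amounts to something like this in setup.py (the exact exclude patterns used are an assumption):

```python
from setuptools import find_packages, setup

setup(
    name='apache-airflow',
    # Exclude test packages so they don't get installed onto the python path.
    packages=find_packages(exclude=['tests', 'tests.*']),
)
```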
Closes#2597 from ashb/AIRFLOW-1594-dont-install-tests
Clean up the way logging is done within Airflow. Remove the old
logging.py and move to the airflow.utils.log.* interface. Remove
setting up logging outside of the settings/configuration code. Move
away from string formatting to logging_function(msg, *args).
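The logging-call change looks like this:

```python
import logging

log = logging.getLogger(__name__)
task_id = 'example_task'

# Old: the message is formatted eagerly, even when the level is disabled.
log.info("Executing task %s" % task_id)
# New: arguments are passed to the logging call and formatted lazily.
log.info("Executing task %s", task_id)
```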
Closes#2592 from Fokko/AIRFLOW-1582-Improve-logging-structure
Make the druid operator and hook more specific. This allows us to have
a more flexible configuration, for example to ingest parquet. Also get
rid of the PyDruid extension, since it is more focused on querying
druid rather than ingesting data. Just requests is sufficient to submit
an indexing job. Add a test to the hive_to_druid operator to make sure
it behaves as we expect. Furthermore, the docstring was cleaned up a
bit.
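Submitting an indexing job with plain requests boils down to something like this; the overlord URL and the spec are placeholders, while /druid/indexer/v1/task is Druid's standard task endpoint:

```python
import json

import requests

index_spec = {'type': 'index_hadoop', 'spec': {}}  # placeholder ingestion spec
response = requests.post(
    'http://druid-overlord.example.com:8090/druid/indexer/v1/task',
    data=json.dumps(index_spec),
    headers={'Content-Type': 'application/json'},
)
response.raise_for_status()
print(response.json()['task'])  # the overlord answers with the task id
```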
Closes#2378 from Fokko/AIRFLOW-1324-make-more-general-druid-hook-and-operator
1. Upgrade qds_sdk version to latest
2. Add support to run Zeppelin Notebooks
3. Move the initialization of QuboleHook out of `__init__()`
Closes#2322 from msumit/AIRFLOW-1192
Currently, SecurityTests.test_csrf_rejection fails because the
flask-wtf version specified in setup.py is too old.
This PR fixes it.
Closes#2280 from sekikn/AIRFLOW-1180
Per Apache requirements, Airflow should be branded Apache Airflow.
It is impossible to provide a forward-compatible automatic update path,
so users will be required to upgrade manually.
Closes#2172 from bolkedebruin/AIRFLOW-1000
Add DatabricksSubmitRun Operator
In this PR, we contribute a DatabricksSubmitRun operator and a
Databricks hook. This operator enables easy integration of Airflow
with Databricks. In addition to the operator, we have created a
databricks_default connection, an example_dag using this
DatabricksSubmitRunOperator, and matching documentation.
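A hedged sketch of the operator in a DAG; the cluster and notebook values are placeholders:

```python
from datetime import datetime
from airflow import DAG
from airflow.contrib.operators.databricks_operator import DatabricksSubmitRunOperator

dag = DAG('databricks_example', start_date=datetime(2017, 1, 1))

notebook_run = DatabricksSubmitRunOperator(
    task_id='notebook_run',
    databricks_conn_id='databricks_default',  # the new default connection
    json={
        'new_cluster': {'spark_version': '2.1.0-db3-scala2.11', 'num_workers': 2},
        'notebook_task': {'notebook_path': '/Users/someone@example.com/MyNotebook'},
    },
    dag=dag,
)
```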
Closes#2202 from andrewmchen/databricks-operator-squashed
This PR implements a hook to interface with Azure
storage over wasb://
via azure-storage; adds sensors to check for blobs
or prefixes; and
adds an operator to transfer a local file to the
Blob Storage.
Design is similar to that of the S3Hook in
airflow.operators.S3_hook.
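A minimal sketch of the hook, mirroring that S3Hook-style design; method and parameter names are assumptions:

```python
from airflow.contrib.hooks.wasb_hook import WasbHook

hook = WasbHook(wasb_conn_id='wasb_default')

# Upload a local file unless the blob is already there.
if not hook.check_for_blob('mycontainer', 'data/2017-04-01.csv'):
    hook.load_file('/tmp/local.csv', 'mycontainer', 'data/2017-04-01.csv')
```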
Closes#2216 from hgrif/AIRFLOW-1065
We add the Apache-licensed bleach library and use
it to sanitize html
passed to Markup (which is supposed to be already
escaped). This avoids
some XSS issues with unsanitized user input being
displayed.
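The sanitization amounts to:

```python
import bleach
from markupsafe import Markup

user_input = '<script>alert("xss")</script><b>status</b>'
# bleach.clean strips or escapes disallowed tags before the string is
# wrapped in Markup (which would otherwise trust it as already escaped).
safe = Markup(bleach.clean(user_input))
```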
Closes#2193 from saguziel/aguziel-xss
This PR includes a redis_hook and a redis_key_sensor to enable
checking for key existence in redis. It also updates the
documentation and adds the relevant unit tests.
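A hedged sketch of the sensor; the import path and parameter names are assumptions based on this description:

```python
from datetime import datetime
from airflow import DAG
from airflow.contrib.sensors.redis_key_sensor import RedisKeySensor

dag = DAG('redis_example', start_date=datetime(2017, 1, 1))

wait_for_key = RedisKeySensor(
    task_id='wait_for_key',
    redis_conn_id='redis_default',
    key='jobs:finished',  # the sensor succeeds once this key exists
    dag=dag,
)
```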
- [x] Opened a PR on Github
- [x] My PR addresses the following Airflow JIRA
issues:
- https://issues.apache.org/jira/browse/AIRFLOW-999
- [x] The PR title references the JIRA issues. For
example, "[AIRFLOW-1] My Airflow PR"
- [x] My PR adds unit tests
- [ ] __OR__ my PR does not need testing for this
extremely good reason:
- [x] Here are some details about my PR:
- [ ] Here are screenshots of any UI changes, if
appropriate:
- [x] Each commit subject references a JIRA issue.
For example, "[AIRFLOW-1] Add new feature"
- [x] Multiple commits addressing the same JIRA
issue have been squashed
- [x] My commits follow the guidelines from "[How
to write a good git commit
message](http://chris.beams.io/posts/git-
commit/)":
1. Subject is separated from body by a blank line
2. Subject is limited to 50 characters
3. Subject does not end with a period
4. Subject uses the imperative mood ("add", not
"adding")
5. Body wraps at 72 characters
6. Body explains "what" and "why", not "how"
Closes#2165 from msempere/AIRFLOW-999/support-for-redis-database
This silences deprecation warnings, e.g.
airflow/airflow/utils/dag_processing.py:578: DeprecationWarning: The 'warn' method is deprecated, use 'warning' instead
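The change in call style:

```python
import logging

log = logging.getLogger(__name__)
log.warn("old spelling")     # deprecated alias, triggers the warning above
log.warning("new spelling")  # equivalent call without the warning
```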
Closes#2082 from imbaczek/bug871
Submitting on behalf of plypaul
Please accept this PR that addresses the following
issues:
- https://issues.apache.org/jira/browse/AIRFLOW-219
- https://issues.apache.org/jira/browse/AIRFLOW-398
Testing Done:
- Running on Airbnb prod (though on a different
mergebase) for many months
Credits:
Impersonation work: georgeke did most of the work, but plypaul did
quite a bit of work too.
Cgroups: plypaul did most of the work; I just did some touch-up/bug
fixes (see commit history; the cgroups + impersonation commit is
actually plypaul's, not mine)
Closes#1934 from aoen/ddavydov/cgroups_and_impersonation_after_rebase
This implements a framework for API calls to Airflow. Currently
all access is done by cli or web ui. Especially in the context
of the cli this raises security concerns which can be alleviated
with a secured API call over the wire.
Secondly, integration with other systems is a bit harder if you have
to call a cli. For public-facing endpoints JSON is used.
As an example, the trigger_dag functionality is now made into an
API call.
Backwards compatibility is retained by switching to a LocalClient.
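A minimal sketch of triggering a DAG through the client abstraction (argument values are placeholders):

```python
from airflow.api.client.local_client import Client

client = Client(api_base_url=None, auth=None)  # LocalClient: in-process call
client.trigger_dag(dag_id='example_bash_operator', conf='{"param": 1}')
```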
Dear Airflow Maintainers,
Please accept this PR that addresses the following
issues:
- https://issues.apache.org/jira/browse/AIRFLOW-198
Testing Done:
- Local testing of dag operation with
LatestOnlyOperator
- Unit test added
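A sketch of the operator in use; downstream tasks are skipped for all but the most recent scheduled run:

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.latest_only_operator import LatestOnlyOperator

dag = DAG('latest_only_example', start_date=datetime(2016, 1, 1),
          schedule_interval='@daily')

latest_only = LatestOnlyOperator(task_id='latest_only', dag=dag)
task = DummyOperator(task_id='task', dag=dag)
latest_only.set_downstream(task)  # 'task' runs only for the latest run
```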
Closes#1752 from gwax/latest_only
Instead of parsing the DAG definition files in the same process as the
scheduler, this change parses the files in a child process. This helps
to isolate the scheduler from bad user code.
Closes#1636 from plypaul/plypaul_schedule_by_file_rebase_master
Highcharts' license is not compatible with the Apache 2.0
license. This patch removes Highcharts in favor of d3,
however some charts are not supported anymore.
* This brings Maxime Beauchemin's work to master
The new flask-admin==1.4.1 release on 2016-06-13 breaks the Airflow
release currently in PyPI (1.7.1.2). This fixes the edge case triggered
by this new release.
* Closes#1588 on github
Airflow spawns children in the form of a webserver, scheduler, and executors.
If the parent gets terminated (SIGTERM), it needs to properly propagate the
signals to the children, otherwise these will get orphaned and end up as
zombie processes. This patch resolves that issue.
In addition, Airflow does not store the PID of its services so they can be
managed by traditional unix service managers like rc.d / upstart / systemd
and the like. This patch adds the "--pid" flag. By default it stores the
PID in ~/airflow/airflow-<service>.pid
Lastly, the patch adds support for different log file locations: log,
stdout, and stderr (respectively: --log-file, --stdout, --stderr). By
default these are stored in ~/airflow/airflow-<service>.log/out/err.
* Resolves ISSUE-852
* Added the ability to skip DAG elements based on a raised Exception (see the sketch below)
* Added nose-parameterized to test dependencies
* Fix for broken mysql test - provided by jlowin
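A hedged sketch of the skip mechanism, assuming it is driven by AirflowSkipException from airflow.exceptions:

```python
from airflow.exceptions import AirflowSkipException

def check_for_new_data(**context):
    new_files = []  # placeholder for a real check
    if not new_files:
        # Raising this marks the task as skipped instead of failed.
        raise AirflowSkipException('No new data; skipping this run.')
```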
BaseOperator silently accepts any arguments. This deprecates the
behavior with a warning that says it will be forbidden in Airflow 2.0.
This PR also turns on DeprecationWarnings by default, which in turn
revealed that inspect.getargspec is deprecated. Here it is replaced by
`inspect.signature` (Python 3) or `funcsigs.signature` (Python 2).
Lastly, this brought to light that example_http_operator was
passing an illegal argument.
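The replacement pattern:

```python
import sys

if sys.version_info[0] >= 3:
    from inspect import signature
else:
    from funcsigs import signature  # PyPI backport for Python 2

def execute(self, context=None):
    pass

print(list(signature(execute).parameters))  # ['self', 'context']
```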
The snakebite library just added support for specifying hdfs_namenode_principal
for the Kerberos auth method, and this PR allows users to pass this config in
from HDFSHook. Also bumps the version of snakebite.