There are still celeryd_concurrency occurrences left in the code. These need to be renamed to worker_concurrency to make the config consistent with Celery.
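A hedged sketch of reading the renamed option; the [celery] section name and the getint call reflect Airflow's config API, but treat the exact names as assumptions:

    from airflow import configuration as conf

    # The option was renamed from celeryd_concurrency to worker_concurrency
    # so that the name matches Celery's own setting.
    worker_concurrency = conf.getint('celery', 'worker_concurrency')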
Closes #2870 from Fokko/AIRFLOW-1911-update-airflow-config
Options were set to the visibility timeout instead of broker_options directly. Furthermore, options should be int, float, bool, or string, not all strings.
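A minimal sketch of the kind of coercion this implies; the helper below is hypothetical, not the actual implementation:

    def coerce(value):
        # Turn a raw config string into a bool, int, float, or string.
        if value.lower() in ('true', 'false'):
            return value.lower() == 'true'
        for cast in (int, float):
            try:
                return cast(value)
            except ValueError:
                pass
        return value

    raw_options = {'visibility_timeout': '21600'}
    # {'visibility_timeout': '21600'} becomes {'visibility_timeout': 21600}
    broker_options = {k: coerce(v) for k, v in raw_options.items()}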
Closes #2867 from bolkedebruin/AIRFLOW-1908
Explicitly set the celery backend from the config, and align the config with the celery config, as the mismatch might be confusing.
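A hedged sketch of wiring the backend from Airflow's config into the Celery app; the option names are assumptions based on this change:

    from celery import Celery
    from airflow import configuration as conf

    app = Celery(
        'airflow.executors.celery_executor',
        broker=conf.get('celery', 'broker_url'),
        # The result backend is now set explicitly from the config
        # instead of falling back to Celery's defaults.
        backend=conf.get('celery', 'result_backend'))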
Closes #2806 from Fokko/AIRFLOW-1840-Fix-celery-config
https://github.com/spulec/moto/pull/1048 introduced `docker` as a dependency in Moto, causing a conflict because Airflow uses `docker-py`. As the two packages cannot be installed together, Moto is pinned to the version prior to that change.
In the very early days, the Airflow scheduler needed to be restarted every so often to properly take new DAG_FOLDERS mutations into account. This is no longer required.
Closes #2677 from mistercrunch/scheduler_runs
The celery config is currently part of the celery executor definition. This is really inflexible for users wanting to change it. In addition, Celery 4 is moving to lowercase setting names.
Closes #2542 from bolkedebruin/upgrade_celery
In all the popular languages, the variable name `log` is the de facto standard for logging. Rename LoggingMixin.py to logging_mixin.py to comply with the Python standard. When using `.logger`, a deprecation warning is emitted.
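A minimal sketch of how such a deprecation shim can work (illustrative, not necessarily the exact code):

    import logging
    import warnings

    class LoggingMixin(object):
        @property
        def log(self):
            # The de facto standard name for the logger.
            return logging.getLogger(
                self.__class__.__module__ + '.' + self.__class__.__name__)

        @property
        def logger(self):
            # Deprecated alias kept for backwards compatibility.
            warnings.warn(
                'logger has been renamed to log', DeprecationWarning,
                stacklevel=2)
            return self.log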
Closes #2604 from Fokko/AIRFLOW-1604-logger-to-log
Make the druid operator and hook more specific. This allows us to have a more flexible configuration, for example to ingest parquet. Also get rid of the PyDruid extension, since it is more focused on querying druid than on ingesting data; plain requests is sufficient to submit an indexing job. Add a test to the hive_to_druid operator to make sure it behaves as we expect. Furthermore, cleaned up the docstring a bit.
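Submitting an ingestion task with plain requests looks roughly like this; the Overlord URL and the spec are placeholders, while /druid/indexer/v1/task is Druid's standard indexing endpoint:

    import json
    import requests

    # POST an ingestion spec to the Druid Overlord.
    url = 'http://druid-overlord:8090/druid/indexer/v1/task'
    ingest_spec = {'type': 'index_hadoop', 'spec': {}}  # placeholder spec
    response = requests.post(
        url, data=json.dumps(ingest_spec),
        headers={'Content-Type': 'application/json'})
    response.raise_for_status()
    task_id = response.json()['task']  # Druid returns the task id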
Closes #2378 from Fokko/AIRFLOW-1324-make-more-general-druid-hook-and-operator
1. Upgrade qds_sdk version to latest
2. Add support to run Zeppelin Notebooks
3. Move initialization of QuboleHook out of __init__() (see the sketch below)
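Moving hook construction out of the constructor is the usual lazy-initialization pattern; a hypothetical sketch (class name and call signatures are illustrative only):

    from airflow.models import BaseOperator
    from airflow.contrib.hooks.qubole_hook import QuboleHook

    class ExampleQuboleOperator(BaseOperator):  # hypothetical name
        def __init__(self, **kwargs):
            super(ExampleQuboleOperator, self).__init__(**kwargs)
            self.hook = None  # the hook is no longer built here

        def execute(self, context):
            # Build the hook lazily, at run time rather than at DAG
            # parse time, keeping DAG parsing cheap and side-effect free.
            if self.hook is None:
                self.hook = QuboleHook()
            return self.hook.execute(context)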
Closes #2322 from msumit/AIRFLOW-1192
Rename all unit tests under tests/contrib to start with test_* and fix broken unit tests so that they run for both the Python 2 and 3 builds.
Closes #2234 from hgrif/AIRFLOW-1094
This PR implements a hook to interface with Azure storage over wasb:// via azure-storage; adds sensors to check for blobs or prefixes; and adds an operator to transfer a local file to Blob Storage. The design is similar to that of the S3Hook in airflow.operators.S3_hook.
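A hedged usage sketch; the class and method names below were introduced by this change, so treat the exact signatures as assumptions:

    from airflow.contrib.hooks.wasb_hook import WasbHook

    hook = WasbHook(wasb_conn_id='wasb_default')
    # Upload a local file to Azure Blob Storage over wasb://
    hook.load_file('/tmp/report.csv', container_name='mycontainer',
                   blob_name='reports/report.csv')
    # The sensors build on checks like this one:
    assert hook.check_for_blob('mycontainer', 'reports/report.csv')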
Closes #2216 from hgrif/AIRFLOW-1065
The ShortCircuitOperator, BranchPythonOperator, and LatestOnlyOperator were arbitrarily changing the states of TaskInstances without locking them in the database. As the scheduler checks the state of dag runs asynchronously, the dag run state could be set to failed while the operators were still updating the downstream tasks.
A better fix would be to use the dag run itself in the context of the operator.
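In SQLAlchemy terms, locking the rows while updating amounts to a SELECT ... FOR UPDATE; a minimal sketch (the query is an illustration, not the fix itself):

    from airflow.models import TaskInstance
    from airflow.utils.db import provide_session

    @provide_session
    def skip_downstream(task_ids, execution_date, session=None):
        # with_for_update() locks the selected rows until commit, so the
        # scheduler cannot act on a half-updated set of task instances.
        tis = (session.query(TaskInstance)
               .filter(TaskInstance.task_id.in_(task_ids),
                       TaskInstance.execution_date == execution_date)
               .with_for_update()
               .all())
        for ti in tis:
            ti.state = 'skipped'
        session.commit()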
The output from the subprocess is bytes when universal_newlines is set to False (the default). This fails on Python 3 but works fine on Python 2. Fixed, with a working unit test.
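The incompatibility in a nutshell (illustrative):

    import subprocess

    # With universal_newlines=False (the default), the output is bytes.
    out = subprocess.check_output(['echo', 'hello'])
    line = out.decode('utf-8').strip()  # required on Python 3

    # Alternatively, ask for text directly:
    out = subprocess.check_output(['echo', 'hello'], universal_newlines=True)
    line = out.strip()  # already str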
Closes #2158 from abij/AIRFLOW-840
We add the Apache-licensed bleach library and use it to sanitize html passed to Markup (which is supposed to be already escaped). This avoids some XSS issues with unsanitized user input being displayed.
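Roughly, the pattern is (illustrative):

    import bleach
    from markupsafe import Markup

    user_input = '<script>alert("xss")</script><b>hello</b>'
    # Sanitize before wrapping in Markup; Markup marks the string as
    # safe and skips escaping, so it must never see raw user input.
    safe = Markup(bleach.clean(user_input))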
Closes #2193 from saguziel/aguziel-xss
Avoid unnecessary backfills by having start dates of just a few days ago. Adds a utility function airflow.utils.dates.days_ago().
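Typical usage in a DAG definition:

    from airflow.utils.dates import days_ago

    default_args = {
        'owner': 'airflow',
        # A recent start date avoids a large accidental backfill.
        'start_date': days_ago(2),
    }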
Closes #2068 from jlowin/example-start-date
Submitting on behalf of plypaul.
Please accept this PR that addresses the following issues:
- https://issues.apache.org/jira/browse/AIRFLOW-219
- https://issues.apache.org/jira/browse/AIRFLOW-398
Testing Done:
- Running on Airbnb prod (though on a different mergebase) for many months
Credits:
- Impersonation work: georgeke did most of the work, but plypaul did quite a bit of work too.
- Cgroups: plypaul did most of the work; I just did some touch-up/bug fixes (see commit history; the cgroups + impersonation commit is actually plypaul's, not mine).
Closes #1934 from aoen/ddavydov/cgroups_and_impersonation_after_rebase
Extend SchedulerJob to instrument the execution performance of task instances contained in each DAG. We want to know if any DAG is starved of resources, and this will be reflected in the stats printed out at the end of the test run.
This test is for instrumenting the operational impact of
https://github.com/apache/incubator-airflow/pull/1906.
Closes #1919 from vijaysbhat/scheduler_perf_tool
This implements a framework for API calls to Airflow. Currently all access is done via the cli or the web ui. Especially in the context of the cli this raises security concerns, which can be alleviated with a secured API call over the wire. Secondly, integration with other systems is harder if you have to call a cli. For public facing endpoints, JSON is used.
As an example, the trigger_dag functionality is now made into an API call.
Backwards compatibility is retained by switching to a LocalClient.
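A hedged sketch of triggering a DAG through the client layer; the module path and signatures are assumptions based on this change:

    from airflow.api.client.local_client import Client

    # The LocalClient keeps the old in-process behaviour; a JSON client
    # can talk to the secured HTTP endpoint instead.
    client = Client(api_base_url=None, auth=None)
    client.trigger_dag(dag_id='example_dag', conf='{"param": 1}')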
Both utcnow() and now() return fractional seconds. These are sometimes used in primary keys (e.g. in task_instance). If MySQL is not configured to store these fractional seconds, a primary key might fail (e.g. at session.merge), resulting in a duplicate entry being added, or worse. Postgres does store fractional seconds if left unconfigured; SQLite needs to be examined.
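The failure mode, illustratively:

    from datetime import datetime

    # Two distinct timestamps within the same second...
    t1 = datetime(2016, 1, 1, 12, 0, 0, 250000)
    t2 = datetime(2016, 1, 1, 12, 0, 0, 750000)

    # ...collide once the database drops the microseconds, so a primary
    # key that includes the timestamp raises a duplicate entry error.
    assert t1.replace(microsecond=0) == t2.replace(microsecond=0)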
Dear Airflow Maintainers,
Please accept this PR that addresses the following issues:
- https://issues.apache.org/jira/browse/AIRFLOW-512
Testing Done:
- N/A, but ran core tests: `./run_unit_tests.sh tests.core:CoreTest -s`
Closes #1800 from dgingrich/master
The Travis cache can contain faulty files. This results in builds that fail, as they depend on certain components being available, e.g. hive. This addresses the issue for hive by re-downloading if unpacking fails.
- Tell gunicorn to prepend `[ready]` to the worker process name once the worker is ready (to serve requests) - in particular this happens after the DAGs folder is parsed
- The Airflow cli runs gunicorn as a child process instead of `execvp`-ing over itself
- The Airflow cli monitors gunicorn worker processes and restarts them by sending TTIN/TTOU signals to the gunicorn master process (see the sketch after these notes)
- Fix bug where `conf.get('webserver', 'workers')` and `conf.get('webserver', 'webserver_worker_timeout')` were ignored
- Alternatively, https://github.com/apache/incubator-airflow/pull/1684/files does the same thing, but the worker-restart script is provided separately for the user to run
- Start airflow, observe that workers are restarted
- Add new dags to dags folder and check that they show up
- Run `siege` against airflow while server is restarting and confirm that all requests succeed
- Run with configuration set to `batch_size = 0`, `batch_size = 1` and `batch_size = 4`
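A minimal sketch of the rolling-restart mechanism described above (pid handling and readiness checks simplified):

    import os
    import signal
    import time

    def roll_workers(gunicorn_master_pid, num_workers):
        # Spawn fresh workers first, then retire the old ones.
        for _ in range(num_workers):
            os.kill(gunicorn_master_pid, signal.SIGTTIN)  # +1 worker
            time.sleep(1)  # wait for the new worker to report [ready]
        for _ in range(num_workers):
            os.kill(gunicorn_master_pid, signal.SIGTTOU)  # -1 worker
            time.sleep(1)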
Closes #1685 from zodiac/xuanji_gunicorn_rolling_restart_2
Instead of parsing the DAG definition files in the same process as the
scheduler, this change parses the files in a child process. This helps
to isolate the scheduler from bad user code.
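Conceptually (a simplified sketch, not the scheduler's actual code):

    import multiprocessing

    def parse_dag_file(file_path, result_queue):
        # Runs in a child process: a crash or hang in user code
        # cannot take down the scheduler itself.
        dags = []  # placeholder for the DAG objects parsed from file_path
        result_queue.put((file_path, dags))

    queue = multiprocessing.Queue()
    proc = multiprocessing.Process(
        target=parse_dag_file, args=('/dags/my_dag.py', queue))
    proc.start()
    proc.join(timeout=30)  # bound the damage from pathological files
    if proc.is_alive():
        proc.terminate()  # kill parsers that exceed the timeout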
Closes #1636 from plypaul/plypaul_schedule_by_file_rebase_master