Submitting on behalf of plypaul
Please accept this PR that addresses the following
issues:
- https://issues.apache.org/jira/browse/AIRFLOW-219
- https://issues.apache.org/jira/browse/AIRFLOW-398
Testing Done:
- Running on Airbnb prod (though on a different
mergebase) for many months
Credits:
Impersonation Work: georgeke did most of the work
but plypaul did quite a bit of work too.
Cgroups: plypaul did most of the work, I just did some touch up/bug
fixes (see commit history, cgroups + impersonation commit is actually
plypaul's, not mine)
Closes #1934 from aoen/ddavydov/cgroups_and_impersonation_after_rebase
Added a dag.catchup option and modified the scheduler to look at the
value when scheduling DagRuns (by moving dag.start_date up to
dag.previous_schedule), and added a config option catchup_by_default
(defaults to True) that allows users to set this to False for all
DAGs without modifying the existing DAGs.
In addition, we added a test to jobs.py
(test_dag_catchup_option)
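For illustration, a minimal sketch of opting a single DAG out of
catchup (parameter names as described above; the global default comes
from catchup_by_default in airflow.cfg):

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

# With catchup disabled, only the most recent schedule interval is
# turned into a DagRun; past intervals since start_date are skipped.
dag = DAG(
    dag_id="no_backfill_example",
    start_date=datetime(2016, 1, 1),
    schedule_interval=timedelta(days=1),
    catchup=False,  # overrides the catchup_by_default config option
)

noop = DummyOperator(task_id="noop", dag=dag)
```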
Closes #1830 from btallman/NoBackfill_clean_feature
This implements a framework for API calls to Airflow. Currently
all access is done by the CLI or web UI. Especially in the context
of the CLI this raises security concerns, which can be alleviated
with a secured API call over the wire.
Secondly, integration with other systems is a bit harder if you have
to call a CLI. For public-facing endpoints JSON is used.
As an example, the trigger_dag functionality is now made into an
API call.
Backwards compatibility is retained by switching to a LocalClient.
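As a rough sketch of what the client abstraction looks like from user
code (class and method names follow the airflow.api.client layout and
may differ slightly by version):

```python
from airflow.api.client.local_client import Client

# The LocalClient talks to the metadata database in-process, preserving
# the old behaviour; a JSON-over-HTTP client can be substituted through
# configuration without changing this calling code.
client = Client(api_base_url=None, auth=None)
client.trigger_dag(dag_id="example_dag", conf='{"param": "value"}')
```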
Dear Airflow Maintainers,
Please accept this PR that addresses the following
issues:
- [AIRFLOW-96](https://issues.apache.org/jira/browse/AIRFLOW-96): allow
the "s3_conn_id" parameter of S3KeySensor and S3PrefixSensor to be
defined using an environment variable.
S3KeySensor and S3PrefixSensor use the S3Hook, which extends BaseHook.
BaseHook has get_connection, which looks up a connection:
- in environment variables first
- and then in the database
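For illustration, a minimal sketch of supplying that connection purely
through the environment (using the AIRFLOW_CONN_ prefix convention; the
sensor import path shown is the 1.7-era one and may differ by version):

```python
import os
from datetime import datetime

from airflow import DAG
from airflow.operators.sensors import S3KeySensor  # 1.7-era import path

# BaseHook.get_connection("s3_default") sees this variable before it ever
# queries the connection table (normally it would be exported in the
# worker's environment rather than set inside the DAG file).
os.environ["AIRFLOW_CONN_S3_DEFAULT"] = "s3://access_key:secret_key@"

dag = DAG("s3_sensor_example", start_date=datetime(2016, 1, 1))

wait_for_file = S3KeySensor(
    task_id="wait_for_file",
    bucket_key="s3://my-bucket/data/{{ ds }}/_SUCCESS",
    s3_conn_id="s3_default",
    dag=dag,
)
```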
Closes #1517 from dm-tran/fix-jira-airflow-96
- Update the tutorial with a comment helping to explain the use of default_args and
include all the possible parameters inline
- Clarify in the FAQ the possibility of an unexpected default `schedule_interval` in case
airflow users mistakenly try to overwrite the default `schedule_interval` in a DAG's
`default_args` parameter
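To make the FAQ point concrete, a small sketch of the mistake and the
fix (default_args is only handed to operator constructors, so a
schedule_interval placed there is silently ignored):

```python
from datetime import datetime

from airflow import DAG

default_args = {
    "owner": "airflow",
    "start_date": datetime(2016, 1, 1),
    # WRONG: DAG() never reads this key, so the DAG silently falls back
    # to the default schedule_interval of one day.
    "schedule_interval": "@hourly",
}

# RIGHT: pass schedule_interval to the DAG constructor itself.
dag = DAG("faq_example", default_args=default_args, schedule_interval="@hourly")
```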
* Distinguish between module and non-module plugin components
* Fix handling of non-module plugin components
  * admin views, flask blueprints, and menu links must not be wrapped
    in modules
* Fix improper use of zope.deprecation.deprecated
  * zope.deprecation.deprecated does NOT support classes as its first
    parameter
  * deprecating classes must be handled by calling the deprecate
    function on the class name
* Added tests for plugin loading
* Updated plugin documentation to match the test plugin
* Updated executors to always load plugins
* More logging
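For reference, a minimal plugin along the lines of the test plugin might
look like this (names are illustrative; note that the admin view,
blueprint, and menu links are plain objects, not wrapped in modules):

```python
from airflow.plugins_manager import AirflowPlugin
from flask import Blueprint
from flask_admin import BaseView, expose


class MyView(BaseView):
    @expose("/")
    def index(self):
        return self.render("my_plugin/index.html")


my_view = MyView(category="My Plugin", name="My View")
my_blueprint = Blueprint("my_plugin", __name__, template_folder="templates")


class MyPlugin(AirflowPlugin):
    name = "my_plugin"
    # Non-module components are listed directly as objects...
    admin_views = [my_view]
    flask_blueprints = [my_blueprint]
    menu_links = []
    # ...while operators/hooks/executors remain classes as before.
    operators = []
    hooks = []
    executors = []
```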
Closes #1738 from gwax/plugin_module_fixes
Dear Airflow Maintainers,
Please accept this PR that addresses the following
issues:
https://issues.apache.org/jira/browse/AIRFLOW-530
Right now, the documentation does not clearly state that connection
names are converted to uppercase when searched for in the environment
(https://github.com/apache/incubator-airflow/blob/master/airflow/hooks/base_hook.py#L60-L60).
This is confusing, as the best practice in Airflow seems to be to
define connections in lowercase.
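A small sketch of the behaviour in question (the conn_id and URI are
made up):

```python
import os

from airflow.hooks.base_hook import BaseHook

# The connection id is lowercase in DAG and hook code...
conn_id = "my_postgres"

# ...but the environment lookup upper-cases it, so the variable name
# must be AIRFLOW_CONN_ plus the upper-cased conn_id.
os.environ["AIRFLOW_CONN_MY_POSTGRES"] = "postgres://user:pass@dbhost:5432/mydb"

conn = BaseHook.get_connection(conn_id)
print(conn.host)  # "dbhost"
```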
Closes #1811 from danielzohar/connection_env_var
Dear Airflow Maintainers,
Please accept this PR that addresses the following
issues:
- https://issues.apache.org/jira/browse/AIRFLOW-198
Testing Done:
- Local testing of dag operation with
LatestOnlyOperator
- Unit test added
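A minimal usage sketch (the import path is the one added by this
change; adjust for your version):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.latest_only_operator import LatestOnlyOperator

dag = DAG(
    "latest_only_example",
    start_date=datetime(2016, 1, 1),
    schedule_interval="@daily",
)

# Downstream tasks are skipped for every run except the most recent one,
# so catching up old dates does not re-run side-effecting work.
latest_only = LatestOnlyOperator(task_id="latest_only", dag=dag)
do_work = DummyOperator(task_id="do_work", dag=dag)
do_work.set_upstream(latest_only)
```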
Closes #1752 from gwax/latest_only
SSL can now be enabled by providing certificate
and key in the usual
ways (config file or CLI options). Providing the
cert and key will
automatically enable SSL. The web server port will
not automatically
change.
The Security page in the docs now includes an SSL
section with basic
setup information.
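A sketch of the configuration side, assuming the option names are
web_server_ssl_cert / web_server_ssl_key under [webserver] (shown via
the env-var overrides rather than airflow.cfg):

```python
import os

# Equivalent to setting web_server_ssl_cert / web_server_ssl_key in the
# [webserver] section; export these in the environment that launches
# `airflow webserver`. Providing both enables HTTPS.
os.environ["AIRFLOW__WEBSERVER__WEB_SERVER_SSL_CERT"] = "/etc/ssl/certs/airflow.crt"
os.environ["AIRFLOW__WEBSERVER__WEB_SERVER_SSL_KEY"] = "/etc/ssl/private/airflow.key"

# The listen port is unchanged, so clients connect to
# https://<host>:<web_server_port> (8080 by default).
```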
Closes #1760 from caseyching/master
Dear Airflow Maintainers,
Please accept this PR that addresses the following
issues:
- https://issues.apache.org/jira/browse/AIRFLOW-512
Testing Done:
- N/A, but ran core tests: `./run_unit_tests.sh
tests.core:CoreTest -s`
Closes #1800 from dgingrich/master
Here is the original PR with Max's LGTM:
https://github.com/aoen/incubator-airflow/pull/1
Since then I have made some fixes but this PR is essentially the same.
It could definitely use more eyes as there are likely still issues.
**Goals**
- Simplify, consolidate, and make consistent the logic of whether or not
a task should be run
- Provide a view/better logging that gives insight into why a task
instance is not currently running (no more viewing the scheduler logs
to find out why a task instance isn't running for the majority of
cases):
![image](https://cloud.githubusercontent.com/assets/1592778/17637621/aa669f5e-6099-11e6-81c2-d988d2073aac.png)
**Notable Functional Changes**
- Webserver view + task_failing_deps CLI command to explain why a given
task instance isn't being run by the scheduler
- Running a backfill in the command line and running a task in the UI
will now display detailed error messages based on which dependencies
were not met for a task instead of appearing to succeed but actually
failing silently
- Maximum task concurrency and pools are now respected by backfills
- Backfill now has the equivalent of the old force flag to run even for
successful tasks
This will break one use case:
Using pools to restrict some resource on airflow executors themselves
(rather than an external resource like a DB), e.g. some task uses 60%
of cpu on a worker so we restrict that task's pool size to 1 to
prevent two of the tasks from running on the same host. When
backfilling a task of this type, now the backfill will wait on the
pool to have slots open up before running the task even though we
don't need to do this if backfilling on a different host outside of
the pool. I think breaking this use case is OK since the use case is a
hack due to not having a proper resource isolation solution (e.g.
mesos should be used in this case instead).
- To make things less confusing for users, there is now an "ignore all
dependencies" option for running tasks, "ignore dependencies" has been
renamed to "ignore task dependencies", and "force" has been renamed to
"ignore task instance state". The new "Ignore all dependencies" flag
will ignore the following:
- task instance's pool being full
- execution date for a task instance being in the future
- a task instance being in the retry waiting period
- the task instance's task ending prior to the task instance's
execution date
- task instance is already queued
- task instance has already completed
- task instance is in the shutdown state
- WILL NOT IGNORE task instance is already running
- SLA miss emails will now include all tasks that did not finish for a
particular DAG run, even if
the tasks didn't run because depends_on_past was not met for a task
- Tasks with pools won't get queued automatically the first time they
reach a worker; if they are ready to run they will be run immediately
- Running a task via the UI or via the command line (backfill/run
commands) will now log why a task could not be run if one of its
dependencies isn't met. For tasks kicked off via the web UI this
means that tasks no longer silently fail to get queued despite a
successful message in the UI.
- Queuing a task into a pool that doesn't exist will now get stopped in
the scheduler instead of a worker
**Follow Up Items**
- Update the docs to reference the new explainer views/CLI command
Closes #1729 from aoen/ddavydov/blockedTIExplainerRebasedMaster
Add Google authentication backend.
Add Google authentication information to security
docs.
Dear Airflow Maintainers,
Please accept this PR that addresses the following
issues:
- https://issues.apache.org/jira/browse/AIRFLOW-444
Testing Done:
- Tested Google authentication backend locally
with no issues
This is mostly an adaptation of the GHE
authentication backend.
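A hedged sketch of enabling the backend, assuming the module path
mirrors the other contrib auth backends (the OAuth client credentials
live in their own config section, omitted here):

```python
import os

# Equivalent airflow.cfg entries would go in the [webserver] section.
os.environ["AIRFLOW__WEBSERVER__AUTHENTICATE"] = "True"
os.environ["AIRFLOW__WEBSERVER__AUTH_BACKEND"] = (
    "airflow.contrib.auth.backends.google_auth"
)
```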
Closes #1747 from ananya77041/google_auth_backend
AIRFLOW-328
https://issues.apache.org/jira/browse/AIRFLOW-328
Previously, Airflow had both a default template for airflow.cfg AND a
dictionary of default values. Frequently, these get out of sync (an
option in one has a different value than in the other, or isn’t present
in the other). This commit removes the default dict and uses the
airflow.cfg template to provide defaults. The ConfigParser first reads
the template, loading all the options it contains, and then reads the
user’s actual airflow.cfg to overwrite the default values with any new
ones.
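The layering is essentially a standard two-pass ConfigParser read; a
self-contained sketch with illustrative option values:

```python
from configparser import ConfigParser

# The packaged template supplies a value for every option...
DEFAULT_TEMPLATE = """
[core]
parallelism = 32
load_examples = True
"""

# ...so the user's airflow.cfg only needs to list what it overrides.
USER_CONFIG = """
[core]
parallelism = 64
"""

parser = ConfigParser()
parser.read_string(DEFAULT_TEMPLATE)  # first pass: load all defaults
parser.read_string(USER_CONFIG)       # second pass: user values win

print(parser.get("core", "parallelism"))    # 64   (user override)
print(parser.get("core", "load_examples"))  # True (template default)
```

(In Airflow the second pass reads the real airflow.cfg from disk;
strings are used here only to keep the sketch self-contained.)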
AIRFLOW-371
https://issues.apache.org/jira/browse/AIRFLOW-371
Calling test_mode() didn't actually change Airflow's configuration! This actually wasn't an issue in unit tests because the unit test run script was hardcoded to point at the unittest.cfg file, but it needed to be fixed.
[AIRFLOW-328] Remove redundant default configuration
[AIRFLOW-371] Make test_mode() functional
Previously, calling test_mode() didn’t actually
do anything.
This PR renames it to load_test_config() (to
avoid confusion, ht @r39132).
In addition, manually entering test_mode after
Airflow launches might be too late — some
options have already been loaded (DAGS_FOLDER,
etc.). This makes it so setting
tests/unit_test_mode OR the equivalent env var
(AIRFLOW__TESTS__UNIT_TEST_MODE) will load the
test config immediately, prior to loading the
rest of Airflow.
Closes #1677 from jlowin/Simplify-config
- Added Apache license header for files with the following extensions: .service, .in, .mako, .properties, .ini, .sh, .ldif, .coveragerc, .cfg, .yml, .conf, .sql, .css, .js, .html, .xml.
- Added/Replaced shebang on all .sh files with the portable version: #!/usr/bin/env bash.
- Skipped third-party css and js files. Skipped all minified js files as well.
Closes #1598 from ajayyadava/248
This particular issue arises because of an alignment issue between
start_date and schedule_interval. This can only happen with cron-based
schedule_intervals that describe absolute points in time (like “1am”),
as opposed to time deltas (like “every hour”).
In the past (and in the docs) we have simply said that users must make
sure the two params agree, but this is counterintuitive: in these
cases, start_date is more like telling the scheduler to
“start paying attention” than “this is my first execution date”.
This patch changes the behavior of the scheduler. The next run date of
the dag will be treated as "start_date + interval" unless the start_date
is on the (previous) interval in which case the start_date will be the
next run date.
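A worked example of the new behaviour using croniter (the library
Airflow relies on for cron schedules); dates are illustrative:

```python
from datetime import datetime

from croniter import croniter

schedule = "0 1 * * *"                    # "every day at 1am"
start_date = datetime(2016, 1, 1, 5, 0)   # NOT aligned with the schedule

# The scheduler now effectively rounds forward to the next point on the
# schedule after start_date and treats that as the first run date:
first_run = croniter(schedule, start_date).get_next(datetime)
print(first_run)  # 2016-01-02 01:00:00

# If start_date already sits on the schedule (e.g. 2016-01-01 01:00),
# start_date itself is used as the next run date instead.
```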
Dear Airflow Maintainers,
Please accept this PR that addresses the following issues:
- *https://issues.apache.org/jira/browse/AIRFLOW-155*
Thanks,
Sumit
Author: Sumit Maheshwari <sumitm@qubole.com>
Closes #1560 from msumit/AIRFLOW-155.
Currently dags are being read directly from the filesystem. Any
hierarchy (python namespaces, modules) needs to be reflected on
the filesystem. This makes it hard to manage dags and their
dependencies.
This patch adds support for dags in zip files. It adds the zip to
sys.path, then reads the zip file and tries to import any files at
the root of the zip as modules.
Please note that any module contained within the zip will
overwrite existing modules in the same namespace.
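A small sketch of packaging a DAG this way (paths are illustrative):
the DAG module sits at the root of the archive, and the archive goes
into the DAGS_FOLDER.

```python
import zipfile

# my_dag.py defines a DAG object at module level; packages it depends on
# can be bundled alongside it because the whole zip is added to sys.path.
with zipfile.ZipFile("my_dag.zip", "w") as zf:
    zf.write("my_dag.py", arcname="my_dag.py")              # imported as a DAG module
    zf.write("helpers/util.py", arcname="helpers/util.py")  # rides along as a dependency

# Drop my_dag.zip into the DAGS_FOLDER; only files at the root of the
# zip are imported when looking for DAG objects.
```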
- Operators can be created without DAGs, but the DAG can be added at
any time thereafter (by assigning to the ‘dag’ attribute). Once a DAG
is assigned, it cannot be removed or reassigned.
- Operators can infer DAGs from other operators. Setting a relationship
will also set the DAG, if possible. Operators from different DAGs and
operators with no DAGs cannot be chained.
- DAGs can be used as context managers. When “inside” a DAG context
manager, the default DAG for all new Operators is that DAG (unless they
specify a different one)
- Unit tests
- Add default owner for Operators
- Support composing operators with >> and <<
Three special cases:
op1 >> op2 is equivalent to op1.set_downstream(op2)
op1 << op2 is equivalent to op1.set_upstream(op2)
dag >> op1 (in any order or direction) means op1.dag = dag
These can be chained:
dag >> op1 >> op2 << op3
- Update concepts documentation
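Putting the pieces together, a short sketch of the new composition
style:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

with DAG("composition_example", start_date=datetime(2016, 1, 1)) as dag:
    # Operators created inside the context manager pick up `dag`
    # automatically unless they are given a different one.
    op1 = DummyOperator(task_id="op1")
    op2 = DummyOperator(task_id="op2")
    op3 = DummyOperator(task_id="op3")

    # op1 runs before op2, and op3 also runs before op2.
    op1 >> op2 << op3

# The shift operators also attach a DAG to a free-standing operator:
op4 = DummyOperator(task_id="op4")
dag >> op4  # equivalent to op4.dag = dag
```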
The plugins tutorial was lacking in the following ways:
1. I wasn't sure where my template should live
2. I wasn't aware that both the TestView and Blueprint were necessary
In lieu of a code refactor, here's my suggestion on how to make the documentation more helpful from the perspective of someone who doesn't have experience with Flask Blueprints and Flask Admin, which can spare others the deep-dive into the code and supporting libs that I just did!