It was possible to "block" the scheduler so that it would not
schedule or queue tasks for a DAG if you triggered a DAG run while the
DAG was already at its maximum number of active runs.
This approach works around the problem for now, but a better longer-term
fix would be to introduce a "queued" state for DagRuns: when manually
creating dag runs (or clearing), set them to queued, and have only the
scheduler move DagRuns to running, nothing else -- this would mean we
wouldn't need to examine active runs in the TI part of the scheduler
loop, only in the DagRun creation part.
Fixes #11582
This was messing up the "max_active_runs" calculation, and this fix is a
"hack" until we take the better approach of adding a queued state to
DagRuns -- at which point we won't have to do this calculation at
all.
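For illustration, one possible shape of that future queued-state flow (a sketch only -- a "queued" DagRun state does not exist yet, and the function name and state strings here are purely illustrative):

```python
from airflow.models import DagRun


def promote_queued_dagruns(session, dag_id, max_active_runs):
    """Sketch: only the scheduler moves DagRuns from 'queued' to 'running'."""
    running = (
        session.query(DagRun)
        .filter(DagRun.dag_id == dag_id, DagRun.state == "running")
        .count()
    )
    slots = max_active_runs - running
    if slots <= 0:
        return  # already at max_active_runs; manual/cleared runs stay queued
    queued = (
        session.query(DagRun)
        .filter(DagRun.dag_id == dag_id, DagRun.state == "queued")
        .order_by(DagRun.execution_date)
        .limit(slots)
        .all()
    )
    for dag_run in queued:
        dag_run.state = "running"
```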
This PR introduces a creating_job_id column in the DagRun table that links a
DagRun to the job that created it. Part of #11302
Co-authored-by: Kaxil Naik <kaxilnaik@gmail.com>
Although these lists are short, there's no need to re-create them each
time, and also no need for them to be a method.
I have made them lowercase (`finished`, `running`) instead of uppercase
(`FINISHED`, `RUNNING`) to distinguish them from the actual states.
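Roughly, the change has this shape (a sketch -- the exact membership of each set here is illustrative, not copied from the real `State` class):

```python
class State:
    SUCCESS = "success"
    FAILED = "failed"
    UPSTREAM_FAILED = "upstream_failed"
    SKIPPED = "skipped"
    RUNNING = "running"
    QUEUED = "queued"

    # Previously these were methods that rebuilt a list on every call.
    # Now they are plain class-level frozensets, lowercase so they are
    # not mistaken for individual states such as State.RUNNING.
    finished = frozenset([SUCCESS, FAILED, UPSTREAM_FAILED, SKIPPED])
    running = frozenset([RUNNING, QUEUED])
```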
* Spend less time waiting for LocalTaskJob's subprocess to finish
This is about a 20% speed-up for short-running tasks!
This change doesn't affect the "duration" reported in the TI table, but
it does affect the time before the slot is freed up in the executor -
which does affect overall task/dag throughput.
(All these tests are with the same BashOperator tasks, just running `echo 1`.)
**Before**
```
Task airflow.executors.celery_executor.execute_command[5e0bb50c-de6b-4c78-980d-f8d535bbd2aa] succeeded in 6.597011625010055s: None
Task airflow.executors.celery_executor.execute_command[0a39ec21-2b69-414c-a11b-05466204bcb3] succeeded in 6.604327297012787s: None
```
**After**
```
Task airflow.executors.celery_executor.execute_command[57077539-e7ea-452c-af03-6393278a2c34] succeeded in 1.7728257849812508s: None
Task airflow.executors.celery_executor.execute_command[9aa4a0c5-e310-49ba-a1aa-b0760adfce08] succeeded in 1.7124666879535653s: None
```
**After, including change from #11372**
```
Task airflow.executors.celery_executor.execute_command[35822fc6-932d-4a8a-b1d5-43a8b35c52a5] succeeded in 0.5421732050017454s: None
Task airflow.executors.celery_executor.execute_command[2ba46c47-c868-4c3a-80f8-40adaf03b720] succeeded in 0.5469810889917426s: None
```
We seem to have a problem with running all tests at once - most
likely due to some resource problems in our CI - so it makes
sense to split the tests into more batches. This is not yet a full
implementation of selective tests, but it is a step in that direction,
splitting the tests into Core/Providers/API/CLI batches. The full selective
tests approach will be implemented as part of the #10507 issue.
This split is possible thanks to #10422, which moved building the image
to a separate workflow - this way each image is only built once
and uploaded to a shared registry, from which it is quickly
downloaded rather than being built by each job separately - this
way we can have many more jobs, as there is very little per-job
overhead before the tests start running.
* Fully support running more than one scheduler concurrently.
This PR implements scheduler HA as proposed in AIP-15. The high level
design is as follows:
- Move all scheduling decisions into SchedulerJob (requiring DAG
serialization in the scheduler)
- Use row-level locks to ensure schedulers don't stomp on each other
(`SELECT ... FOR UPDATE`)
- Use `SKIP LOCKED` for better performance when multiple schedulers are
running. (MySQL < 8 and MariaDB don't support this.) See the sketch after
this list.
- Scheduling decisions are not tied to the parsing speed, but can
operate just on the database
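As a rough illustration of that locking pattern, using SQLAlchemy's `with_for_update` (a sketch, not the exact query the scheduler runs):

```python
from airflow.models import DagRun
from airflow.utils.session import create_session

# Sketch: each scheduler locks a small batch of DagRun rows; rows already
# locked by another scheduler are skipped rather than waited on, so the
# schedulers never block (or stomp on) each other.
with create_session() as session:
    dag_runs = (
        session.query(DagRun)
        .filter(DagRun.state == "running")
        .with_for_update(skip_locked=True)
        .limit(10)
        .all()
    )
    for dag_run in dag_runs:
        ...  # make scheduling decisions while holding the row locks
```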
*DagFileProcessorProcess*:
Previously this component was responsible for more than its name might
imply: as well as parsing the DAG files, it was responsible for creating
DagRuns, and for making scheduling decisions about TIs, moving them from
"None" to "scheduled" state.
This commit changes it so that the DagFileProcessorProcess will now
update the SerializedDAG row for this DAG, and make no scheduling
decisions itself.
To make the scheduler's job easier (so that it can make as many
decisions as possible without having to load the possibly-large
SerializedDAG row) we store/update some columns on the DagModel table:
- `next_dagrun`: The execution_date of the next dag run that should be created (or
None)
- `next_dagrun_create_after`: The earliest point at which the next dag
run can be created
Pre-computing these values (and updating them every time the DAG is
parsed) reduces the overall load on the DB, as many decisions can be taken
by selecting just these two columns/the small DagModel row.
When a DAG is at max_active_runs, or for `@once` DAGs, these columns are
set to null, meaning "don't create any dag runs".
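A sketch of how the scheduler can then find DAGs that need a new run from these two small columns alone (the query and function name are illustrative; `session` is assumed to be an open SQLAlchemy session):

```python
from airflow.models import DagModel
from airflow.utils import timezone

# Sketch: a NULL next_dagrun_create_after means "create nothing", so a
# simple comparison against "now" is all that is needed per loop.
def dags_needing_new_dagruns(session, limit=10):
    return (
        session.query(DagModel)
        .filter(
            DagModel.is_paused.is_(False),
            DagModel.next_dagrun_create_after <= timezone.utcnow(),
        )
        .limit(limit)
        .all()
    )
```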
*SchedulerJob*
The SchedulerJob used to only queue/send tasks to the executor after
they were parsed, and returned from the DagFileProcessorProcess.
This PR breaks the link between parsing and enqueuing of tasks. Instead
of looking at DAGs as they are parsed, we now:
- store a new datetime column, `last_scheduling_decision` on DagRun
table, signifying when a scheduler last examined a DagRun
- Each time around the loop the scheduler will get (and lock) the next
_n_ DagRuns via `DagRun.next_dagruns_to_examine`, prioritising DagRuns
which haven't been touched by a scheduler in the longest period
- SimpleTaskInstance etc have been almost entirely removed now, as we
use the serialized versions
* Move callbacks execution from Scheduler loop to DagProcessorProcess
* Don’t run verify_integrity if the Serialized DAG hasn’t changed
dag_run.verify_integrity is slow, and we don't want to call it every time -- only when the dag structure changes (which we can now know thanks to DAG Serialization).
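A sketch of that check (assuming a `dag_run` and `session` in scope; `get_latest_version_hash` and the `dag_hash` column are used here as illustrative names for the stored serialized-DAG hash):

```python
from airflow.models.serialized_dag import SerializedDagModel

# Sketch: only pay the cost of verify_integrity when the DAG structure,
# as captured by the serialized DAG's hash, has actually changed.
latest_hash = SerializedDagModel.get_latest_version_hash(dag_run.dag_id, session=session)
if dag_run.dag_hash != latest_hash:
    dag_run.dag_hash = latest_hash
    dag_run.verify_integrity(session=session)
```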
* Add escape hatch to disable newly added "SELECT ... FOR UPDATE" queries
We are worried that these extra uses of row-level locking will cause
problems on MySQL 5.x (most likely deadlocks), so we are providing users
with an "escape hatch" to be able to make these queries non-locking -- this
means that only a single scheduler should be run, but being able to run
one is better than having the scheduler crash.
Co-authored-by: Kaxil Naik <kaxilnaik@gmail.com>
This PR allows for partial import error tracebacks to be exposed on the UI, if enabled. This extra context can be very helpful for users without access to the parsing logs to determine why their DAGs are failing to import properly.
This can happen when a task is enqueued by one executor, and then that
scheduler dies/exits.
The default fallback behaviour is unchanged -- queued tasks are
cleared and then later rescheduled.
But for Celery, we can do better -- if we record the Celery-generated
task_id, we can then re-create the AsyncResult objects for orphaned
tasks at a later date.
However, since Celery just reports all AsyncResult as "PENDING", even if
they aren't tasks currently in the broker queue, we need to apply a
timeout to "unblock" these tasks in case they never actually made it to
the Celery broker.
This all means that we can adopt tasks that have been enqueued by another
CeleryExecutor if it dies, without having to clear the task and slow
things down. This is especially useful as the task may have already started
running, and while clearing it would stop it, it's better if we don't
have to reset it!
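Roughly, adoption looks like the sketch below (`app` is assumed to be the executor's Celery app, `orphaned_task_instances` the queued TIs being handed over, and the recorded Celery task id is shown stored on the TaskInstance as `external_executor_id`; the timeout value is illustrative):

```python
import time

from celery.result import AsyncResult

# Sketch: re-create result handles for tasks a dead CeleryExecutor had
# already sent to the broker, keyed by the recorded Celery task id.
adopted = {}
for ti in orphaned_task_instances:
    adopted[ti.key] = (AsyncResult(ti.external_executor_id, app=app), time.monotonic())

# Celery reports unknown ids as PENDING too, so anything still PENDING
# after a grace period is assumed to have never reached the broker and
# is handed back to the scheduler to be rescheduled instead.
STALLED_TIMEOUT = 600  # seconds
for key, (result, adopted_at) in list(adopted.items()):
    if result.state == "PENDING" and time.monotonic() - adopted_at > STALLED_TIMEOUT:
        del adopted[key]
```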
Co-authored-by: Kaxil Naik <kaxilnaik@apache.org>
Once HA mode for the scheduler lands, we can no longer reset orphaned
tasks by looking at the tasks in (the memory of) the current executor.
This changes it to keep track of which (Scheduler)Job queued/scheduled a
TaskInstance (the new "queued_by_job_id" column stored on the
TaskInstance table), and then we can use the existing heartbeat
mechanism for jobs to notice when a TI should be reset.
As part of this, the existing implementation of
`reset_state_for_orphaned_tasks` has been moved out of BaseJob into
BackfillJob -- as only it and SchedulerJob had these methods, and the
SchedulerJob version now operates differently.
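A hedged sketch of the detection query (illustrative only -- the threshold and exact filters differ in the real code; `session` is assumed to be in scope):

```python
from datetime import timedelta

from airflow.jobs.base_job import BaseJob
from airflow.models import TaskInstance
from airflow.utils import timezone
from airflow.utils.state import State

# Sketch: a queued TI is considered orphaned if the job recorded in its
# queued_by_job_id has vanished or stopped heartbeating.
heartbeat_limit = timezone.utcnow() - timedelta(seconds=30)  # illustrative threshold
orphaned_tis = (
    session.query(TaskInstance)
    .outerjoin(BaseJob, TaskInstance.queued_by_job_id == BaseJob.id)
    .filter(TaskInstance.state == State.QUEUED)
    .filter((BaseJob.id.is_(None)) | (BaseJob.latest_heartbeat < heartbeat_limit))
    .all()
)
```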
We've observed the tests for the last couple of weeks and it seems
most of the tests marked with the "quarantine" marker are succeeding
in a stable way (https://github.com/apache/airflow/issues/10118).
The removed tests have a success ratio of > 95% (20 runs without
problems), and this was verified a week ago as well,
so it seems they are rather stable.
There are literally a few that are either failing or causing
the Quarantined builds to hang. I manually reviewed the
master tests that failed over the last few weeks and added the
tests that were causing the builds to hang.
It seems that stability has improved - which might have been caused
by some temporary problems around the time we marked the quarantined builds,
or by a too "generous" way of marking tests as quarantined. Or
maybe the improvement comes from #10368, as the Docker engine
and machines used to run the builds in GitHub experience far
less load (image builds are executed in separate builds), so
it might be that resource usage has decreased. Another reason
might be GitHub Actions stability improvements.
Or simply those tests are more stable when run in isolation.
We might still add failing tests back as soon as we see them behave
in a flaky way.
The remaining quarantined tests that need to be fixed:
* test_local_run (often hangs the build)
* test_retry_handling_job
* test_clear_multiple_external_task_marker
* test_should_force_kill_process
* test_change_state_for_tis_without_dagrun
* test_cli_webserver_background
We also move some of those tests to the "heisentests" category.
Those tests run fine in isolation but fail
the builds when run with all other tests:
* TestImpersonation tests
We might find that those heisentests can be fixed, but for
now we are going to run them in isolation.
Also - since those quarantined tests are failing more often,
the "num runs" to track for them has been decreased to 10,
so we keep track of the last 10 runs only.
The `@provide_session` wrapper will already commit the transaction when
the wrapped function returns, unless an explicit session is passed in --
removing this parameter changes the behaviour to be:
- If session explicitly passed in: don't commit (caller's
responsibility)
- If no session passed in, `@provide_session` will commit for us already.
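A simplified sketch of what the decorator does (not the exact Airflow implementation):

```python
import functools

from airflow.settings import Session  # Airflow-configured sessionmaker


def provide_session(func):
    """Simplified sketch of @provide_session's commit semantics."""

    @functools.wraps(func)
    def wrapper(*args, session=None, **kwargs):
        if session is not None:
            # Caller supplied a session: use it, and leave commit/rollback
            # to the caller.
            return func(*args, session=session, **kwargs)
        # No session supplied: create one, commit on success, always close.
        new_session = Session()
        try:
            result = func(*args, session=new_session, **kwargs)
            new_session.commit()
            return result
        except Exception:
            new_session.rollback()
            raise
        finally:
            new_session.close()

    return wrapper
```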
Perf_kit was a separate folder and it was a problem when we tried to
build it from Docker-embedded sources, because there was a hidden,
implicit dependency between tests (conftest) and perf.
Perf_kit is now moved into tests so that it is available in the CI image
also when we run tests without the sources mounted.
This is changing back in #10441, and we need to move perf_kit
for that to work.
* Query TaskReschedule only if task is UP_FOR_RESCHEDULE
* Query for single TaskReschedule when possible
* Apply suggestions from code review
Co-authored-by: Stefan Seelmann <mail@stefan-seelmann.de>
* Adjust mocking in tests
* fixup! Adjust mocking in tests
* fixup! fixup! Adjust mocking in tests
Co-authored-by: Stefan Seelmann <mail@stefan-seelmann.de>
As part of the scheduler HA work we are going to want to separate the
parsing from the scheduling, so this changes the tests to ensure that
the important methods of DagFileProcessor can do everything they need to
when given a SerializedDAG, not just a DAG -- i.e. that we have correctly
serialized all the necessary fields.
Both SchedulerJob and LocalTaskJob have their own timers and decide when
to call heartbeat based upon them. This makes those functions harder to
follow (and the logs more confusing), so I've moved the logic into BaseJob.
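The consolidated logic amounts to something like this (a simplified sketch of rate-limited heartbeating on BaseJob, with names slightly abstracted from the real code):

```python
from airflow.utils import timezone


def heartbeat(self, only_if_necessary: bool = False):
    """Sketch: callers just call heartbeat(); BaseJob decides if one is due."""
    if only_if_necessary and self.latest_heartbeat:
        elapsed = (timezone.utcnow() - self.latest_heartbeat).total_seconds()
        if elapsed < self.heartrate:
            return  # a recent heartbeat already covers this interval
    self.latest_heartbeat = timezone.utcnow()
    self.heartbeat_callback()  # subclass hook (SchedulerJob, LocalTaskJob, ...)
```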
In debugging another test I noticed that the scheduler was spending a
long time waiting for a "simple" dag to be parsed. But upon closer
inspection, the parsing process itself was done in a few milliseconds --
we just weren't harvesting the results in a timely fashion.
This change uses the `sentinel` attribute of multiprocessing.Connection
(added in Python 3.3) to be able to wait for all the processes, so that
as soon as one has finished we get woken up and can immediately harvest
and pass on the parsed dags.
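A hedged sketch of the waiting pattern (simplified; `waitables`, `poll_interval`, and `harvest_result` are illustrative names -- the real manager tracks per-file processors and their pipes):

```python
from multiprocessing.connection import wait

# Sketch: block until at least one parsing process has something for us,
# instead of polling on a fixed interval.
while waitables:
    ready = wait(list(waitables), timeout=poll_interval)
    for waitable in ready:
        processor = waitables.pop(waitable)
        harvest_result(processor)  # collect the freshly parsed DAGs
```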
This makes test_scheduler_job.py about twice as quick, and also reduces
the time the scheduler spends between tasks.
In real workloads, or where there are lots of dags, this likely won't
amount to such a huge speed-up, but it does for our (synthetic) elastic
performance test dag.
These were the timings for the dag to run all the tasks in a single dag
run to completion, with PERF_SCHEDULE_INTERVAL='1d' and PERF_DAGS_COUNT=1,
varying the shape and task count:
PERF_SHAPE=linear PERF_TASKS_COUNT=12:
**Before**: 45.4166s
**After**: 16.9499s
PERF_SHAPE=linear PERF_TASKS_COUNT=24:
**Before**: 82.6426s
**After**: 34.0672s
PERF_SHAPE=binary_tree PERF_TASKS_COUNT=24:
**Before**: 20.3802s
**After**: 9.1400s
PERF_SHAPE=grid PERF_TASKS_COUNT=24:
**Before**: 27.4735s
**After**: 11.5607s
If you have many more dag **files**, this likely won't be your bottleneck.
We now have a mechanism to keep the release notes for the
backport operators updated in an automated way.
It really nicely generates all the necessary information:
* summary of requirements for each backport package
* list of dependencies (including extras to install them) when package
depends on other providers packages
* table of new hooks/operators/sensors/protocols/secrets
* table of moved hooks/operators/sensors/protocols/secrets with
information where they were moved from
* changelog of all the changes to the provider package (this will be
automatically updated with an incremental changelog whenever we decide to
release separate packages)
The system is fully automated - we will be able to produce release notes
automatically (per-package) whenever we decide to release a new version of
a package in the future.
* Set conf vals as env vars so spawned process can access values.
* Create custom env_vars context manager to control simple environment variables (see the sketch after this list).
* Use env_vars instead of conf_vars when using .
* When creating temporary environment variables, remove them if they didn't exist.
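The env_vars helper amounts to something like the sketch below (illustrative; the real test helper may differ in details):

```python
import os
from contextlib import contextmanager


@contextmanager
def env_vars(overrides):
    """Temporarily set environment variables, restoring (or removing) them on exit."""
    original = {}
    for key, value in overrides.items():
        original[key] = os.environ.get(key)  # None means "was not set before"
        os.environ[key] = value
    try:
        yield
    finally:
        for key, old_value in original.items():
            if old_value is None:
                os.environ.pop(key, None)  # didn't exist before, so remove it
            else:
                os.environ[key] = old_value
```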
I would like to create and use a pytest fixture as a parameter, but
they cannot be used on unittest.TestCase functions:
> unittest.TestCase methods cannot directly receive fixture arguments as
> implementing that is likely to inflict on the ability to run general
> unittest.TestCase test suites.