- Adds a new "infrastructure" section to the docs, which describes
architecture, administration and troubleshooting (fixes bug 1165259).
- Adds code comments to any deployment-related files in the repository.
- Adds documentation for the various ways in which users can access
Treeherder data (fixes bug 1335172).
- Reorganises the structure of some of the existing non-infrastructure
docs, to make the documentation easier to navigate.
In #4709 push ingestion was changed to use a new `store_pulse_pushes`
queue (since we now prefer the term "push" instead of "resultset"), with
the old queue left behind to ensure any tasks in it at the time of
deployment would still be run.
Now that the old queue is empty, it can be removed.
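For illustration, a minimal sketch of the queue declarations involved;
the old queue name below and the exact celery setting name (which varies
between celery versions) are assumptions, not the actual values:

```python
from kombu import Queue

task_queues = [
    Queue('store_pulse_pushes'),        # new queue, using the "push" naming
    # Queue('store_pulse_resultsets'),  # hypothetical old queue: consumed
    #                                   # until drained post-deploy, now gone
]
```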
Runnable jobs for buildbot were calculated via a celerybeat task
(that was disabled in #4007) and the results stored in the
`runnable_jobs` table. This can all be removed now that buildbot is
EOL, since the remaining support for Taskcluster runnable jobs does
not use that celery task/Django model.
As of #3980 (bug 1470622), the frontend no longer calls the
`/retrigger/`, `/cancel/` or `/cancel_all/` Treeherder APIs.
Whilst looking at the pulse related fixtures, I spotted that the
`mock_message_broker` fixture was already unused.
Now that consumers of OrangeFactor have been switched to the new
intermittent failures view UI/API, we can stop submitting failure
classifications to OrangeFactor's Elasticsearch instance.
The UI has already been removed. This cleans up the data ingestion
and removes the `JobDuration` model; however, it leaves the `running_eta`
field on the `Job` model for the next time that table is touched (the
table is large, so altering the schema would likely require downtime).
To ensure that:
* the Heroku router's 30 second timeout doesn't fire before gunicorn's
own timeout, given that routing time is included in Heroku's timing
* the app is more likely to remain responsive when receiving many
badly filtered API calls
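As a rough sketch, this could be expressed in a `gunicorn.conf.py`; the
exact value here is an assumption, the point being that it must stay
under Heroku's 30 seconds:

```python
# gunicorn gives up before the Heroku router does; 20s is illustrative.
timeout = 20
```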
Introduces a new environment variable, `PULSE_PUSH_SOURCES`.
Keep old `publish-resultset-runnable-job-action` task name by creating a
method that points to `publish_push_runnable_job_action`.
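A hedged sketch of the aliasing approach (the class name and surrounding
context are assumptions; only the two task/method names come from the
text above):

```python
class TreeherderPublisher:  # hypothetical surrounding class
    def publish_push_runnable_job_action(self, *args, **kwargs):
        ...  # the renamed implementation lives here

    # Old task name kept for backwards compatibility: anything still
    # referencing it is delegated to the new method.
    def publish_resultset_runnable_job_action(self, *args, **kwargs):
        return self.publish_push_runnable_job_action(*args, **kwargs)
```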
Now that we're no longer using datasource, the memory leaks previously
seen have gone. As such there is no need to make the celery worker
processes restart so frequently (they will now only be restarted at
the daily Heroku dyno restart), which will improve performance.
In addition, this will reduce the number of transactions that don't get
reported to New Relic (when the maxtasksperchild threshold is
reached, the worker is forcibly killed before it can submit the
New Relic payload).
The log parser tasks have been left unchanged for now, since they
appear to still be leaking.
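For reference, a hedged sketch of the kind of setting being dropped (the
option name is celery's standard one, though it is spelled differently
across celery versions; the old value shown is an assumption):

```python
from celery import Celery

app = Celery('treeherder')
# Previously, worker processes were recycled after a fixed number of
# tasks to contain the datasource leaks, e.g.:
#   app.conf.worker_max_tasks_per_child = 500
# Now the option is simply omitted, so processes persist until the
# daily Heroku dyno restart.
```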
Now that we're (hopefully) no longer using datasource, the leaks should
be gone. The gunicorn processes will now only be restarted at the
daily Heroku dyno restart, rather than multiple times per minute,
improving performance.
Outputting to the console rather than a log file:
* is more user-friendly during development
* is more consistent with Heroku
* means the Vagrant-specific Django LOGGING config is now closer to the
one in settings.py, and so more easily combined with it
Both gunicorn and celery default to outputting to stdout/stderr, so the
`logfile` options can be omitted entirely.
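A minimal sketch of a console-only Django LOGGING config of the kind
described (handler names and levels are assumptions, not Treeherder's
exact config):

```python
LOGGING = {
    'version': 1,
    'disable_existing_loggers': False,
    'handlers': {
        # Write to stderr, so output shows up in the console/Heroku logs.
        'console': {'class': 'logging.StreamHandler'},
    },
    'root': {
        'handlers': ['console'],
        'level': 'INFO',
    },
}
```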
This changes ingestion, the API endpoints, and the frontend to match
the new structure. For now we continue to store text_log_summary artifacts,
though they don't do anything anymore.
* Bug 1264074 - Move to_timestamp function to a reusable location
* Bug 1264074 - Refactor JobConsumer to have a PulseConsumer super class
Much of what was in the JobConsumer is reusable by the upcoming
ResultsetConsumer. So refactor those parts out so that each specific
consumer can reuse code as much as possible.
* Bug 1264074 - Add ability to ingest Github Resultsets via Pulse
This introduces a ResultsetConsumer and a read_pulse_resultsets
management command to ingest resultsets from the TaskCluster
github exchanges.
When a supported GitHub repo has a Pull Request created or
updated, or a push is made to master, a Pulse message is emitted.
We receive it, fetch any additional information we need from
GitHub's API, and store the Resultset.
This follows a very similar pattern to the Job Pulse ingestion.
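The rough shape of that refactor, as a sketch (method names and bodies
are assumptions; only the class names come from the text):

```python
class PulseConsumer:
    """Shared Pulse connection handling, reusable by specific consumers."""
    def listen(self):
        ...  # connect to the exchange and dispatch messages to on_message()

class JobConsumer(PulseConsumer):
    def on_message(self, body, message):
        ...  # queue the job payload for ingestion

class ResultsetConsumer(PulseConsumer):
    def on_message(self, body, message):
        # Fetch any additional details from GitHub's API, then queue the
        # Resultset for storage.
        ...
```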
* Bug 1264074 - Old code/comments cleanup
* Bug 1264074 - Tests for the Github resultset pulse loader
Rename ``ingest_from_pulse`` management command to ``read_pulse_jobs`` to
indicate that this step does not actually do any ingesting. It just populates
the celery queue ``store_pulse_jobs`` that DOES do the actual ingesting.
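A toy illustration of that split (signatures and helper names are
assumptions; only the command and queue/task names come from the text):

```python
from celery import shared_task

@shared_task(name='store_pulse_jobs')
def store_pulse_jobs(job_payload):
    ...  # the actual ingesting happens here, inside a celery worker

def read_pulse_jobs(messages):
    """Reads from Pulse and queues the work; does no ingesting itself."""
    for body in messages:
        store_pulse_jobs.apply_async(args=[body], queue='store_pulse_jobs')
```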
This is in preparation for soon reporting deploys to New Relic, which
would be too verbose to include in the Procfile directly. It also adds
additional log output (which follows the buildpack compile log formatting
convention) to make it easier to find and follow the release tasks on
Papertrail.
Uses the `set -euo pipefail` recommendation from:
http://redsymbol.net/articles/unofficial-bash-strict-mode/
The beta release-phase feature is making a backwards incompatible change
today - moving from an app.json (and corresponding buildpack) approach
to specifying a `release` Procfile entry that blocks the app deploy.
For more info, see:
https://devcenter.heroku.com/articles/release-phase?preview=1
Since we're not calculating ETAs (the UI does that once it knows the
start time and expected duration), we're calculating recent average
durations instead.
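A toy illustration of the idea (the values and field names here are
invented):

```python
from statistics import mean

# e.g. durations (in seconds) of the last N completed runs of a job type
recent_durations = [612, 594, 633, 601]
average_duration = mean(recent_durations)
# The UI can then combine this with a job's start time to show an ETA.
```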
This creates a 'runnable_job' table in the database, as well as an API
endpoint at /api/project/{branch}/runnable_jobs listing all existing
buildbot jobs and their symbols. A new daily task 'fetch_allthethings' is
added to update this table.
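For example, the endpoint can be queried like so (the host and branch
are illustrative):

```python
import requests

resp = requests.get(
    'https://treeherder.mozilla.org/api/project/mozilla-inbound/runnable_jobs/')
resp.raise_for_status()
data = resp.json()  # the listing of buildbot jobs and their symbols
```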
The MPL 2.0 terms state that as long as a LICENSE file is present, the
per-file header text is not required. See "Exhibit A" at the end of:
https://www.mozilla.org/MPL/2.0/
After the previous commit, the Objectstore is effectively "dead code".
So this commit removes all the dead code after anything left over in
the Objectstore has been drained and added to the DB.
By taking the concurrency down to 1, we can control how much memory is
used and increase throughput by assigning more dynos. This prevents us
from overflowing the memory allocations on each dyno due to the growth of
the build4hr process.
Previously all three of the buildapi ETL ingestion tasks (pending,
running, 4hr) were run under one queue. In order to be able to tell
issues (such as backlogs or leaks) apart, these have now been split onto
their own queues.
On Heroku, these queues now also have a dyno each - which should mean we
can easily tell which is leaking and possibly also downgrade from the
expensive performance dyno even before the leak is fixed.
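A hedged sketch of the routing after the split (the task paths and queue
names are assumptions; previously all three tasks shared one queue):

```python
task_routes = {
    'treeherder.etl.tasks.fetch_buildapi_pending': {'queue': 'buildapi_pending'},
    'treeherder.etl.tasks.fetch_buildapi_running': {'queue': 'buildapi_running'},
    'treeherder.etl.tasks.fetch_buildapi_4hr':     {'queue': 'buildapi_4hr'},
}
```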
I added a Procfile listing all the different Python services Treeherder needs.
Heroku provides deployment-specific settings via environment variables, so I had to modify the settings file to read from them where it didn't already. I created an environment variable, IS_HEROKU, which allows us to have Heroku-only configuration where needed.
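A sketch of that pattern (IS_HEROKU comes from the text; the setting
shown inside the branch is an invented example):

```python
import os

IS_HEROKU = 'IS_HEROKU' in os.environ

if IS_HEROKU:
    # Heroku-only configuration, driven by its environment variables.
    ALLOWED_HOSTS = ['.herokuapp.com']
```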
The db service is provided by Amazon RDS, which requires an SSL connection. To enable SSL in the MySQLdb Python client, I had to modify Datasource (and bump the version used).
The cache service is provided by the MemCachier Heroku addon. Heroku recommends using pylibmc, so I set it up according to the docs here: https://devcenter.heroku.com/articles/memcachier#python.
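A hedged sketch based on the linked guidance (the env var name is
MemCachier's; the exact backend and options may differ from what was
actually used):

```python
import os

CACHES = {
    'default': {
        'BACKEND': 'django.core.cache.backends.memcached.PyLibMCCache',
        'LOCATION': os.environ.get('MEMCACHIER_SERVERS', 'localhost:11211'),
    },
}
```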
The amqp service is provided by the CloudAMQP addon.
I added a post_compile script that runs every time we deploy. We should run every build step we require in there, like static asset minification, collection, etc.
To share the OAuth credentials among the various services I used an environment variable. I also added an option to export_project_credentials so that the credentials can be printed to stdout. This should come in handy when we need to update the environment-stored credentials with the ones in the db.