The Python buildpack's automatic collectstatic feature has been
rewritten so that it no longer performs an unnecessary (and
time-consuming) dry run on every deploy. As such, we can switch back to
the buildpack's automatic collectstatic:
https://github.com/heroku/heroku-buildpack-python/blob/master/bin/steps/collectstatic
Prior to this landing, I'll update us to the latest version of the
buildpack (using the `heroku buildpacks:set -i X ...` command) and
remove the `DISABLE_COLLECTSTATIC` environment variable.
[ci skip]
Once the `grunt build` has run, they are no longer required; they only
serve to bloat the slug size, increase the attack surface and overwrite
the Python .profile.d script's environment variables.
Since we're not calculating ETAs (the UI does that once it knows the
start time and expected duration), we're calculating recent average
durations instead.
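As a rough sketch of the idea (function and variable names here are
illustrative, not the actual implementation):

    # Average the durations of the most recent completed jobs; the UI can
    # combine this with a job's start time to show an ETA.
    def recent_average_duration(durations_seconds, window=10):
        recent = durations_seconds[-window:]
        if not recent:
            return None
        return sum(recent) / len(recent)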
This creates a 'runnable_job' table in the database, as well as an API
endpoint at /api/project/{branch}/runnable_jobs listing all existing
buildbot jobs and their symbols. A new daily task, 'fetch_allthethings',
is added to keep this table up to date.
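For illustration, the new endpoint could be queried like so (the
hostname and project name are examples; the response shape isn't shown):

    import requests

    resp = requests.get(
        'https://treeherder.mozilla.org/api/project/mozilla-inbound/runnable_jobs')
    runnable_jobs = resp.json()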
The Python buildpack's automatic collectstatic is slow, since it
performs a `--dry-run` first. To avoid that time penalty, we disable it
by
setting `DISABLE_COLLECTSTATIC` in the Heroku environment, and run
collectstatic manually in `bin/post_compile`. See:
https://github.com/heroku/heroku-buildpack-python/issues/252
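In Python terms, the manual step amounts to something like the
following sketch (the real `bin/post_compile` may simply shell out to
`manage.py collectstatic --noinput` instead):

    from django.core.management import call_command

    # Run collectstatic once, non-interactively and without a dry run.
    call_command('collectstatic', interactive=False, verbosity=0)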
This commit relies on the nodejs buildpack being added to the list of
buildpacks for the app, and prior to the Python buildpack. See:
https://devcenter.heroku.com/articles/using-multiple-buildpacks-for-an-app
The nodejs buildpack will automatically install the packages listed in
`dependencies` in package.json, so that we have the requirements for
the grunt build. We don't actually need node, or all of the files in
node_modules, once the grunt build has run, so in the future we could
try to remove them to reduce the resulting slug size (though it only
increased from 55MB to 70MB, so this isn't urgent).
The dist directory has been added to `.slugignore` to prevent the
in-repo directory from being uploaded, since we'll be generating a new
one as part of the deploy. Once `dist/` is deleted from master, that
entry can be removed.
This adds an autoclassify command and a detect_intermittents command.
The former is designed to take an incoming job with an error summary
and look for existing results marked as intermittent that are a close
match for the new result. At present only one matcher is implemented;
this requires an exact match in terms of test name, result and error
message. Matching is also constrained to be based on single lines; it
is anticipated that future iterations may add support for matching on
groups of lines.
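A sketch of the exact-match idea (attribute names are illustrative, not
the actual model fields):

    def precise_match(new_line, candidate_line):
        """Match only when test name, result and error message all agree."""
        return (new_line.test == candidate_line.test and
                new_line.result == candidate_line.result and
                new_line.message == candidate_line.message)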
The detect_intermittents command is designed to take a group of jobs
running on the same push and with the same build job (i.e. same
testsuite, same chunk, etc.) and look for new intermittents to add to
the database. This currently only looks for test failures where there
is at least one green job and one non-green job.
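Roughly, the heuristic looks like this (field names are illustrative):

    def candidate_intermittents(jobs):
        """Given jobs for the same push and job type, return failure lines
        that are candidate intermittents: the group must contain at least
        one green job and at least one non-green job."""
        greens = [j for j in jobs if j.result == 'success']
        failures = [j for j in jobs if j.result != 'success']
        if not greens or not failures:
            return []
        return [line for j in failures for line in j.failure_lines]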
There is currently no UI for seeing matches or for adding new
prototypical intermittents as match candidates. There is also no
integration with bugzilla; future development should add association
of frequent intermittents with bugs.
`IS_HEROKU` isn't something that will differ between stage/prod, and is
not going to change any time soon. So let's have post_compile add it to
the dyno profile script, making it one less variable to remember to set
via the Heroku CLI/dashboard.
Quoting from:
http://blog.doismellburning.co.uk/2014/10/06/twelve-factor-config-misunderstandings-and-advice/
"12factor says your applications should read their config from the
environment; it has very little to say about how you populate the
environment – use whatever works for you".
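A minimal sketch of what post_compile would do (the script path is
illustrative):

    # Append the export to a .profile.d script so every dyno gets the
    # variable at startup, without needing `heroku config:set`.
    with open('.profile.d/treeherder.sh', 'a') as profile:
        profile.write('export IS_HEROKU=1\n')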
On Heroku, there is no load balancer or Varnish-like cache in front of
gunicorn, so we must handle gzipping responses in the app.
In order for WhiteNoise to serve gzipped static content, assets must be
gzipped on disk in advance (doing so on-demand in Python would not be
as performant). WhiteNoise will then serve the `.gz` version of a file
in preference to the original, if the client indicates gzip support.
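Pre-compressing an asset on disk boils down to something like this
(paths are illustrative):

    import gzip
    import shutil

    # Write index.html.gz alongside the original, so WhiteNoise can serve
    # it to gzip-capable clients.
    with open('dist/index.html', 'rb') as src:
        with gzip.open('dist/index.html.gz', 'wb') as dst:
            shutil.copyfileobj(src, dst)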
For assets covered by Django's collectstatic, gzipping the assets only
requires using WhiteNoise's GzipManifestStaticFilesStorage backend,
which wraps Django's ManifestStaticFilesStorage to create hashed+gzipped
versions of static assets:
http://whitenoise.evans.io/en/latest/django.html#add-gzip-and-caching-support
The collectstatic-generated files will then contain the file hash in
their filename, so WhiteNoise can also serve them with a large max-age
to avoid further requests if the file contents have not changed.
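For reference, enabling the backend is a one-line settings change (per
the WhiteNoise docs linked above):

    STATICFILES_STORAGE = 'whitenoise.django.GzipManifestStaticFilesStorage'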
For the UI files under `dist/`, we cannot rely on the Django storage
backend, since the directory isn't covered by STATICFILES_DIRS (it is
instead made known to WhiteNoise via `WHITENOISE_ROOT`). As such, files
under `dist/` are gzipped via an additional step during deployment. See:
http://whitenoise.evans.io/en/latest/base.html#gzip-support
Files whose extension is on the blacklist, or whose compressed version
is not more than 5% smaller than the original, are skipped.
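The skip logic is roughly the following (the 5% threshold is from the
text above; the extension list is illustrative):

    SKIP_EXTENSIONS = ('.gz', '.png', '.jpg', '.zip', '.woff')

    def should_gzip(filename, original_size, gzipped_size):
        if filename.endswith(SKIP_EXTENSIONS):
            return False
        # Only keep the .gz file if it is more than 5% smaller.
        return gzipped_size < original_size * 0.95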
The MPL 2.0 terms state that as long as a LICENSE file is present, the
per-file header text is not required. See "Exhibit A" at the end of:
https://www.mozilla.org/MPL/2.0/
After the previous commit, the Objectstore is effectively "dead code".
This commit removes all of that dead code, now that anything left over
in the Objectstore has been drained and added to the DB.
Since it only speeds up parsing by a few percent of total runtime, it's
not worth the added complexity for deployment and for the local
hack-test-debug cycle when working on the log parser.
The .gitignore and update.py entries will be removed in a later commit,
once the stage/prod src directories have been cleaned up.
Previously all three of the buildapi ETL ingestion tasks (pending,
running, 4hr) were run under one queue. In order to be able to tell
issues (such as backlogs or leaks) apart, these have now been split onto
their own queues.
On Heroku, these queues now also have a dyno each - which should mean we
can easily tell which is leaking and possibly also downgrade from the
expensive performance dyno even before the leak is fixed.
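Illustratively, the routing now looks something like this (the task
names are examples following the pattern above; exact Celery config
keys depend on the version in use):

    CELERY_ROUTES = {
        'fetch-buildapi-pending': {'queue': 'buildapi_pending'},
        'fetch-buildapi-running': {'queue': 'buildapi_running'},
        'fetch-buildapi-4hr': {'queue': 'buildapi_4hr'},
    }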
This introduces two new ways to generate ``Bug suggestions`` artifacts from
a ``text_log_summary`` artifact:
1. POST a ``text_log_summary`` on the ``/artifact`` endpoint.
2. POST a ``text_log_summary`` with a job on the ``/jobs`` endpoint.
Both of these cases will schedule an asynchronous task to generate the
``Bug suggestions`` artifact with ``celery``.
Artifact generation scenarios:
JobCollections
^^^^^^^^^^^^^^
Via the ``/jobs`` endpoint:
1. Submit a Log URL with no ``parse_status`` or ``parse_status`` set to "pending"
* This will generate ``text_log_summary`` and ``Bug suggestions`` artifacts
* Current *Buildbot* workflow
2. Submit a Log URL with ``parse_status`` set to "parsed" and a ``text_log_summary`` artifact
* Will generate a ``Bug suggestions`` artifact only
* Desired future state of *Task Cluster*
3. Submit a Log URL with ``parse_status`` of "parsed", with ``text_log_summary`` and ``Bug suggestions`` artifacts
* Will generate nothing
ArtifactCollections
^^^^^^^^^^^^^^^^^^^
Via the ``/artifact`` endpoint:
1. Submit a ``text_log_summary`` artifact
* Will generate a ``Bug suggestions`` artifact if it does not already exist for that job.
2. Submit ``text_log_summary`` and ``Bug suggestions`` artifacts
* Will generate nothing
* This is *Treeherder's* current internal log parser workflow
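As an illustration of case 1 against the ``/artifact`` endpoint (the
host, project and payload shape are examples, and authentication is
omitted):

    import json
    import requests

    artifact = {
        'job_guid': 'abc123',
        'name': 'text_log_summary',
        'type': 'json',
        'blob': json.dumps({'step_data': {'steps': []}}),
    }
    requests.post(
        'https://treeherder.mozilla.org/api/project/mozilla-inbound/artifact/',
        json=[artifact])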
I added a Procfile listing all the different Python services Treeherder needs.
Heroku provides deployment-specific settings via environment variables, so I had to modify the settings file to read them where that wasn't already the case. I created an environment variable, IS_HEROKU, which allows us to have a Heroku-only configuration where needed.
The db service is provided by Amazon RDS, which requires an SSL connection. To enable SSL in the MySQLdb Python client I had to modify Datasource (and bump the version used).
The cache service is provided by the MemCachier Heroku addon. Heroku recommends using pylibmc, so I set it up according to the docs here: https://devcenter.heroku.com/articles/memcachier#python.
The amqp service is provided by the CloudAMQP addon.
I added a post_compile script that runs every time we deploy. We should run every build step we require in there, like static asset minification, collection, etc.
To share the OAuth credentials among the various services I used an environment variable. I also added an option to export_project_credentials so that the credentials can be printed to stdout; this should come in handy when we need to update the environment-stored credentials with the ones in the db.
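The settings pattern is essentially (the variable name is from this
commit; the gated setting shown is just an example):

    import os

    IS_HEROKU = 'IS_HEROKU' in os.environ

    if IS_HEROKU:
        # Example of a Heroku-only setting; the real ones differ.
        SECURE_PROXY_SSL_HEADER = ('HTTP_X_FORWARDED_PROTO', 'https')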
We're no longer using the vendor directory & this script wasn't entirely
reliable anyway, so let's remove it. The virtualenv package can be
removed from dev.txt, since virtualenv is installed globally, and
nothing inside our virtualenv (which is where the packages in dev.txt
end up) needs a local installation of it.
In order to rsync the virtualenv as part of deployment, we'll need to
use |virtualenv --relocatable|. However this does not update the
activate scripts (making them unusable), and the changes it makes in the
virtualenv bin directory require that the virtualenv's python is first
in the path:
https://virtualenv.pypa.io/en/latest/userguide.html#making-environments-relocatable
As such, we must make the Treeherder runner scripts add the virtualenv
bin to PATH instead of using the activate pattern.
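In Python terms, the runner scripts do the equivalent of the following
(the virtualenv path is illustrative):

    import os

    VENV_BIN = '/data/www/treeherder/venv/bin'
    os.environ['PATH'] = VENV_BIN + os.pathsep + os.environ['PATH']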
The existing name "curr_dir" isn't overly clear - does it mean the
current working directory or the directory in which this script exists?
It should also be uppercase. I think the behaviour is clearer when
reworked like this :-)
There's a lot of other cleanup that could be done (eg reducing
duplication), but saving that for bug 1153971.
Unlike the Pulse publishing code, the code for consuming data from
Pulse is unused; it's a leftover from initial attempts to ingest
buildbot data via Pulse, rather than via builds-{4hr,running,pending}.js.
...into 'classification_mirroring' and 'publish_to_pulse'. This gives
greater visibility into the relative queue sizes of each, allows us to
move one of them to another node, and means we can pause consumption of
the classification_mirroring task while the Elasticsearch indices are
migrated in bug 1142538.
Use prefork scheduling instead of gevent scheduling, to avoid the issues
we've had with gevent - both zombie tasks and incompatibilities with
Python 2.7.9.
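For reference, the equivalent configuration is a one-liner (an
old-style Celery setting name; the same effect comes from passing
`-P prefork` to the worker):

    CELERYD_POOL = 'prefork'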
We want to start using peep in production, to alleviate security
concerns with the idea of auto-updating packages from PyPI on deploy.
As a first step, we switch to using peep in the Vagrant environment,
on Travis and in the Docker build - so we can confirm the hashes are
correct.
Close bug 1143350.
We're checking this in so we have a known good starting point in the
chain of trust. It also simplifies our deployment requirements.
peep.py was taken from:
https://github.com/erikrose/peep/archive/2.2.tar.gz
The only alteration made was the addition of the licence block at the
top of the file, taken from LICENCE in the peep repo.
Traceback (most recent call last):
File "./bin/download_logs.py", line 97, in <module>
download_logs(sys.argv[1])
File "./bin/download_logs.py", line 74, in download_logs
logrefs = job["job"]["log_references"]
TypeError: string indices must be integers, not str
And I don't think we have a need for it now anyway.