Build/test runs can occasionally fail in a way that produces a significant
amount of log spam, resulting in log files that are hundreds of MB in size
each. This can cause log parsing backlogs, particularly when many jobs on
the same push fail in this way.
The log parser now checks the `Content-Length` of log files prior to
streaming them, and skips the download/parse if it exceeds the set
threshold. The frontend has been adjusted to display an appropriate
message explaining why the parsed log is not available.
The threshold has been set to 5MB, since:
* the 99th percentile of download size on New Relic was ~2.8MB:
https://insights.newrelic.com/accounts/677903/dashboards/339080
* `Content-Length` is the size of the log prior to decompression, and
the chronic logspam cases have been known to have compression ratios
of 20-50x, which would translate to an uncompressed size limit of
up to 250MB (which is already much larger than buildbot's former 50MB
uncompressed size limit).
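A minimal sketch of the size check described above (the constant and function names are hypothetical, not the actual implementation):

```python
MAX_LOG_SIZE_BYTES = 5 * 1024 * 1024  # the 5MB threshold discussed above

def skip_parse(headers):
    """Return True if the log's Content-Length exceeds the threshold.

    Logs that report no Content-Length are still downloaded and parsed.
    """
    length = headers.get("Content-Length")
    return length is not None and int(length) > MAX_LOG_SIZE_BYTES
```

The check runs against the response headers before any body bytes are streamed, so oversized logs cost one request rather than a multi-hundred-MB download.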
Previously, any exceptions raised whilst loading the expected output JSON
fixtures were suppressed, which made debugging the Python 3 test failures
harder than it needed to be.
The reason failures were suppressed was to allow the test to continue far
enough that the actual output could be saved to the fixture when creating
new tests. However, reordering `do_test()` has the same effect without the
need for the `load_exp()` try-except handling.
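A hypothetical sketch of the reorder (function names assumed, not taken from the actual test utilities): the actual output is written before the fixture is loaded, so a brand-new test still produces a fixture even if loading raises.

```python
import json

def do_test(actual, fixture_path):
    # Save the actual output first, so a missing/empty fixture no longer
    # needs a try-except around the load step when creating new tests.
    with open(fixture_path + ".actual", "w") as f:
        json.dump(actual, f)
    with open(fixture_path) as f:  # may raise; that is now acceptable
        expected = json.load(f)
    assert actual == expected
```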
The `requests` package's `iter_lines()` returns bytes by default.
We cannot pass `decode_unicode=True` to `iter_lines()`, since it would then
split on Unicode newlines (which can exist in test message output), so
instead we manually `.decode()` each line before parsing it.
Fixes the following exception under Python 3:
`TypeError: a bytes-like object is required, not 'str'`
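A sketch of the manual decoding (helper name hypothetical), wrapping the bytes lines yielded by `iter_lines()`:

```python
def iter_decoded_lines(byte_lines):
    """Decode each bytes line to str before parsing.

    decode_unicode=True is avoided because it makes iter_lines() split on
    Unicode newlines, which can appear in test message output.
    """
    for line in byte_lines:
        yield line.decode("utf-8")
```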
The test utility `load_exp()` had to be modified to no longer use append
mode when opening the expected output JSON files, in order to fix:
`json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)`
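A sketch of the fixed fixture loader (a simplified stand-in, not the actual `load_exp()` implementation): opening with append mode (`"a+"`) leaves the file position at end-of-file, so `json.load()` sees an empty stream and raises the `JSONDecodeError` above; plain read mode avoids this.

```python
import json

def load_exp(path):
    # Read mode rather than append mode: "a+" positions the stream at EOF,
    # which makes json.load() fail with "Expecting value: line 1 column 1".
    with open(path, "r") as f:
        return json.load(f)
```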
pytest treats classes whose names start with "Test" as test classes, so an
underscore prefix has been added to prevent warnings of the form:
```
WC1 .../test_detect_intermittents.py cannot collect test class 'TestFailureDetector' because it has a __init__ constructor
```
Switch to the new environment variable name `PULSE_PUSH_SOURCES`.
Keep old `publish-resultset-runnable-job-action` task name by creating a
method that points to `publish_push_runnable_job_action`.
This assumes that any line containing the words REFTEST ERROR is always
a failure. We may need to restrict this if there are passing tests that
produce that text in their output.
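A sketch of the detection (names hypothetical; the real parser likely folds this into a larger error-line regex):

```python
import re

# Treat any line mentioning "REFTEST ERROR" as a failure; this may need to
# be narrowed if passing tests ever emit that text.
REFTEST_ERROR_RE = re.compile(r"REFTEST ERROR")

def is_reftest_failure(line):
    return REFTEST_ERROR_RE.search(line) is not None
```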
Now that we're using requests, the log URL used to access the file
is consistent across environments (since it doesn't reference the local
directory structure), so we don't need to exclude it from comparisons.
Doing so also fixes incorrect line numbers in logs that have Windows
line endings (bug 1328880), hence the expected logview output has been
updated.
The tests have to be adjusted to use `responses`, since `requests` doesn't
support `file://` URLs.
This adds a FailureLine foreign key to TextLogError and duplicates the
best_classification and best_is_verified columns. Classifications are
still set on FailureLine and copied to the corresponding column on
TextLogError.
This is a prelude to future work which will remove classifications from
FailureLine altogether, so that all autoclassification data can be
retrieved from TextLogError.
Now that log information is in the main treeherder database, use the id
information there as the input to the log parsing and autoclassification
tasks.
As part of the same change, there is a mild refactor of the
autoclassification-related tasks to no longer go through the `call_command`
infrastructure, as this added unnecessary overhead.
Instead, generate the data when required. The return value is stored in
memcache for a day to ensure things stay responsive for the sheriffs
when classifying recent failures.
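The caching pattern can be sketched with an in-process stand-in for memcache (all names hypothetical; the real code would go through the Django cache backend):

```python
import time

ONE_DAY = 24 * 60 * 60  # TTL matching the "for a day" retention above

_cache = {}  # key -> (value, stored_at); stand-in for memcache

def get_or_compute(key, compute, ttl=ONE_DAY, now=time.time):
    """Return the cached value, recomputing it once the TTL has elapsed."""
    entry = _cache.get(key)
    if entry is not None and now() - entry[1] < ttl:
        return entry[0]
    value = compute()
    _cache[key] = (value, now())
    return value
```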
Upcoming changes in bug 1295380 will change log output for many tests
in TaskCluster resulting in lines being prefixed with a timestamp and
the task step name. Unless these line prefixes are handled, they
confuse the error parser.
This commit detects logs as coming from TaskCluster and strips their
line prefix accordingly.
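A sketch of the detection and stripping (the prefix shape and names here are assumptions, not the exact TaskCluster format):

```python
import re

# Hypothetical prefix shape: "[taskcluster 2016-10-03T18:00:00.000Z] real line"
PREFIX_RE = re.compile(r"^\[taskcluster[^\]]*\]\s*")

def is_taskcluster_log(first_line):
    """Detect a TaskCluster log by its prefixed first line."""
    return first_line.startswith("[taskcluster")

def strip_prefix(line):
    """Remove the timestamp/step prefix so the error parser sees the raw line."""
    return PREFIX_RE.sub("", line)
```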
This changes ingestion, the API endpoints, and the frontend to match
the new structure. For now we continue to store text_log_summary artifacts,
though they don't do anything anymore.