Bug 1486071 - Retry docker-image and packages tasks that fail during apt-get. r=dustin

When apt-get fails, it has a distinctive error code (100). Most of the
time, when apt-get fails, it's because of some network error, or
possibly some problem unpacking archives. When that happens, retrying
the task usually "fixes" the issue.

One of the (currently) most common causes of problems is
snapshot.debian.org not being available to some of the EC2 instances.

It would be possible to set things up so that we only retry when we
detect such a setup (checking the public IP of the instance against the
known list of problematic IPs), but that would possibly require wrapping
apt-get, or something along those lines, which is not entirely trivial to
do for the packages tasks, because they don't rely on docker images.

However, since there aren't many apt-get failures other than these,
and since there have been, historically, some intermittent apt-get
failures of a different nature that were solved by re-running the tasks,
it seems fair to just retry whenever apt-get fails.

One downside of the approach is that if for some reason a change to a
Dockerfile ends up mentioning a package that doesn't exist, that too
will result in multiple retries, which might be inconvenient, but
that's not something that's going to happen often.

Differential Revision: https://phabricator.services.mozilla.com/D11420

--HG--
extra : moz-landing-system : lando
Mike Hommey 2018-11-13 22:17:14 +00:00
Parent dc6ff756f2
Commit 951d78513a
3 changed files with 13 additions and 1 deletion

View File

@@ -179,6 +179,8 @@ def fill_template(config, tasks):
             'docker-in-docker': True,
             'taskcluster-proxy': True,
             'max-run-time': 7200,
+            # Retry on apt-get errors.
+            'retry-exit-status': [100],
         },
     }
     # Retry for 'funsize-update-generator' if exit status code is -1

View File

@@ -82,6 +82,8 @@ def docker_worker_debian_package(config, job, taskdesc):
         repo=docker_repo,
         dist=run['dist'],
         date=run['snapshot'][:8])
+    # Retry on apt-get errors.
+    worker['retry-exit-status'] = [100]
     add_artifacts(config, job, taskdesc, path='/tmp/artifacts')

View File

@@ -104,7 +104,15 @@ def post_to_docker(tar, api_path, **kwargs):
         elif 'stream' in data:
             sys.stderr.write(data['stream'])
         elif 'error' in data:
-            raise Exception(data['error'])
+            sys.stderr.write('{}\n'.format(data['error']))
+            # Sadly, docker doesn't give more than a plain string for errors,
+            # so the best we can do to propagate the error code from the command
+            # that failed is to parse the error message...
+            errcode = 1
+            m = re.search(r'returned a non-zero code: (\d+)', data['error'])
+            if m:
+                errcode = int(m.group(1))
+            sys.exit(errcode)
         else:
             raise NotImplementedError(repr(data))
     sys.stderr.flush()
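
The exit-code propagation in the last hunk can be tried in isolation. The
following is only a minimal sketch: the helper name
exit_code_from_docker_error and the sample error strings are hypothetical
and not part of the patch; only the regular expression and the fallback
value of 1 are taken from it.

```python
import re

# Same pattern as in the patch; compiled once for reuse.
_EXIT_CODE_RE = re.compile(r'returned a non-zero code: (\d+)')


def exit_code_from_docker_error(message, default=1):
    """Return the exit code embedded in a docker error string.

    Hypothetical helper (not in the patch). Falls back to `default`
    when the message does not carry an exit code, mirroring the
    `errcode = 1` fallback in the patch.
    """
    m = _EXIT_CODE_RE.search(message)
    return int(m.group(1)) if m else default


# Illustrative messages, assumed to resemble what docker reports.
apt_failure = ("The command '/bin/sh -c apt-get update' "
               "returned a non-zero code: 100")
other_failure = "some error without an exit code"

print(exit_code_from_docker_error(apt_failure))
print(exit_code_from_docker_error(other_failure))
```

With 'retry-exit-status': [100] set on the worker, only the first kind of
failure would be expected to trigger a retry; the second exits with 1 and
fails the task as before.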