Occasional docker-compose errors will be easier to diagnose (#11835)

With this change we attempt to better diagnose some occasional
docker-compose network issues that have been plaguing us after
we solved or worked around other CI-related issues. Sometimes
the docker-compose jobs fail when checking whether the container
is up and running, with one of these two errors:

 * 'forward host lookup failed: Unknown host'
 * 'DNS fwd/rev mismatch'

Usually this happens in the rabbitmq and openldap containers.

Both indicate a problem with the DNS of the docker engine, or
possibly remnants of a previous docker run that prevent those
containers from starting.
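
Both messages appear as plain text in the job output, so a log-scanning
step could flag them explicitly. A minimal sketch, assuming a
hypothetical helper named `is_dns_failure` that is not part of this
commit:

```shell
# Hypothetical helper (not part of this commit): succeeds when the given
# log line contains one of the two DNS-related errors quoted above.
is_dns_failure() {
    case "$1" in
        *"forward host lookup failed: Unknown host"*) return 0 ;;
        *"DNS fwd/rev mismatch"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example use: label a suspicious line while reading a log.
line="rabbitmq: forward host lookup failed: Unknown host"
if is_dns_failure "${line}"; then
    echo "Likely docker DNS problem: ${line}"
fi
```

Matching on the message text keeps the check independent of which
container produced it.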

This change introduces a few improvements:

* added --volumes to the `docker system prune` command, which might
  clean up some anonymous volumes left by the containers between
  runs

* removed the `docker-compose down --remove-orphans -v` command
  that ran after a failure, since we always run it anyway a few
  lines earlier (before the test). With the containers left in
  place, our mechanism for dumping container logs after a failure
  will likely give us more information when the root cause is the
  rabbitmq or openldap container failing to start

* increased the number of tries from 3 to 5 in case of failed
  containers.
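
The retry behaviour described in the last two points can be sketched in
isolation. This is an illustrative helper only: `retry_on_start_failure`
and `RETRY_SLEEP_SECONDS` are hypothetical names, while the 254 exit
code and the 5-second sleep are taken from the script changed by this
commit.

```shell
# Hypothetical helper (not part of this commit): run a command up to
# max_tries times, retrying only when it exits with code 254.
retry_on_start_failure() {
    local max_tries="$1"
    shift
    local try_num=1
    local exit_code=0
    while [ "${try_num}" -le "${max_tries}" ]; do
        if "$@"; then
            exit_code=0
        else
            exit_code=$?
        fi
        # Retry only on 254 and never after the last attempt, so nothing
        # is torn down before the caller can collect container logs.
        if [ "${exit_code}" -eq 254 ] && [ "${try_num}" -ne "${max_tries}" ]; then
            echo "Failed try num ${try_num}. Sleeping for retry" >&2
            sleep "${RETRY_SLEEP_SECONDS:-5}"
            try_num=$((try_num + 1))
            continue
        fi
        break
    done
    return "${exit_code}"
}
```

A call such as `retry_on_start_failure 5 some_test_command` would then
return the exit code of the last attempt, whether or not a retry
happened.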
Jarek Potiuk 2020-10-26 17:21:21 +01:00, committed by GitHub
Parent a5d3176878
Commit 2f4a3d48a8
No known key found for this signature
GPG key ID: 4AEE18F83AFDEB23
2 changed files: 6 additions and 12 deletions


@@ -31,17 +31,17 @@ function run_airflow_testing_in_docker() {
 set +u
 set +e
 local exit_code
-for try_num in {1..3}
+for try_num in {1..5}
 do
 echo
-echo "Making sure docker-compose is down"
+echo "Making sure docker-compose is down and remnants removed"
 echo
 docker-compose --log-level INFO -f "${SCRIPTS_CI_DIR}/docker-compose/base.yml" \
 down --remove-orphans --volumes --timeout 10
 echo
 echo "System-prune docker"
 echo
-docker system prune --force
+docker system prune --force --volumes
 echo
 echo "Check available space"
 echo
@@ -70,15 +70,9 @@ function run_airflow_testing_in_docker() {
 echo "Delete kerberos network"
 kerberos::delete_kerberos_network
 fi
-if [[ ${exit_code} == 254 ]]; then
+if [[ ${exit_code} == "254" && ${try_num} != "5" ]]; then
 echo
-echo "Failed starting integration on ${try_num} try. Wiping-out docker-compose remnants"
-echo
-docker-compose --log-level INFO \
--f "${SCRIPTS_CI_DIR}/docker-compose/base.yml" \
-down --remove-orphans -v --timeout 5
-echo
-echo "Sleeping 5 seconds"
+echo "Failed try num ${try_num}. Sleeping 5 seconds for retry"
 echo
 sleep 5
 continue


@@ -21,5 +21,5 @@
 sudo swapoff -a
 sudo rm -f /swapfile
 sudo apt clean
-docker system prune --all
+docker system prune --all --force
 df -h