AMBROSIA on Kubernetes, README & Quick Start

This directory contains scripts for launching an AMBROSIA service on a Kubernetes cluster, and in particular, an Azure Kubernetes (AKS) cluster.

These scripts are launched from a development machine, which is currently assumed to be running Linux; in the future the same Bash scripts may also run on Windows or Mac. Running the scripts builds Docker containers locally before pushing them to the cloud, so the development machine needs a few prerequisites (a quick way to check for them is shown after the list):

  • Azure CLI 2.0
  • Kubernetes CLI (kubectl)
  • Docker command line tools
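
If you are unsure whether these tools are installed, a quick sanity check from your shell might look like the following (the exact output and versions will vary):

# Verify that the prerequisite command-line tools are available:
az --version              # Azure CLI (these scripts assume the 2.0 series)
kubectl version --client  # Kubernetes CLI
docker --version          # Docker command-line tools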

The main entrypoint is "run-end-to-end-test.sh", which is designed to be modified to suit your application. It is initially configured to build and deploy the InternalImmortals/PerformanceTestInterruptable application, which consists of two pods that communicate with each other to test the performance of the RPC channel.

The other scripts in this directory automate various aspects of deploying on AKS, including the authentication steps. They are designed to be used with a fresh Azure resource group.

Step 1: configure your deployment

Use the provided template:

cp Defs/AmbrosiaAKSConf.sh.template Defs/AmbrosiaAKSConf.sh
$EDITOR Defs/AmbrosiaAKSConf.sh

Fill in your Azure subscription identifier and choose the name of a new resource group, which will be created for you. (It is best to isolate this deployment from other Azure resources you may have running.)
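
The authoritative list of settings is in the template itself; purely as an illustration (the variable names below are hypothetical, not necessarily the ones used in Defs/AmbrosiaAKSConf.sh.template), a filled-in configuration might look roughly like this:

# Hypothetical sketch of a filled-in Defs/AmbrosiaAKSConf.sh -- consult the
# template for the real variable names and any additional required fields.
AZURE_SUBSCRIPTION=00000000-0000-0000-0000-000000000000   # your Azure subscription ID
AZURE_RESOURCE_GROUP=my-ambrosia-aks-test                 # a NEW resource group, created by the scripts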

Step 2: provision and run

./run-end-to-end-test.sh

That's it! The first time this script runs it will take a long time to provision the storage account, container registry, file share, and Kubernetes cluster. Running it again is safe, and it will run faster once these things have already been created.
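
If you want to confirm what was provisioned, the Azure CLI can list everything in the resource group (substitute the resource group name you chose in Step 1 for the placeholder below):

# List the Azure resources created so far (resource group name is a placeholder):
az resource list --resource-group my-ambrosia-aks-test --output table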

At the end of all that, kubectl get pods should show two running pods. Note that while the steps taken by ./run-end-to-end-test.sh should be idempotent, the script does not provide safe "incremental builds" when things change. Therefore you should take it upon yourself to clean up whenever you modify the Azure/AKS-related configuration (e.g. the contents of AmbrosiaAKSConf.sh):

./Clean-AKS.sh <all|most|auth>

In contrast, changes to the application logic -- confined to the application Dockerfile and the resulting Docker container -- do not require cleaning. The application can be redeployed onto the previously provisioned Azure resources simply by rerunning the run-end-to-end-test script.

Note that if you are RERUNNING this PerformanceTestInterruptable or another application, you typically want to delete the logs stored in the Azure Files SMB share (mounted at /ambrosia_logs/ by default). You can do this manually (one option is sketched below), or run ./Clean-AKS.sh most, which is sufficient to remove them.
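
If you would rather clear the logs by hand than run ./Clean-AKS.sh most, one option is the Azure CLI's file-share commands; this is only a sketch, and the share and storage-account names are placeholders for whatever your configuration created:

# Delete the contents of the Azure Files share that holds the AMBROSIA logs
# (share and storage-account names are placeholders):
az storage file delete-batch --source my-ambrosia-logs-share --account-name myambrosiastorage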

Step 3: viewing the output

Use the name shown in kubectl get pods to print the logs of the client container:

kubectl logs -f generated-perftestclient-******* 
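
Because the generated suffix changes with each deployment, it can be convenient to look the pod name up instead of typing it; a small sketch, assuming the client pod name still begins with generated-perftestclient, is:

# Follow the client logs without typing the generated pod-name suffix:
kubectl logs -f $(kubectl get pods -o name | grep generated-perftestclient)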

Eventually this will show lines of output that contain performance measurements:

*X* 65536       0.0103833476182065
*X* 32768       0.00985864082367517

These show throughput in GiB/s for each message size. When you're done, you can run ./Clean-AKS.sh all to delete the entire resource group (or do it yourself in the Azure web portal).
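
If you only care about the measurement lines, you can filter the log for them; this is just a convenience, assuming the "*X*" prefix shown above:

# Print only the throughput measurement lines from the client log:
kubectl logs generated-perftestclient-******* | grep -F '*X*'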

Step 4: testing virtual resiliency, aka IMMORTALITY

Ambrosia, being the nectar of the gods, confers immortality on the processes running on it. What this means is that if you kill a container/pod, it will be able to restart and use durable storage to recover exactly the state it was in before.

In order to demonstrate this with the sample application, we will emulate a crashed machine by killing the client application. Start the application and then, while it is running, issue these commands to open a shell inside the container and kill the client:

kubectl exec -it generated-perftestclient-******* bash 
kill $(pidof Job)

If you watch kubectl get pods --watch, you will see the container go down as soon as the main executable (Job) is killed, and Kubernetes automatically attempting to restart it.
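
To observe this from another terminal, run the watch command mentioned above; the RESTARTS column will tick up as Kubernetes brings the container back:

# Watch pod status transitions; the RESTARTS count increases as the client recovers:
kubectl get pods --watch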

(Note that perftestclient may actually take several tries to come back up, because quick restart attempts in the same pod hit a System.Net.Sockets.SocketException: Address already in use error. A real failure that brought down a machine, and thus one or more pods, would not have this problem.)
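
If you want to inspect what happened during one of those failed attempts, kubectl can show the output of the previously terminated container instance (assuming at least one restart has occurred):

# Print the output of the last terminated container; the SocketException above
# may appear here if a restart attempt failed:
kubectl logs --previous generated-perftestclient-*******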