Welcome to Cromwell on Azure
Latest release
Check the "Update Instructions" section in the version 2.3.0 release notes to learn how to update an existing Cromwell on Azure deployment to version 2.3.0. You can customize some parameters when updating. Please see these customization instructions, specifically the "Used by update" and "Comment" columns in the table.
Getting started
- What is Cromwell on Azure?
- Deploy Cromwell on Azure now using this guide
- A brief demo video on how to run workflows using Cromwell on Azure
Running workflows
- Prepare, start or abort your workflow using this guide
- Here is an example workflow to convert FASTQ files to uBAM files
- Have an existing WDL file that you want to run on Azure? Modify your existing WDL with these adaptations for Azure
- Want to run commonly used workflows? Find links to ready-to-use workflows here
- Want to see some examples of tertiary analysis or other genomics analysis? Find links to related projects here
Questions?
- See our Troubleshooting Guide for more information
- Known issues and work-arounds are documented here
If you are running into an issue and cannot find any information in the troubleshooting guide, please open a GitHub issue!
Cromwell on Azure
Cromwell is a workflow management system for scientific workflows, orchestrating the computing tasks needed for genomics analysis. Originally developed by the Broad Institute, Cromwell is also used in the GATK Best Practices genome analysis pipeline. Cromwell supports running scripts at various scales, including on your local machine, on a local computing cluster, and in the cloud.
Cromwell on Azure configures all Azure resources needed to run workflows through Cromwell on the Azure cloud, and uses the GA4GH TES backend for orchestrating the tasks that make up a workflow. The installation sets up a VM host to run the Cromwell server and uses Azure Batch to spin up virtual machines that run each task in a workflow. Cromwell workflows can be written using either the WDL or the CWL scripting languages. For examples of WDL scripts, see this 'Learn WDL' repository on GitHub. For examples of CWL scripts, see this 'CWL search result' on Dockstore.
Deploy your instance of Cromwell on Azure
Prerequisites
- You will need an Azure Subscription to deploy Cromwell on Azure.
- You must have the proper Azure role assignments to deploy Cromwell on Azure. To check your current role assignments, please follow these instructions. You must have one of the following combinations of role assignments:
  - Owner of the subscription
  - Contributor and User Access Administrator of the subscription
  - Owner of the resource group. Note: this level of access will result in a warning during deployment, and will not use the latest VM pricing data. Learn more. Also, you must specify the resource group name during deployment with this level of access (see below).
  - Note: if you only have Service Administrator as a role assignment, please assign yourself as Owner of the subscription.
- Install the Azure Command Line Interface (az cli), a command line experience for managing Azure resources.
- Run az login to authenticate with Azure.
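The role combinations listed above can be read as a small decision rule. The sketch below is only an illustration of that rule: the function name, the scope labels and the returned strings are hypothetical, and it does not query Azure or reproduce any check the deployer itself performs.

```python
# Hypothetical helper: encodes the role combinations from the prerequisites list.
# Scope labels and return values are illustrative, not Azure or deployer identifiers.
SUBSCRIPTION = "subscription"
RESOURCE_GROUP = "resource_group"


def deployment_access(roles):
    """Classify a collection of (role_name, scope) pairs.

    Returns "full" for either subscription-level combination,
    "resource-group-only" for Owner of the resource group, and
    "insufficient" otherwise (for example, Service Administrator alone).
    """
    roles = set(roles)
    if ("Owner", SUBSCRIPTION) in roles:
        return "full"
    if {("Contributor", SUBSCRIPTION),
            ("User Access Administrator", SUBSCRIPTION)} <= roles:
        return "full"
    if ("Owner", RESOURCE_GROUP) in roles:
        return "resource-group-only"
    return "insufficient"


print(deployment_access([("Owner", RESOURCE_GROUP)]))
```

With "resource-group-only" access, the ResourceGroupName parameter described later in this document has to be supplied at deployment time.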
Download the deployment executable
Download the required executable from Releases. Choose your runtime from win-x64, linux-x64, or osx-x64. On Windows machines, we recommend using the win-x64 runtime (deployment using the linux-x64 runtime via the Windows Subsystem for Linux is not supported).
Optional: build the executable yourself
Note: Build instructions are only provided for the latest release.
Linux
Prerequisites:
.NET Core 3.1 SDK for Linux. Get instructions for your Linux distro and version to install the SDK.
For example, instructions for Ubuntu 18.04 are available here and below for convenience:
wget https://packages.microsoft.com/config/ubuntu/18.04/packages-microsoft-prod.deb -O packages-microsoft-prod.deb
sudo dpkg -i packages-microsoft-prod.deb
sudo apt-get update && \
sudo apt-get install -y apt-transport-https && \
sudo apt-get update && \
sudo apt-get install -y dotnet-sdk-3.1
Windows
Prerequisites:
.NET Core 3.1 SDK for Windows. Get the executable and follow the wizard to install the SDK.
Recommended:
VS 2019
Build steps
- Clone the Cromwell on Azure repository
- Build the solution using dotnet build on bash or PowerShell. For Windows, you can choose to build and test using VS 2019
- Run tests using dotnet test on bash or PowerShell
- Publish the deploy-cromwell-on-azure project as a self-contained deployment with your target runtime identifier (RID) to produce the executable
Example
Linux: dotnet publish -r linux-x64
Windows: dotnet publish -r win-x64
Learn more about dotnet commands here
Run the deployment executable
- Linux and OS X only: assign execute permissions to the file by running the following command on the terminal: chmod +x <fileName>. Replace <fileName> with the correct name: deploy-cromwell-on-azure-linux or deploy-cromwell-on-azure-osx.app
- You must specify the following parameters:
  - SubscriptionId (required)
    - This can be obtained by navigating to the subscriptions blade in the Azure portal
  - RegionName (required)
    - Specifies the region you would like to use for your Cromwell on Azure instance. To find a list of all available regions, run az account list-locations on the command line or in PowerShell and use the desired region's "name" property for RegionName.
  - MainIdentifierPrefix (optional)
    - This string will be used to prefix the name of your Cromwell on Azure resource group and associated resources. If not specified, the default value of "coa" followed by random characters is used as a prefix for the resource group and all Azure resources created for your Cromwell on Azure instance. After installation, you can search for your resources using the MainIdentifierPrefix value.
  - ResourceGroupName (optional, required when you only have owner-level access of the resource group)
    - Specifies the name of a pre-existing resource group that you wish to deploy into.
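To pick a RegionName value, it can help to list only the "name" properties from the az account list-locations output. The sketch below assumes that output was saved as JSON text; the helper name is hypothetical and the two-entry sample is made up for illustration, with most real fields omitted.

```python
import json


def region_names(locations_json):
    """Return the sorted "name" values from az account list-locations JSON output."""
    return sorted(entry["name"] for entry in json.loads(locations_json))


# Abbreviated, made-up sample of the JSON array the command prints.
sample = """
[
  {"displayName": "West US 2", "name": "westus2"},
  {"displayName": "East US", "name": "eastus"}
]
"""
print(region_names(sample))  # ['eastus', 'westus2']
```

Any value in the returned list, such as westus2, is in the form RegionName expects.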
Run the following at the command line or terminal after navigating to where your executable is saved:
.\deploy-cromwell-on-azure.exe --SubscriptionId <Your subscription ID> --RegionName <Your region> --MainIdentifierPrefix <Your string>
Example:
.\deploy-cromwell-on-azure.exe --SubscriptionId 00000000-0000-0000-0000-000000000000 --RegionName westus2 --MainIdentifierPrefix coa
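If you script the deployment, the argument list can be assembled programmatically. This is a minimal sketch: the function name is hypothetical, and the --ResourceGroupName flag spelling is an assumption inferred from the parameter name and the pattern of the other flags shown above.

```python
def deploy_args(subscription_id, region_name,
                main_identifier_prefix=None, resource_group_name=None):
    """Build the argument list for the deployment executable.

    SubscriptionId and RegionName are required; the other two are optional.
    """
    if not subscription_id or not region_name:
        raise ValueError("SubscriptionId and RegionName are required")
    args = ["--SubscriptionId", subscription_id, "--RegionName", region_name]
    if main_identifier_prefix:
        args += ["--MainIdentifierPrefix", main_identifier_prefix]
    if resource_group_name:
        # Assumed flag spelling; needed when you only own the resource group.
        args += ["--ResourceGroupName", resource_group_name]
    return args


print(" ".join(deploy_args("00000000-0000-0000-0000-000000000000",
                           "westus2", main_identifier_prefix="coa")))
```

The returned list could be appended to the executable path and passed to subprocess.run; the sketch stops short of that so it can be inspected without starting a deployment.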
A test workflow is run to ensure successful deployment. If your Batch account does not have enough resource quotas, you will see an error while deploying. You can request more quotas by following these instructions.
Deployment, including a small test workflow, can take up to 25 minutes to complete. At installation, a user is created to allow managing the host VM with username "vmadmin". The password is randomly generated and shown during installation. You may want to save the username, password and resource group name to allow for advanced debugging later.
Prepare, start or abort a workflow using instructions here.
Cromwell on Azure deployed resources
Once deployed, Cromwell on Azure configures the following Azure resources:
- Host VM - runs Ubuntu 18.04 LTS and Docker Compose with four containers (Cromwell, MySQL, TES, TriggerService). Blobfuse is used to mount the default storage account as a local file system available to the four containers. Also created are an OS and data disk, network interface, public IP address, virtual network, and network security group. Learn more
- Batch account - The Azure Batch account is used by TES to spin up the virtual machines that run each task in a workflow. After deployment, create an Azure support request to increase your core quotas if you plan on running large workflows. Learn more
- Storage account - The Azure Storage account is mounted to the host VM using blobfuse, which enables Azure Block Blobs to be mounted as a local file system available to the four containers running in Docker. By default, it includes the following Blob containers - configuration, cromwell-executions, cromwell-workflow-logs, inputs, outputs, and workflows.
- Application Insights - This contains logs from TES and the Trigger Service to enable debugging.
- Cosmos DB - This database is used by TES, and includes information and metadata about each TES task that is run as part of a workflow.
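One quick sanity check after deployment is to compare the containers in the default storage account against the six default names listed above. The sketch below assumes you already have the container names as a list (for example, copied from the Azure Portal); the helper name is hypothetical.

```python
# Default Blob containers named in the resource list above.
DEFAULT_CONTAINERS = {
    "configuration",
    "cromwell-executions",
    "cromwell-workflow-logs",
    "inputs",
    "outputs",
    "workflows",
}


def missing_containers(existing):
    """Return the default container names that are absent from `existing`."""
    return sorted(DEFAULT_CONTAINERS - set(existing))


print(missing_containers(["configuration", "inputs", "outputs"]))
```

An empty result means all six default containers are present; extra containers you created yourself are ignored.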
All of these resources will be grouped under a single resource group in your account, which you can view on the Azure Portal. Note that your specific resource group name, host VM name and host VM password for username "vmadmin" are printed to the screen during deployment. You can store these for your future use, or you can reset the VM's password at a later date via the Azure Portal.
You can follow these steps if you wish to mount a different Azure Storage account that you manage or own to your Cromwell on Azure instance.
Connect to existing Azure resources I own that are not part of the Cromwell on Azure instance by default
Cromwell on Azure uses managed identities to allow the host VM to connect to Azure resources in a simple and secure manner.
At the time of installation, a managed identity is created and associated with the host VM.
Cromwell on Azure version 2.x
Since version 2.0, a user-managed identity is created with the name {resource-group-name}-identity in the deployment resource group.
Cromwell on Azure version 1.x
For version 1.x and below, a system managed identity is created. You can find the identity via the Azure Portal by searching for the VM name in Azure Active Directory, under "All Applications". Alternatively, you can use the Azure CLI show command as described here.
To allow the host VM to connect to custom Azure resources such as a Storage account or Batch account, use the Azure Portal or Azure CLI to find the managed identity of the host VM (if using Cromwell on Azure version 1.x) or the user-managed identity (if using Cromwell on Azure version 2.x and above), and add it as a Contributor to the required Azure resource.
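For version 2.x, the two Azure CLI steps described above could be scripted roughly as follows. This sketch only builds the command lines and does not run them. The Python function names are hypothetical, the identity name follows the {resource-group-name}-identity pattern stated earlier, and the resource ID passed as the scope is something you supply yourself.

```python
def identity_name(resource_group_name):
    """Name of the user-managed identity created by a version 2.x deployment."""
    return f"{resource_group_name}-identity"


def show_identity_cmd(resource_group_name):
    """az command that prints the principal ID of the deployment's identity."""
    return ["az", "identity", "show",
            "--name", identity_name(resource_group_name),
            "--resource-group", resource_group_name,
            "--query", "principalId", "--output", "tsv"]


def grant_contributor_cmd(principal_id, resource_id):
    """az command that adds the identity as Contributor on one Azure resource."""
    return ["az", "role", "assignment", "create",
            "--assignee", principal_id,
            "--role", "Contributor",
            "--scope", resource_id]


print(" ".join(show_identity_cmd("coa-example")))
```

Here "coa-example" is a placeholder resource group name. Returning argument lists rather than shell strings keeps each value separate if the commands are later passed to subprocess.run.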
For convenience, some configuration files are hosted on your Cromwell on Azure Storage account, in the "configuration" container: containers-to-mount and cromwell-application.conf. You can modify and save these files using the Azure Portal UI "Edit Blob" option or simply upload a new file to replace the existing one.
For these changes to take effect, be sure to restart your Cromwell on Azure VM through the Azure Portal UI or run sudo reboot.
Hello World WDL test
As part of the Cromwell on Azure deployment, a "Hello World" workflow is automatically run as a check. The input files for this workflow are found in the inputs container, and the output files can be found in the cromwell-executions container of your default storage account.
Once it runs to completion you can find the trigger JSON file that started the workflow in the workflows container in the succeeded directory, if it ran successfully.
Hello World WDL file:
task hello {
  String name
  command {
    echo 'Hello ${name}!'
  }
  output {
    File response = stdout()
  }
  runtime {
    docker: 'ubuntu:16.04'
  }
}
workflow test {
  call hello
}
Hello World inputs.json file:
{
  "test.hello.name": "World"
}
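The key in the inputs file is the input's fully qualified name: workflow name, task name, then input name, joined by dots. A minimal sketch of building such a file, using a hypothetical helper name, might look like this:

```python
import json


def wdl_inputs(workflow, task, values):
    """Prefix each task input with "<workflow>.<task>." as the inputs file expects."""
    return {f"{workflow}.{task}.{name}": value for name, value in values.items()}


# Reproduces the Hello World inputs shown above.
print(json.dumps(wdl_inputs("test", "hello", {"name": "World"}), indent=2))
```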
Hello World trigger JSON file as seen in your storage account's workflows container in the succeeded directory:
{
  "WorkflowUrl": "/<storageaccountname>/inputs/test/test.wdl",
  "WorkflowInputsUrl": "/<storageaccountname>/inputs/test/test.json",
  "WorkflowOptionsUrl": null,
  "WorkflowDependenciesUrl": null
}
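A trigger file with the same four fields can be generated rather than written by hand. The sketch below is an illustration under assumptions: the function and parameter names are hypothetical, paths are taken relative to the storage account as in the example above, and unused fields are written as null.

```python
import json


def trigger_document(storage_account, workflow_path, inputs_path=None,
                     options_path=None, dependencies_path=None):
    """Build the trigger JSON fields shown above from storage-relative paths."""
    if not workflow_path:
        raise ValueError("a workflow path is required")

    def url(path):
        # Unset fields stay None so that json.dumps writes them as null.
        return None if path is None else f"/{storage_account}/{path.lstrip('/')}"

    return {
        "WorkflowUrl": url(workflow_path),
        "WorkflowInputsUrl": url(inputs_path),
        "WorkflowOptionsUrl": url(options_path),
        "WorkflowDependenciesUrl": url(dependencies_path),
    }


doc = trigger_document("<storageaccountname>",
                       "inputs/test/test.wdl", "inputs/test/test.json")
print(json.dumps(doc, indent=2))
```

Called with the Hello World paths, this prints the same structure as the trigger file shown above.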
If your "Hello-World" test workflow or other workflows consistently fail, make sure to check your Azure Batch account quotas.
Run Common Workflows
Run Broad Institute of MIT and Harvard's Best Practices Pipelines on Cromwell on Azure:
Data pre-processing for variant discovery
Germline short variant discovery (SNPs + Indels)
Somatic short variant discovery (SNVs + Indels)
Variant-filtering with Convolutional Neural Networks
Sequence data format conversion