Analysis
Issues Addressed
- I submitted a job and it failed almost immediately. What happened to it?
- I can run small workflows, but not workflows that require multiple tasks with a large total CPU core requirement. How do I increase workflow capacity?
- How do I set up my own WDL to run on Cromwell?
- How can I see how far along my workflow has progressed?
- My workflow failed at task X. Where should I look to determine why it failed?
- Which tasks failed?
- Some tasks are stuck or my workflow is stuck in the "inprogress" directory in the "workflows" container. Were there Azure infrastructure issues?
- My jobs are taking a long time in the "Preparing" task state, even with smaller input files and VMs. Why is that?
Job failed immediately
If a workflow you start has a task that fails immediately and leads to workflow failure, first check your input JSON files. Follow the instructions here and check out an example WDL and inputs JSON file here to ensure there are no errors in how your inputs are defined.
For files hosted on an Azure Storage account that is connected to your Cromwell on Azure instance, the input path consists of three parts: the storage account name, the blob container name, and the file path with extension, in the following format:
/<storageaccountname>/<containername>/<blobName>
For example, a file in the "inputs" container of the storage account "msgenpublicdata" would be referenced as
"/msgenpublicdata/inputs/chr21.read1.fq.gz"
Another possibility is that you are trying to use a storage account that hasn't been mounted to your Cromwell on Azure instance - either by default during setup or by following these steps to mount a different storage account.
Check out these known issues and mitigations for other commonly seen issues caused by bugs we are actively tracking.
Check Azure Batch account quotas
If you are running a workflow task with a large CPU core requirement, check whether your Batch account has sufficient resource quota. You can request a quota increase by following these instructions.
For other resource quotas, such as active jobs or pools, Cromwell on Azure keeps tasks queued until resources become available, which may lead to longer wait times for workflow completion.
Set up my own WDL
To get started, you can view this Hello World sample or an example WDL that converts FASTQ to UBAM, or follow these steps to convert an existing public WDL written for other clouds to run on Azure.
There are also links to ready-to-try WDLs for common workflows here.
Instructions to write a WDL file for a pipeline from scratch are COMING SOON.
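In the meantime, a minimal WDL in the spirit of the Hello World sample might look like the sketch below. This is an illustrative sketch, not the linked sample itself; the workflow name, task name, Docker image, and runtime values are assumptions.

```wdl
version 1.0

workflow HelloWorld {
  call hello
  output {
    File helloOut = hello.out
  }
}

task hello {
  command {
    echo "Hello World"
  }
  output {
    File out = stdout()
  }
  runtime {
    # Assumed values; choose an image and resources appropriate for your task
    docker: "ubuntu:18.04"
    cpu: 1
    memory: "2 GB"
  }
}
```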
Check all tasks running for a workflow using the Batch account
Each task in a workflow starts an Azure Batch VM. To see currently active tasks, navigate to the Azure Batch account connected to your Cromwell on Azure instance in the Azure Portal. Click on "Jobs" and then search for the Cromwell workflowId to see all tasks associated with that workflow.
Find which tasks failed in a workflow
Cosmos DB stores information about all tasks in a workflow. To monitor or debug any workflow, you can query the database.
Navigate to your Cosmos DB instance on Azure Portal. Click on the "Data Explorer" menu item, click on the "TES" container, and select "Items".
You can get all tasks in a workflow that have not completed successfully using either of the following SQL queries, replacing workflowId with the id returned from Cromwell for your workflow:
```sql
SELECT * FROM c WHERE startswith(c.description, "workflowId") AND c.state != "COMPLETE"
```

or

```sql
SELECT * FROM c WHERE startswith(c.id, "<first 9 characters of the workflowId>") AND c.state != "COMPLETE"
```
Make sure there are no Azure infrastructure errors
When working with Cromwell on Azure, you may run into issues with the Azure Batch or Storage accounts, for instance a file path that cannot be found or a WDL workflow that failed for an unknown reason. For these scenarios, consider debugging and collecting more information using Application Insights.
Navigate to your Application Insights instance on Azure Portal. Click on the "Logs (Analytics)" menu item under the "Monitoring" section to get all logs from Cromwell on Azure's TES backend.
You can explore exceptions or logs to find the reason for failure and use time ranges or Kusto Query Language to narrow your search.
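As a minimal sketch, Kusto queries along the following lines (run one at a time) can surface recent failures. They assume the standard Application Insights exceptions and traces tables, and <workflowId> is a placeholder for your own workflow id:

```kusto
// Exceptions logged in the last 24 hours, newest first
exceptions
| where timestamp > ago(24h)
| order by timestamp desc

// Trace messages mentioning a specific workflow (replace the placeholder)
traces
| where timestamp > ago(24h) and message contains "<workflowId>"
| order by timestamp desc
```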
Check Azure Storage Tier
Cromwell on Azure uses blob storage containers and blobfuse to make your data accessible for processing. The blob storage access tier can have a noticeable effect on your analysis time, particularly on initial VM preparation. If you experience long preparation times, we recommend setting your access tier to "Hot" instead of "Cool". You can do this under the "Access tier" setting in the storage account's "Configuration" menu in the Azure Portal. NOTE: this only affects Gen2 storage accounts; Gen1 "Standard" blobs use the "Hot" access tier by default.