Merge pull request #2 from matthansen0/main

Cromwell on Azure README Draft Updates
This commit is contained in:
Joe Karasha 2021-04-22 14:08:25 -07:00 committed by GitHub
Parent ccd0a14297 ae48559e93
Commit 738c1bd5b7
No known key found for this signature
GPG key ID: 4AEE18F83AFDEB23
1 changed file: 27 additions and 28 deletions

@ -1,6 +1,6 @@
# Overview
Processing genomic data is not a monolithic task, instead it's broken down into smaller dependent tasks that are run using different tools. These tasks are usually chained together to form a pipeline that is then run using on-prem or cloud based clusters.
Processing genomic data is not a monolithic task, instead it's broken down into smaller dependent tasks that are run using different tools. These tasks are usually chained together to form a pipeline that is then run using on-prem, or cloud based clusters.
In order to ensure reproducibility and consistency in the pipeline output, some common best practices and standards have evolved over time. One of the most popular is the [GATK](https://gatk.broadinstitute.org/hc/en-us) from the [Broad Institute](https://www.broadinstitute.org/). The GATK captures several specific pipelines; for our discussion we'll focus on the two most commonly used: **Germline short variant discovery (SNVs + Indels)** and **Somatic short variant discovery (SNVs + Indels)**.
@ -18,10 +18,10 @@ Somatic
[Credit: Broad Institute][2]
# Cromwell on Azure..
As you can see from the illustrations above, processing genomic data is fairly complex task. To add to the complexity, these tools usually have their own compute and runtime dependencies. This complexity in process has been compounded by the massive increase in data as genomics becomes more extensively used in clinical, research and pharma settings.
# Cromwell on Azure
As you can see from the illustrations above, processing genomic data is a fairly complex task. To add to the complexity, these tools usually have their own compute and runtime dependencies. This complexity in process has been compounded by the massive increase in data as genomics becomes more extensively used in clinical, research and pharma settings.
Scaling these systems, making sure researchers have the right type and amount of compute when needed led to a need to decouple the workflow definition from the compute required to execute them. This led to the growth of sytems like [Cromwell](https://github.com/broadinstitute/cromwell) which came out of work at the [Broad Institute](https://www.broadinstitute.org/). **Cromwell** is an open-source Workflow Management System for bioinformatics.
Scaling these systems and making sure researchers have the right type and amount of compute when needed led to a need to decouple the workflow definition from the compute required to execute them. This led to the growth of systems like [Cromwell](https://github.com/broadinstitute/cromwell) which came out of work at the [Broad Institute](https://www.broadinstitute.org/). **Cromwell** is an open-source Workflow Management System for bioinformatics.
[Cromwell on Azure](https://github.com/microsoft/CromwellOnAzure#Cromwell-on-Azure) is an open source implementation of **Cromwell** that allows you to run it natively on Azure. **Cromwell on Azure** uses the [GA4GH](https://github.com/ga4gh/wiki/wiki) **Task Execution Service (TES)** backend. To make managing compute easier, **Cromwell on Azure** orchestrates dynamic provisioning of compute resources via [Azure Batch](https://azure.microsoft.com/en-us/services/batch/). As you scale up your workflows, the compute needed dynamically scales up to handle the increased load.
@ -31,7 +31,7 @@ Crowmwell on Azure
## Running Cromwell on Azure
Step-by-step links to setup **Cromwell on Azure** in your Azure environment. When this section is complete, you will have Cromwell running on your Azure environment and a test flow **Hello World WDL test** ran successfully.
This section includes step-by-step links to set up **Cromwell on Azure** in your Azure environment. When this section is complete, you will have Cromwell running in your Azure environment and the **Hello World WDL test** workflow will have run successfully.
- What is [Cromwell on Azure](https://github.com/microsoft/CromwellOnAzure#cromwell-on-azure)?
- Steps to [Deploy your instance of Cromwell on Azure](https://github.com/microsoft/CromwellOnAzure#deploy-your-instance-of-cromwell-on-azure)
@ -43,24 +43,24 @@ Step-by-step links to setup **Cromwell on Azure** in your Azure environment. Whe
Deployment takes ~20 minutes.
![Deployment Process](./../99-Images/cromwell-deploy.png)
When complete, you will see these resources in Azure,
When complete, you will see these resources in Azure.
![Cromwell Resources](./../99-Images/cromwell_resources.png)
- "Hello World" workflow is automatically run as a check. In your default storage account,
- Input files including `test.wdl`, `inputFile.txt`and `testInputs.json` are found in `inputs/test` container
- Output files are found in `cromwell-executions` container
- After completion, the trigger JSON will be in `workflows` container in `succeeded` directory.
- The "Hello World" workflow is automatically run as a check. In your default storage account:
- Input files, including `test.wdl`, `inputFile.txt`, and `testInputs.json`, are found in the `inputs/test` container.
- Output files are found in the `cromwell-executions` container.
- After completion, the trigger JSON will be in the `workflows` container, in the `succeeded` directory.
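
The check described above can be sketched as a small helper that, given a listing of blob names from the `inputs` container, reports which of the expected "Hello World" input files are missing. This is a minimal illustration only: the function name, the `test/` prefix convention, and the idea of passing in a plain list of names are assumptions, and obtaining the listing (via the Azure Portal, Azure Storage Explorer, or an SDK) is left out.

```python
from typing import Iterable, List

# File names taken from the text above.
EXPECTED_TEST_INPUTS = ("test.wdl", "inputFile.txt", "testInputs.json")


def missing_test_inputs(blob_names: Iterable[str], prefix: str = "test/") -> List[str]:
    """Return the expected test input files that are absent from the listing.

    `blob_names` is assumed to be a listing of blob names inside the `inputs`
    container, and `prefix` the virtual directory holding the test files.
    """
    present = {name[len(prefix):] for name in blob_names if name.startswith(prefix)}
    return [name for name in EXPECTED_TEST_INPUTS if name not in present]


if __name__ == "__main__":
    listing = ["test/test.wdl", "test/inputFile.txt"]
    print(missing_test_inputs(listing))
```

An empty result means all three input files named above were found under the assumed prefix.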
## Running Germline alignment and variant calling pipeline on Azure
Here is an example of running the germline alignment and variant calling pipeline, based on Best Practices [Genome Analysis Pipeline](https://github.com/microsoft/gatk4-genome-processing-pipeline-azure#germline-alignment-and-variant-calling-pipeline-on-azure) by Broad Institute of MIT and Harvard, on Cromwell on Azure.
Here is an example of running the germline alignment and variant calling pipeline on Cromwell on Azure. It is based on the Best Practices [Genome Analysis Pipeline](https://github.com/microsoft/gatk4-genome-processing-pipeline-azure#germline-alignment-and-variant-calling-pipeline-on-azure) by the Broad Institute of MIT and Harvard.
- Navigate to the germline Github with the above link
- Download `WholeGenomeGermlineSingleSample.trigger.json` trigger json file
- Start your workflow
- Navigate to the default storage account created above.
- In the `workflows` container, place the trigger json file `WholeGenomeGermlineSingleSample.trigger.json` in the `new` directory via Azure Portal or Azure Storage Explorer. This initiates a Cromwell workflow. In the trigger json file, `WorkflowUrl` points to the WDL file `WholeGenomeGermlineSingleSample.wdl` and `WorkflowinputsUrl` points to input file `WholeGenomeGermlineSingleSample.inputs.json`, both are in the same [Github](https://github.com/microsoft/gatk4-genome-processing-pipeline-azure#germline-alignment-and-variant-calling-pipeline-on-azure). These files could be added as-is or updated for your functionality to `inputs` container and trigger file updated to point to the `input` container.
- Navigate to the germline GitHub repository with the above link.
- Download the trigger JSON file `WholeGenomeGermlineSingleSample.trigger.json`.
- Start your workflow.
- Navigate to the default storage account created above.
- In the `workflows` container, place the trigger JSON file `WholeGenomeGermlineSingleSample.trigger.json` in the `new` directory via Azure Portal or Azure Storage Explorer. This initiates a Cromwell workflow. In the trigger JSON file, `WorkflowUrl` points to the WDL file `WholeGenomeGermlineSingleSample.wdl` and `WorkflowInputsUrl` points to the input file `WholeGenomeGermlineSingleSample.inputs.json`, both of which are in the same [GitHub repository](https://github.com/microsoft/gatk4-genome-processing-pipeline-azure#germline-alignment-and-variant-calling-pipeline-on-azure). These files can be used as-is, or modified for your needs and uploaded to the `inputs` container, with the trigger file updated to point to the `inputs` container.
- Break-down of the WDL file `WholeGenomeGermlineSingleSample.wdl`. This WDL pipeline implements data pre-processing and initial variant calling according to the GATK Best Practices for germline SNP and Indel discovery in human whole-genome data using 6 WDL files from the same Github: `UnmappedBamToAlignedBam.wdl, AggregatedBamQC.wdl, Qc.wdl, BamToCram.wdl, VariantCalling.wdl, GermlineStructs.wdl`. Within each of these WDL files are many sub WDL files.
- The workflow returns a workflow ID that is appended to the trigger JSON file name and transferred to the `inprogress` directory in the workflows container.
- The workflow returns a workflow ID that is appended to the trigger JSON file name and transferred to the `inprogress` directory in the workflows container.
- Once your workflow completes, you can view the output files of your workflow in the `cromwell-executions` container. 6 folders are created for the 6 import WDL files, and sub-folders within each for the sub-import WDL files and so on.
- Additional output files from the Cromwell endpoint, including metadata and the timing file, are found in the `outputs` container. The outputs.json file shows all outputs created and where they are stored. To learn more about Cromwell's metadata and timing information, visit the [Cromwell documentation](https://cromwell.readthedocs.io/en/stable/).
- To abort a workflow that is in-progress, navigate to `workflows` container, place an empty file in the `abort` virtual directory named cromwellID.json, where "cromwellID" is the Cromwell workflow ID you wish to abort.
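
The trigger-file steps above can be sketched in a few lines of Python. This is a hedged illustration, not the project's tooling: the helper names are hypothetical, the `example.com` URLs are placeholders, and the `WorkflowInputsUrl` key casing is an assumption that should be checked against the downloaded trigger file. The sketch only builds the JSON document and the blob names; uploading them to the `workflows` container is left to Azure Portal or Azure Storage Explorer, as described above.

```python
import json


def build_trigger(workflow_url: str, inputs_url: str) -> str:
    """Return a trigger document (JSON text) pointing at a WDL file and its inputs."""
    # Key names follow the text above; verify casing against a real trigger file.
    return json.dumps(
        {"WorkflowUrl": workflow_url, "WorkflowInputsUrl": inputs_url},
        indent=2,
    )


def new_blob_name(trigger_filename: str) -> str:
    """Blob name that starts a workflow: the trigger file under `new`."""
    return f"new/{trigger_filename}"


def abort_blob_name(workflow_id: str) -> str:
    """Blob name of the empty file that requests an abort for a workflow ID."""
    return f"abort/{workflow_id}.json"


if __name__ == "__main__":
    # Placeholder URLs; substitute the locations of your own WDL and inputs files.
    print(build_trigger(
        "https://example.com/WholeGenomeGermlineSingleSample.wdl",
        "https://example.com/WholeGenomeGermlineSingleSample.inputs.json",
    ))
    print(new_blob_name("WholeGenomeGermlineSingleSample.trigger.json"))
    print(abort_blob_name("cromwellID"))
```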
@ -68,21 +68,20 @@ Here is an example of running the germline alignment and variant calling pipelin
## Running Somatic short variant analysis pipeline on Azure
Here is an example of running the somatic short variant analysis pipeline, based on Best Practices [Genome Analysis Pipeline](https://github.com/microsoft/gatk4-somatic-snvs-indels-azure#somatic-short-variant-analysis-pipeline-on-azure) by Broad Institute of MIT and Harvard, on Cromwell on Azure.
Here is an example of running the somatic short variant analysis pipeline on Cromwell on Azure. It is based on the Best Practices [Genome Analysis Pipeline](https://github.com/microsoft/gatk4-somatic-snvs-indels-azure#somatic-short-variant-analysis-pipeline-on-azure) by the Broad Institute of MIT and Harvard.
- Navigate to the germline Github with the above link
- Download `mutect2.trigger.json` and `mutect2_pon.trigger.json` trigger json files
- Start your workflow
- Navigate to the default storage account created above.
- In the `workflows` container, place the trigger json files `mutect2.trigger.json` and `mutect2_pon.trigger.json` in the `new` directory via Azure Portal or Azure Storage Explorer. This initiates a Cromwell workflow. In the trigger `mutect2.trigger.json` file, `WorkflowUrl` points to the WDL file `mutect2.wdl` and `WorkflowinputsUrl` points to input file `mutect2.inputs.json`. In the trigger `mutect2_pon.trigger.json` file, `WorkflowUrl` points to the WDL file `mutect2_pon.wdl` and `WorkflowinputsUrl` points to input file `mutect2_pon.inputs.json`. All these are in the same [Github](https://github.com/microsoft/gatk4-somatic-snvs-indels-azure#somatic-short-variant-analysis-pipeline-on-azure). These files could be added as-is or updated for your functionality to `inputs` container and trigger file updated to point to the `input` container.
- Navigate to the somatic GitHub repository with the above link.
- Download the `mutect2.trigger.json` and `mutect2_pon.trigger.json` trigger JSON files.
- Start your workflow.
- Navigate to the default storage account created above.
- In the `workflows` container, place the trigger JSON files `mutect2.trigger.json` and `mutect2_pon.trigger.json` in the `new` directory via Azure Portal or Azure Storage Explorer. This initiates a Cromwell workflow. In the `mutect2.trigger.json` file, `WorkflowUrl` points to the WDL file `mutect2.wdl` and `WorkflowInputsUrl` points to the input file `mutect2.inputs.json`. In the `mutect2_pon.trigger.json` file, `WorkflowUrl` points to the WDL file `mutect2_pon.wdl` and `WorkflowInputsUrl` points to the input file `mutect2_pon.inputs.json`. All of these are in the same [GitHub repository](https://github.com/microsoft/gatk4-somatic-snvs-indels-azure#somatic-short-variant-analysis-pipeline-on-azure). These files can be used as-is, or modified for your needs and uploaded to the `inputs` container, with the trigger file updated to point to the `inputs` container.
- The WDL file `mutect2.wdl` runs GATK4 Mutect 2 on a single tumor-normal pair or on a single tumor sample, and performs additional filtering and functional annotation tasks. The WDL file `mutect2_pon.wdl` creates a Mutect2 panel of normals.
- The workflow returns a workflow ID that is appended to the trigger JSON file name and transferred to the `inprogress` directory in the workflows container.
- Once your workflow completes, you can view the output files of your workflow in the `cromwell-executions` container.
- Additional output files from the Cromwell endpoint, including metadata and the timing file, are found in the `outputs` container. The outputs.json file shows all outputs created and where they are stored. The trigger files each creates one vcf file and its index with primary filtering applied. To learn more about Cromwell's metadata and timing information, visit the [Cromwell documentation](https://cromwell.readthedocs.io/en/stable/).
- The workflow returns a workflow ID that is appended to the trigger JSON file name and transferred to the `inprogress` directory in the workflows container.
- Once your workflow completes, you can view the output files of your workflow in the `cromwell-executions` container.
- Additional output files from the Cromwell endpoint, including metadata and the timing file, are found in the `outputs` container. The outputs.json file shows all outputs created and where they are stored. The trigger files each create one vcf file and its index with primary filtering applied. To learn more about Cromwell's metadata and timing information, visit the [Cromwell documentation](https://cromwell.readthedocs.io/en/stable/).
- To abort a workflow that is in-progress, navigate to `workflows` container, place an empty file in the `abort` virtual directory named cromwellID.json, where "cromwellID" is the Cromwell workflow ID you wish to abort.
- [More details](https://github.com/microsoft/CromwellOnAzure/blob/master/docs/managing-your-workflow.md/#start-your-workflow) on starting the workflow.
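
Because a workflow's progress is reflected by which directory of the `workflows` container its trigger file sits in, a small helper can classify a blob name. The sketch below is illustrative only: the function name is hypothetical, it recognizes just the directories named in this document (`new`, `inprogress`, `succeeded`, `abort`), and it assumes the appended workflow ID is a UUID somewhere in the file name, which is not spelled out above.

```python
import re
from typing import Optional, Tuple

# Directories of the `workflows` container mentioned in this document.
KNOWN_STATES = {"new", "inprogress", "succeeded", "abort"}

# Assumption: the workflow ID appended to the trigger file name is a UUID.
UUID_RE = re.compile(r"[0-9a-fA-F]{8}-(?:[0-9a-fA-F]{4}-){3}[0-9a-fA-F]{12}")


def describe_blob(blob_name: str) -> Tuple[str, Optional[str]]:
    """Return (state, workflow_id) for a blob name in the `workflows` container.

    `state` is the leading virtual directory if it is one of KNOWN_STATES,
    otherwise "unknown". `workflow_id` is the first UUID found in the file
    name, or None when no ID has been appended.
    """
    directory, _, filename = blob_name.partition("/")
    state = directory if directory in KNOWN_STATES and filename else "unknown"
    match = UUID_RE.search(filename)
    return state, (match.group(0) if match else None)


if __name__ == "__main__":
    print(describe_blob("new/mutect2.trigger.json"))
    print(describe_blob(
        "inprogress/mutect2.trigger.123e4567-e89b-12d3-a456-426614174000.json"
    ))
```

The extracted ID is the value to use when naming the empty file placed in the `abort` virtual directory.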
## Additional Resources
- [Germline short variant discovery SNPs + Indels](https://gatk.broadinstitute.org/hc/en-us/articles/360035535932-Germline-short-variant-discovery-SNPs-Indels-)
- [Somatic short variant discovery SNPs + Indels](https://gatk.broadinstitute.org/hc/en-us/articles/360035894731-Somatic-short-variant-discovery-SNVs-Indels-)
- [Somatic short variant discovery SNVs + Indels](https://gatk.broadinstitute.org/hc/en-us/articles/360035894731-Somatic-short-variant-discovery-SNVs-Indels-)