* cleaning up files which are no longer needed

* fixes after removing forking workflow (#322)

* PR to resolve merge issues

* updated main build as well

* added ability to read in git branch name directly

* manually updated the other files

* fixed number of classes for main build tests (#327)

* fixed number of classes for main build tests

* corrected DATASET.ROOT in builds

* added dev build script

* Fixes for development inside the docker container (#335)

* Fix the mount command for the HRNet pretrained model in the docker readme

* Properly catch InvalidGitRepository exception

* make repo paths consistent with non-docker runs -- this way configs paths do not need to be changed

* Properly catch InvalidGitRepository exception in train.py
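
A minimal sketch of the kind of guard described above, using GitPython's `Repo` and `InvalidGitRepositoryError`; the helper name and fallback value are illustrative, not the repo's actual code:

```python
# Sketch: read the git branch name but fall back gracefully when the code is
# run outside a git checkout (e.g. inside a docker container).
from git import Repo, InvalidGitRepositoryError

def git_branch_or_default(default="unknown"):
    try:
        return Repo(search_parent_directories=True).active_branch.name
    except (InvalidGitRepositoryError, TypeError):
        # GitPython raises TypeError for a detached HEAD
        return default
```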

* Readme update (#337)

* README updates

* Removing user specific path from config

Authored-by: Fatemeh Zamanian <Fatemeh.Zamanian@microsoft.com>

* Fixing #324 and #325 (#338)

* update colormap to a non-discrete one -- fixes #324

* fix mask_to_disk to normalize by n_classes
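
For illustration, a minimal sketch of writing a mask to disk normalized by the number of classes (it mirrors the `mask_to_disk` change shown further down in the diff; the function name here is only a stand-in):

```python
import numpy as np
from PIL import Image
from matplotlib import pyplot as plt

def save_mask(mask: np.ndarray, fname: str, n_classes: int, cmap_name: str = "rainbow"):
    # dividing by n_classes keeps colors consistent across images,
    # regardless of which class labels happen to appear in a given mask
    cmap = plt.get_cmap(cmap_name)
    Image.fromarray(cmap(mask / n_classes, bytes=True)).save(fname)
```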

* changes to test.py

* Updating data.py

* bug fix

* increased timeout time for main_build

* retrigger build

* retrigger the build

* increase timeout

* fixes 318 (#339)

* finished 318

* increased checkerboard test timeout

* fix 333 (#340)

* added label correction to train gradient

* changing the gradient data generator to take inline/crossline argument, consistent with the patch loader
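
A toy sketch of a synthetic gradient volume that can vary along either the inline or crossline axis, as the bullet above describes; shape and value range are illustrative assumptions:

```python
import numpy as np

def gradient_volume(shape=(100, 100, 100), direction="inline") -> np.ndarray:
    # linear ramp along the chosen axis, broadcast across the other two
    n_inlines, n_crosslines, _ = shape
    if direction == "inline":
        ramp = np.linspace(0.0, 1.0, n_inlines)[:, None, None]
    else:  # "crossline"
        ramp = np.linspace(0.0, 1.0, n_crosslines)[None, :, None]
    return np.broadcast_to(ramp, shape).copy()
```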

* changing variable name to be more descriptive


Co-authored-by: maxkazmsft <maxkaz@microsoft.com>

* bug fix to model predictions (#345)

* replace hrnet with seresnet in experiments - provides stable default model (#343)

* PR to fix #342 (#347)

* intermediate work for normalization

* 1) normalize function runs based on global MIN and MAX, 2) has error handling for division by zero via np.finfo, 3) decode_segmap normalizes the label/mask based on n_classes
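
A sketch of the global-range normalization described in this bullet, mirroring the `normalize` change shown further down in the diff; `MIN` and `MAX` are assumed to come from the dataset config rather than the individual image:

```python
import numpy as np

def normalize(array: np.ndarray, MIN: float, MAX: float) -> np.ndarray:
    # normalize by the global data range; np.finfo(float).eps guards against
    # division by zero when MIN == MAX
    den = MAX - MIN
    if den == 0:
        den += np.finfo(float).eps
    return (array - MIN) / den
```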

* global normalization added to test.py

* increasing the threshold on timeout

* trigger

* revert

* idk what happened

* increase timeout

* picking up global min and max

* passing config to TrainPatchLoader to facilitate access to global min and max and other attr in low level functions, WIP

* removed print statement

* changed section loaders

* updated test for min and max from config too

* added MIN and MAX to config

* notebook modified for loaders

* another dataloader in notebook

* readme update

* changed the default values for min max, updated the docstring for loaders, removed suppressed lines

* debug

* merging work from CSE team into main staging branch (#357)

* Adding content to interpretation README (#171)

* added sharat, weehyong to authors

* adding a download script for Dutch F3 dataset

* Adding script instructions for dutch f3

* Update README.md

prepare scripts expect root level directory for dutch f3 dataset. (it is downloaded into $dir/data by the script)

* Adding readme text for the notebooks and checking if config is correctly setup

* fixing prepare script example

* Adding more content to interpretation README

* Update README.md

* Update HRNet_Penobscot_demo_notebook.ipynb

Co-authored-by: maxkazmsft <maxkaz@microsoft.com>

* Updates to prepare dutchf3 (#185)

* updating patch to patch_size when we are using it as an integer

* modifying the range function in the prepare_dutchf3 script to get all of our data

* updating path to logging.config so the script can locate it

* manually reverting back log path to troubleshoot build tests

* updating patch to patch_size for testing on preprocessing scripts

* updating patch to patch_size where applicable in ablation.sh

* reverting back changes on ablation.sh to validate build pass

* update patch to patch_size in ablation.sh (#191)

Co-authored-by: Sharat Chikkerur <sharat.chikkerur@gmail.com>

* TestLoader's support for custom paths (#196)

* Add testloader support for custom paths.

* Add test

* added file name workaround for Train*Loader classes

* adding comments and clean up

* Remove legacy code.

* Remove parameters that don't exist in init() from documentation.

* Add unit tests for data loaders in dutchf3

* moved unit tests

Co-authored-by: maxkazmsft <maxkaz@microsoft.com>

* select contiguous data splits for val and train (#200)

* select contiguous data splits for test and train

* changed data-dir to data_dir as arg to prepare_dutchf3.py

* update script with new required parameter label_file

* ignoring split_alaudah_et_al_19 as it is not updated

* changed TEST to VALIDATION for clarity in the code

* included job to run scripts unit test

* Fix val/train split and add tests

* adjust to consider the whole horz_lines
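
A toy sketch of a contiguous train/validation split, assuming the last fraction of lines is held out as a single contiguous block; the real prepare script's logic is more involved:

```python
import numpy as np

def contiguous_split(n_lines: int, val_fraction: float = 0.2):
    # keep the last val_fraction of the lines as one contiguous validation
    # block instead of sampling them at random from inside the training area
    cut = int(n_lines * (1 - val_fraction))
    return np.arange(0, cut), np.arange(cut, n_lines)
```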

* update environment - gitpython version

* Segy Converter Utility (#199)

* Add convert_segy utility script and related notebooks

* add segy files to .gitignore

* readability update

* Create methods for normalizing and clipping separately.
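
A hedged sketch of a clipping step kept separate from normalization, assuming a k-sigma rule; the converter's actual clipping criterion may differ:

```python
import numpy as np

def clip_to_k_sigma(cube: np.ndarray, k: float = 3.0) -> np.ndarray:
    # clamp extreme amplitudes to within k standard deviations of the mean
    mu, sigma = cube.mean(), cube.std()
    return np.clip(cube, mu - k * sigma, mu + k * sigma)
```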

* Add comment

* update file paths

* cleanup tests and terminology for the normalization/clipping code

* update notes to provide more context for using the script

* Add tests for clipping.

* Update comments

* added Microsoft copyright

* Update root README

* Add a flag to turn on clipping in dataprep script.

* Remove hard coded values and fix _filder_data method.

* Fix some minor issues pointed out on comments.

* Remove unused lib.

* Rename notebooks to impose order; set env; move all function definitions into utils; improve comments in notebooks; and include code example to run prepare_dutchf3.py

* Label missing data with 255.
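
A small sketch of the idea, assuming 255 is used as the ignore label when padding a label volume out to a full rectangle (255 is also the ignore value used by the logging code further down in the diff):

```python
import numpy as np

def pad_labels(labels: np.ndarray, target_shape: tuple, fill_value: int = 255) -> np.ndarray:
    # everything outside the original data extent is marked with the ignore value
    out = np.full(target_shape, fill_value, dtype=labels.dtype)
    out[tuple(slice(0, s) for s in labels.shape)] = labels
    return out
```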

* Remove cell with --help command.

* Add notebooks to test pipeline.

* grammar edits

* update notebook output and utils naming

* fix output dir error and cleanup notebook

* fix yaml indent error in notebooks_build.yml

* fix merge issues and job name errors

* debugging the build pipeline

* combine notebook tests for segy converter since they are dependent on each other

Co-authored-by: Geisa Faustino <32823639+GeisaFaustino@users.noreply.github.com>

* Azureml train pipeline (#195)

* initial add of azure ml pipeline

* update references and dependencies

* fix integration tests

* remove incomplete tests

* add azureml requirements.txt for dutchf3 local patch and update pipeline config

* add empty __init__.py to cv_lib dutchf3

* Get train.py to run in pipeline

* allow output dir in train.py

* Clean up README and __init__

* only pass output if available and use input dir for output in train.py

* update comment in train.py

* updating azureml_requirements to only pull from /master

* removing windows guidance in azureml_pipelines/README.md

* adding .env.example

* adding azureml config example

* updating documentation in azureml_pipelines README.md

* updating main README.md to refer to AML guidance documentation

* updating AML README.md to include additional guidance to cancel runs

* adding documentation on AzureML pipelines in the AML README.me

* adding files needed section for AML training run

* including hyperlink pointing to additional detail on Azure Machine Learning pipelines in AML README.md

* removing the mention of VSCode in the AML README.md

* fixing typo

* modifying config to pipeline configuration in README.md

* fixing typo in README.md

* adding documentation on how to create a blob container and copy data onto it

* adding documentation on blob storage guidance

* adding guidance on how to get the subscription id

* adding guidance to activate environment and then run the kick off train pipeline from ROOT

* adding ability to pass in experiment name and different pipeline configuration to kickoff_train_pipeline.py

* adding Microsoft Corporation Copyright to kickoff_train_pipeline.py

* fixing format in README.md

* adding troubleshooting section in README.md for connection to subscription

* updating troubleshooting title

* adding guidance on how to download the config.json from the Azure Portal in the README.md

* adding additional guidance and information on AzureML compute targets and naming conventions

* changing the configuration file example to only include the train step that is currently supported

* updating config to pipeline configuration when applicable

* adding link to Microsoft docs for additional information on pipeline steps

* updated AML test build definitions

* updated AML test build definitions

* adding job to aml_build.yml

* updating example config for testing

* modifying the test_train_pipeline.py to have appropriate number of pipeline steps and other required modifications

* updating AML_pipeline_tests in aml_build.yml to consume environment variables

* updating scriptType, scriptLocation, and inlineScript in aml_build.yml

* trivial commit to re-trigger broken build pipelines

* fix to aml yml build to use env vars for secrets and everything else

* another yml fix

* another yml fix

* reverting structure format of jobs for aml_build pipeline tests

* updating path to test_train_pipeline.py

* aml_pipeline_tests timed out, extending timeoutInMinutes from 10 to 40

* adding additional pytest

* adding az login

* updating variables in aml pipeline tests

Co-authored-by: Anna Zietlow <annamzietlow@gmail.com>
Co-authored-by: maxkazmsft <maxkaz@microsoft.com>

* moved contrib contributions around from CSE

* fixed dataloader tests - updated them to work with new code from staging branch

* segyconverter notebooks and tests run and pass; updated documentation

* added test job for segy converter notebooks

* removed AML training pipeline from this release

* fixed training model tolerance precision in the tests - wasn't working

* fixed train.py build issues after the merge

* addressed PR comments

* fixed bug in check_performance

Co-authored-by: Sharat Chikkerur <sharat.chikkerur@microsoft.com>
Co-authored-by: kirasoderstrom <kirasoderstrom@gmail.com>
Co-authored-by: Sharat Chikkerur <sharat.chikkerur@gmail.com>
Co-authored-by: Geisa Faustino <32823639+GeisaFaustino@users.noreply.github.com>
Co-authored-by: Ricardo Squassina Lee <8495707+squassina@users.noreply.github.com>
Co-authored-by: Michael Zawacki <mikezawacki@hotmail.com>
Co-authored-by: Anna Zietlow <annamzietlow@gmail.com>

* make tests simpler (#368)

* removed Dutch F3 job from main_build

* fixed a bug in data subset in debug mode

* modified epoch numbers to pass the performance checks, checked out check_performance from Max's branch

* modified get_data_for_builds.sh to set up checkerboard data for smaller size, minor improvements on gen_checkerboard
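
A toy sketch of the kind of synthetic checkerboard volume the build tests train on; block size and volume shape are arbitrary assumptions, and the actual generator script takes more options:

```python
import numpy as np

def checkerboard_volume(shape=(100, 100, 100), block: int = 10) -> np.ndarray:
    # alternate 0/1 blocks of size `block` along every axis
    idx = np.indices(shape) // block
    return (idx.sum(axis=0) % 2).astype(np.uint8)
```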

* send all the batches, disabled the performance checks for patch_deconvnet

* added comment to enable tests for patch_deconvnet after debugging, renamed gen_checkerboard, added options to new arg per Max's suggestion

* Replace HRNet with SEResNet model in the notebook (#362)

* replaced HRNet with SEResNet model in the notebook

* removed debugging cell info

* fixed bug where resnet_unet model wasn't loading the pre-trained version in the notebook

* fixed build VM problems

* Multi-GPU training support (#359)

* Data flow tests (#375)

* renamed checkerboard job name

* restructured default outputs from test.py to be dumped under output dir and not debug dir

* test.py output re-org

* removed outdated variable from check_performance.py

* intermediate work

* intermediate work

* bunch of intermediate work

* changing args for different trainings

* final to run dev_build

* remove print statements

* removed print statement

* removed suppressed lines

* added assertion error msg

* added assertion error msg, one intentional bug to test

* testing a stupid bug

* debug

* omg

* final

* trigger build

* fixed multi-GPU termination in train.py (#379)

* PR to fix #371 and #372  (#380)

* added learning rate to logs

* changed epoch for patch_deconvnet, and enabled the tests

* removed TODOs

* changed tensorflow pinned version (#387)

* changed tensorflow pinned version

* trigger build

* closes 385 (#389)

* Fixing #259 by adding symmetric padding along depth direction  (#386)
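
A minimal sketch of symmetric padding along the depth direction, assuming depth is the last axis of the volume:

```python
import numpy as np

def pad_depth(volume: np.ndarray, pad: int) -> np.ndarray:
    # mirror values at the top and bottom of the depth (last) axis only
    return np.pad(volume, ((0, 0), (0, 0), (pad, pad)), mode="symmetric")
```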

* BYOD Penobscot (#390)

* minor updates to files

* added penobscot conversion code

* docker build test (#388)

* added a new job to test building the docker, for now it is daisy-chained to the end

* this is just a TEST

* test

* test

* remove old image

* debug

* debug

* test

* debug

* enabled all the jobs

* quick fix

* removing non-tagged images

Co-authored-by: maxkazmsft <maxkaz@microsoft.com>

* added missing license headers and fixed formatting (#391)

* added missing license headers and fixed formatting

* some more license headers

* updated documentation to close 354 and 381 (#392)

* fix test.py and notebook issues (#394)

* resolved conflicts for 0.2 release (#396)

* V00.01.00003 release (#356)

* cleaning up files which are no longer needed

* fixes after removing forking workflow (#322)

* PR to resolve merge issues

* updated main build as well

* added ability to read in git branch name directly

* manually updated the other files

* fixed number of classes for main build tests (#327)

* fixed number of classes for main build tests

* corrected DATASET.ROOT in builds

* added dev build script

* Fixes for development inside the docker container (#335)

* Fix the mount command for the HRNet pretrained model in the docker readme

* Properly catch InvalidGitRepository exception

* make repo paths consistent with non-docker runs -- this way configs paths do not need to be changed

* Properly catch InvalidGitRepository exception in train.py

* Readme update (#337)

* README updates

* Removing user specific path from config

Authored-by: Fatemeh Zamanian <Fatemeh.Zamanian@microsoft.com>

* Fixing #324 and #325 (#338)

* update colormap to a non-discrete one -- fixes #324

* fix mask_to_disk to normalize by n_classes

* changes to test.py

* Updating data.py

* bug fix

* increased timeout time for main_build

* retrigger build

* retrigger the build

* increase timeout

* fixes 318 (#339)

* finished 318

* increased checkerboard test timeout

* fix 333 (#340)

* added label correction to train gradient

* changing the gradient data generator to take inline/crossline argument, consistent with the patch loader

* changing variable name to be more descriptive


Co-authored-by: maxkazmsft <maxkaz@microsoft.com>

* bug fix to model predictions (#345)

* replace hrnet with seresnet in experiments - provides stable default model (#343)

Co-authored-by: yalaudah <yazeed.alaudah@microsoft.com>
Co-authored-by: Fatemeh <fazamani@microsoft.com>

* typos

Co-authored-by: yalaudah <yazeed.alaudah@microsoft.com>
Co-authored-by: Fatemeh <fazamani@microsoft.com>

Co-authored-by: yalaudah <yazeed.alaudah@microsoft.com>
Co-authored-by: Fatemeh <fazamani@microsoft.com>
Co-authored-by: Sharat Chikkerur <sharat.chikkerur@microsoft.com>
Co-authored-by: kirasoderstrom <kirasoderstrom@gmail.com>
Co-authored-by: Sharat Chikkerur <sharat.chikkerur@gmail.com>
Co-authored-by: Geisa Faustino <32823639+GeisaFaustino@users.noreply.github.com>
Co-authored-by: Ricardo Squassina Lee <8495707+squassina@users.noreply.github.com>
Co-authored-by: Michael Zawacki <mikezawacki@hotmail.com>
Co-authored-by: Anna Zietlow <annamzietlow@gmail.com>
maxkazmsft 2020-07-07 15:49:48 -04:00, committed by GitHub
Parent 15d45fb8c9
Commit 080cf46fe9
No key found matching this signature
GPG key ID: 4AEE18F83AFDEB23
91 changed files: 4088 additions and 881 deletions


@ -0,0 +1,5 @@
{
"subscription_id": "input_sub_id",
"resource_group": "input_resource_group",
"workspace_name": "input_workspace_name"
}

.env.example (new file, 8 lines)

@ -0,0 +1,8 @@
BLOB_ACCOUNT_NAME=
BLOB_CONTAINER_NAME=
BLOB_ACCOUNT_KEY=
BLOB_SUB_ID=
AML_COMPUTE_CLUSTER_NAME=
AML_COMPUTE_CLUSTER_MIN_NODES=
AML_COMPUTE_CLUSTER_MAX_NODES=
AML_COMPUTE_CLUSTER_SKU=

.gitignore (6 changed lines)

@ -115,4 +115,8 @@ interpretation/environment/anaconda/local/src/cv-lib
# Rope project settings
.ropeproject
*.pth
*.pth
# Seismic data files
*.sgy
*.segy

README.md (122 changed lines)

@ -19,7 +19,7 @@ For developers, we offer a more hands-on Quick Start below.
#### Dev Quick Start
There are two ways to get started with the DeepSeismic codebase, which currently focuses on Interpretation:
- if you'd like to get an idea of how our interpretation (segmentation) models are used, simply review the [HRNet demo notebook](https://github.com/microsoft/seismic-deeplearning/blob/master/examples/interpretation/notebooks/Dutch_F3_patch_model_training_and_evaluation.ipynb)
- if you'd like to get an idea of how our interpretation (segmentation) models are used, simply review the [demo notebook](https://github.com/microsoft/seismic-deeplearning/blob/master/examples/interpretation/notebooks/Dutch_F3_patch_model_training_and_evaluation.ipynb)
- to run the code, you'll need to set up a compute environment (which includes setting up a GPU-enabled Linux VM and downloading the appropriate Anaconda Python packages) and download the datasets which you'd like to work with - detailed steps for doing this are provided in the next `Interpretation` section below.
If you run into any problems, chances are your problem has already been solved in the [Troubleshooting](#troubleshooting) section.
@ -27,10 +27,14 @@ If you run into any problems, chances are your problem has already been solved i
The notebook is designed to be run in demo mode by default using a pre-trained model in under 5 minutes on any reasonable Deep Learning GPU such as nVidia K80/P40/P100/V100/TitanV.
### Azure Machine Learning
[Azure Machine Learning](https://docs.microsoft.com/en-us/azure/machine-learning/) enables you to train and deploy your machine learning models and pipelines at scale, and leverage open-source Python frameworks, such as PyTorch, TensorFlow, and scikit-learn. If you are looking at getting started with using the code in this repository with Azure Machine Learning, refer to [Azure Machine Learning How-to](https://github.com/Azure/MachineLearningNotebooks/tree/master/how-to-use-azureml) to get started.
[Azure Machine Learning](https://docs.microsoft.com/en-us/azure/machine-learning/) enables you to train and deploy your machine learning models and pipelines at scale, and leverage open-source Python frameworks, such as PyTorch, TensorFlow, and scikit-learn.
If you are looking at getting started with using the code in this repository with Azure Machine Learning, refer to [Azure Machine Learning How-to](https://github.com/Azure/MachineLearningNotebooks/tree/master/how-to-use-azureml) to get started.
## Interpretation
For seismic interpretation, the repository consists of extensible machine learning pipelines, that shows how you can leverage state-of-the-art segmentation algorithms (UNet, SEResNET, HRNet) for seismic interpretation.
We currently support rectangular data, i.e. 2D and 3D seismic images which form a rectangle in 2D.
We also provide [utilities](./examples/interpretation/segyconverter/README.md) for converting SEGY data with rectangular boundaries into numpy arrays
where everything outside the boundary has been padded to produce a rectangular 3D numpy volume.
To run examples available on the repo, please follow instructions below to:
1) [Set up the environment](#setting-up-environment)
@ -85,23 +89,19 @@ This repository provides examples on how to run seismic interpretation on Dutch
Please make sure you have enough disk space to download either dataset.
We have experiments and notebooks which use either one dataset or the other. Depending on which experiment/notebook you want to run you'll need to download the corresponding dataset. We suggest you start by looking at [HRNet demo notebook](https://github.com/microsoft/seismic-deeplearning/blob/master/examples/interpretation/notebooks/Dutch_F3_patch_model_training_and_evaluation.ipynb) which requires the Dutch F3 dataset.
We have experiments and notebooks which use either one dataset or the other. Depending on which experiment/notebook you want to run you'll need to download the corresponding dataset. We suggest you start by looking at [demo notebook](https://github.com/microsoft/seismic-deeplearning/blob/master/examples/interpretation/notebooks/Dutch_F3_patch_model_training_and_evaluation.ipynb) which requires the Dutch F3 dataset.
#### Dutch F3 Netherlands dataset prep
To download the F3 Netherlands dataset for 2D experiments, please follow the data download instructions at
#### Dutch F3 dataset prep
To download the Dutch F3 dataset for 2D experiments, please follow the data download instructions at
[this github repository](https://github.com/yalaudah/facies_classification_benchmark) (section Dataset). Alternatively, you can use the [download script](scripts/download_dutch_f3.sh)
```
```bash
data_dir="$HOME/data/dutch"
mkdir -p "${data_dir}"
./scripts/download_dutch_f3.sh "${data_dir}"
```
Download scripts also automatically create any subfolders in `${data_dir}` which are needed for the data preprocessing scripts.
At this point, your `${data_dir}` directory should contain a `data` folder, which should look like this:
```
Download scripts also automatically create any subfolders in `${data_dir}` which are needed for the data preprocessing scripts. At this point, your `${data_dir}` directory should contain a `data` folder, which should look like this:
```bash
data
├── splits
├── test_once
@ -113,10 +113,8 @@ data
├── train_labels.npy
└── train_seismic.npy
```
To prepare the data for the experiments (e.g. split into train/val/test), please run the following script:
```
```bash
# change working directory to scripts folder
cd scripts
@ -125,40 +123,66 @@ python prepare_dutchf3.py split_train_val patch --data_dir=${data_dir}/data --la
--stride=50 --patch_size=100 --split_direction=both
# For section-based experiments
python prepare_dutchf3.py split_train_val section --data-dir=${data_dir}/data --label_file=train/train_labels.npy --output_dir=splits \ --split_direction=both
python prepare_dutchf3.py split_train_val section --data-dir=${data_dir}/data --label_file=train/train_labels.npy --output_dir=splits --split_direction=both
# go back to repo root
cd ..
```
Refer to the script itself for more argument options.
#### Bring Your Own Data [BYOD]
##### Bring your own SEG-Y data
If you want to train these models using your own seismic and label data, the files will need to be prepped and
converted to npy files. Typically, the [segyio](https://pypi.org/project/segyio/) library can be used to open SEG-Y files that follow the standard, but more often than not, there are non-standard settings or missing traces that will cause segyio to fail. If this happens with your data, read these notebooks and scripts to help prepare your data files:
* [SEG-Y Data Prep README](contrib/segyconverter/README.md)
* [convert_segy.py utility](contrib/segyconverter/convert_segy.py) - Utility script that can read SEG-Y files with unusual byte header locations and missing traces
* [segy_convert_sample notebook](contrib/segyconverter/segy_convert_sample.ipynb) - Details on SEG-Y data conversion
* [segy_sample_files notebook](contrib/segyconverter/segy_sample_files.ipynb) - Create test SEG-Y files that describe the scenarios that may cause issues when converting the data to numpy arrays
##### Penobscot example
We also offer starter code to convert [Penobscot](https://arxiv.org/abs/1905.04307) dataset (available [here](https://zenodo.org/record/3924682))
into Tensor format used by the Dutch F3 dataset - once converted, you can run Penobscot through the same
mechanisms as the Dutch F3 dataset. The rough sequence of steps is:
```bash
conda activate seismic-interpretation
cd scripts
wget -o /dev/null -O dataset.h5 https://zenodo.org/record/3924682/files/dataset.h5?download=1
# convert penobscot
python byod_penobscot.py --filename dataset.h5 --outdir <where to output data>
# preprocess for experiments
python prepare_dutchf3.py split_train_val patch --data_dir=<outdir from the previous step> --label_file=train/train_labels.npy --output_dir=splits --stride=50 --patch_size=100 --split_direction=both --section_stride=100
```
### Run Examples
#### Notebooks
We provide example notebooks under `examples/interpretation/notebooks/` to demonstrate how to train seismic interpretation models and evaluate them on Penobscot and F3 datasets.
Make sure to run the notebooks in the conda environment we previously set up (`seismic-interpretation`). To register the conda environment in Jupyter, please run:
```
python -m ipykernel install --user --name seismic-interpretation
```
__Optional__: if you plan to develop a notebook, you can install black formatter with the following commands:
```bash
conda activate seismic-interpretation
jupyter nbextension install https://github.com/drillan/jupyter-black/archive/master.zip --user
jupyter nbextension enable jupyter-black-master/jupyter-black
```
This will enable your notebook with a Black formatter button, which then clicked will automatically format a notebook cell which you're in.
#### Experiments
We also provide scripts for a number of experiments we conducted using different segmentation approaches. These experiments are available under `experiments/interpretation`, and can be used as examples. Within each experiment start from the `train.sh` and `test.sh` scripts under the `local/` directory, which invoke the corresponding python scripts, `train.py` and `test.py`. Take a look at the experiment configurations (see Experiment Configuration Files section below) for experiment options and modify if necessary.
We also provide scripts for a number of experiments we conducted using different segmentation approaches. These experiments are available under `experiments/interpretation`, and can be used as examples. Within each experiment start from the `train.sh` and `test.sh` scripts which invoke the corresponding python scripts, `train.py` and `test.py`. Take a look at the experiment configurations (see Experiment Configuration Files section below) for experiment options and modify if necessary.
This release currently supports Dutch F3 local execution
- [F3 Netherlands Patch](experiments/interpretation/dutchf3_patch/README.md)
This release currently supports Dutch F3 local and distributed training
- [Dutch F3 Patch](experiments/interpretation/dutchf3_patch/README.md)
Please note that we use [NVIDIA's NCCL](https://docs.nvidia.com/deeplearning/nccl/install-guide/index.html) library to enable distributed training. Please follow the installation instructions [here](https://docs.nvidia.com/deeplearning/nccl/install-guide/index.html#down) to install NCCL on your system.
#### Configuration Files
We use [YACS](https://github.com/rbgirshick/yacs) configuration library to manage configuration options for the experiments. There are three ways to pass arguments to the experiment scripts (e.g. train.py or test.py):
@ -166,17 +190,22 @@ We use [YACS](https://github.com/rbgirshick/yacs) configuration library to manag
- __default.py__ - A project config file `default.py` is a one-stop reference point for all configurable options, and provides sensible defaults for all arguments. If no arguments are passed to `train.py` or `test.py` script (e.g. `python train.py`), the arguments are by default loaded from `default.py`. Please take a look at `default.py` to familiarize yourself with the experiment arguments the script you run uses.
- __yml config files__ - YAML configuration files under `configs/` are typically created one for each experiment. These are meant to be used for repeatable experiment runs and reproducible settings. Each configuration file only overrides the options that are changing in that experiment (e.g. options loaded from `defaults.py` during an experiment run will be overridden by arguments loaded from the yaml file). As an example, to use yml configuration file with the training script, run:
```
python train.py --cfg "configs/seresnet_unet.yaml"
```
- __command line__ - Finally, options can be passed in through `options` argument, and those will override arguments loaded from the configuration file. We created CLIs for all our scripts (using Python Fire library), so you can pass these options via command-line arguments, like so:
```
python train.py DATASET.ROOT "/home/username/data/dutch/data" TRAIN.END_EPOCH 10
```
#### Training
We run an aggressive cosine annealing schedule which starts with a higher Learning Rate (LR) and gradually lowers it over approximately 60 epochs to zero,
at which point we raise LR back up to its original value and lower it again for about 60 epochs; this process continues 5 times, forming 60*5=300 training epochs in total
in 5 cycles; model with the best frequency-weighted IoU is snapshotted to disc during each cycle. We suggest consulting TensorBoard logs to see which training cycle
produced the best model and use that model during scoring.
For multi-GPU training, we run a linear burn-in LR schedule before starting the 5 cosine cycles, then the training continues the same way as for single-GPU.
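
As a rough illustration of the restart pattern described above, a minimal sketch using PyTorch's built-in `CosineAnnealingWarmRestarts`; the repository's actual scheduler, burn-in, and snapshot logic may differ:

```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingWarmRestarts

model = torch.nn.Linear(10, 2)  # stand-in model for illustration
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
# LR decays from 0.1 towards 0 over 60 epochs, then restarts; 5 cycles = 300 epochs
scheduler = CosineAnnealingWarmRestarts(optimizer, T_0=60, eta_min=0.0)

for epoch in range(300):
    # ... one epoch of training and validation would run here ...
    scheduler.step()  # advances the cosine schedule; restarts every 60 epochs
```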
### Pretrained Models
@ -184,14 +213,6 @@ There are two types of pre-trained models used by this repo:
1. pre-trained models trained on non-seismic Computer Vision datasets which we fine-tune for the seismic domain through re-training on seismic data
2. models which we already trained on seismic data - these are downloaded automatically by our code if needed (again, please see the notebook for a demo above regarding how this is done).
#### HRNet ImageNet weights model
To enable training from scratch on seismic data and to achieve the same results as the benchmarks quoted below you will need to download the HRNet model [pretrained](https://github.com/HRNet/HRNet-Image-Classification) on ImageNet. We are specifically using the [HRNet-W48-C](https://1drv.ms/u/s!Aus8VCZ_C_33dKvqI6pBZlifgJk) pre-trained model; other HRNet variants are also available [here](https://github.com/HRNet/HRNet-Image-Classification) - you can navigate to those from the [main HRNet landing page](https://github.com/HRNet/HRNet-Object-Detection) for object detection.
Unfortunately, the OneDrive location which is used to host the model is using a temporary authentication token, so there is no way for us to script up model download. There are two ways to upload and use the pre-trained HRNet model on DS VM:
- download the model to your local drive using a web browser of your choice and then upload the model to the DS VM using something like `scp`; navigate to Portal and copy DS VM's public IP from the Overview panel of your DS VM (you can search your DS VM by name in the search bar of the Portal) then use `scp local_model_location username@DS_VM_public_IP:./model/save/path` to upload
- alternatively, you can use the same public IP to open remote desktop over SSH to your Linux VM using [X2Go](https://wiki.x2go.org/doku.php/download:start): you can basically open the web browser on your VM this way and download the model to VM's disk
### Viewers (optional)
@ -222,20 +243,21 @@ This section contains benchmarks of different algorithms for seismic interpretat
#### Dutch F3
| Source | Experiment | PA | FW IoU | MCA | V100 (16GB) training time |
| -------------- | --------------------------- | ----- | ------ | ---- | ------------------------- |
| Alaudah et al. | Section-based | 0.905 | 0.817 | .832 | N/A |
| | Patch-based | 0.852 | 0.743 | .689 | N/A |
| DeepSeismic | Patch-based+fixed | .875 | .784 | .740 | 08h 54min |
| | SEResNet UNet+section depth | .910 | .841 | .809 | 55h 02min |
| | HRNet(patch)+patch_depth | .884 | .795 | .739 | 67h 41min |
| | HRNet(patch)+section_depth | .900 | .820 | .767 | 55h 08min |
| Source | Experiment | PA | FW IoU | MCA | V100 (16GB) training time |
| -------------- | ----------------------------------------- | ----- | ------ | ---- | ------------------------- |
| Alaudah et al. | Section-based | 0.905 | 0.817 | .832 | N/A |
| | Patch-based | 0.852 | 0.743 | .689 | N/A |
| DeepSeismic | Patch-based+fixed | .875 | .784 | .740 | 08h 54min |
| | SEResNet UNet+section depth | .910 | .841 | .809 | 55h 02min |
| | HRNet(patch)+patch_depth (experimental) | .884 | .795 | .739 | 67h 41min |
| | HRNet(patch)+section_depth (experimental) | .900 | .820 | .767 | 55h 08min |
Note: these are single-run performance numbers and we expect the results to fluctuate in-between different runs, i.e. some variability is to be expected,
but we expect the performance numbers to be close to these with this codebase.
#### Reproduce benchmarks
In order to reproduce the benchmarks, you will need to navigate to the [experiments](experiments) folder. In there, each of the experiments are split into different folders. To run the Netherlands F3 experiment navigate to the [dutchf3_patch/local](experiments/interpretation/dutchf3_patch/local) folder. In there is a training script [([train.sh](experiments/interpretation/dutchf3_patch/local/train.sh))
which will run the training for any configuration you pass in. Once you have run the training you will need to run the [test.sh](experiments/interpretation/dutchf3_patch/local/test.sh) script. Make sure you specify
the path to the best performing model from your training run, either by passing it in as an argument or altering the YACS config file.
In order to reproduce the benchmarks, you will need to navigate to the [experiments](experiments) folder. In there, each of the experiments are split into different folders. To run the Dutch F3 experiment navigate to the [dutchf3_patch](experiments/interpretation/dutchf3_patch/) folder. In there is a training script [train.sh](experiments/interpretation/dutchf3_patch/train.sh)
which will run the training for any configuration you pass in. If your machine has multiple GPUs, you can run distributed training using the distributed training script [train_distributed.sh](experiments/interpretation/dutchf3_patch/train_distributed.sh). Once you have run the training you will need to run the [test.sh](experiments/interpretation/dutchf3_patch/test.sh) script. Make sure you specify the path to the best performing model from your training run, either by passing it in as an argument or altering the YACS config file.
## Contributing
@ -288,11 +310,11 @@ which will indicate that anaconda folder is `__/anaconda__`. We'll refer to this
<summary><b>Data Science Virtual Machine conda package installation warnings</b></summary>
It could happen that while creating the conda environment defined by `environment/anaconda/local/environment.yml` on an Ubuntu DSVM, one can get multiple warnings like so:
```
```bash
WARNING conda.gateways.disk.delete:unlink_or_rename_to_trash(140): Could not remove or rename /anaconda/pkgs/ipywidgets-7.5.1-py_0/site-packages/ipywidgets-7.5.1.dist-info/LICENSE. Please remove this file manually (you may need to reboot to free file handles)
```
If this happens, similar to instructions above, stop the conda environment creation (type ```Ctrl+C```) and then change recursively the ownership /anaconda directory from root to current user, by running this command:
If this happens, similar to instructions above, stop the conda environment creation (type ```Ctrl+C```) and then change recursively the ownership `/anaconda` directory from root to current user, by running this command:
```bash
sudo chown -R $USER /anaconda
@ -322,17 +344,14 @@ which will indicate that anaconda folder is `__/anaconda__`. We'll refer to this
torch.cuda.is_available()
```
The output should say "True".
If the output is still "False", you may want to try setting your environment variable to specify the device manually - to test this, start a new `ipython` session and type:
The output should say `True`. If the output is still `False`, you may want to try setting your environment variable to specify the device manually - to test this, start a new `ipython` session and type:
```python
import os
os.environ['CUDA_VISIBLE_DEVICES']='0'
import torch
torch.cuda.is_available()
```
The output should say "True" this time. If it does, you can make the change permanent by adding
The output should say `True` this time. If it does, you can make the change permanent by adding:
```bash
export CUDA_VISIBLE_DEVICES=0
```
@ -367,4 +386,3 @@ which will indicate that anaconda folder is `__/anaconda__`. We'll refer to this
5. Navigate back to the Virtual Machine view in Step 2 and click the Start button to start the virtual machine.
</details>


@ -0,0 +1,2 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.


@ -6,3 +6,15 @@ We encourage submissions to the contrib folder, and once they are well-tested, d
Thank you.
#### Azure Machine Learning
If you would like to leverage Azure Machine Learning to create a Training Pipeline with this dataset we have guidance on how do so [here](interpretation/deepseismic_interpretation/azureml_pipelines/README.md)
### HRNet model guidance (experimental for now)
#### HRNet ImageNet weights model
To enable training from scratch on seismic data and to achieve the same results as the benchmarks quoted below you will need to download the HRNet model [pretrained](https://github.com/HRNet/HRNet-Image-Classification) on ImageNet. We are specifically using the [HRNet-W48-C](https://1drv.ms/u/s!Aus8VCZ_C_33dKvqI6pBZlifgJk) pre-trained model; other HRNet variants are also available [here](https://github.com/HRNet/HRNet-Image-Classification) - you can navigate to those from the [main HRNet landing page](https://github.com/HRNet/HRNet-Object-Detection) for object detection.
Unfortunately, the OneDrive location which is used to host the model is using a temporary authentication token, so there is no way for us to script up model download. There are two ways to upload and use the pre-trained HRNet model on DS VM:
- download the model to your local drive using a web browser of your choice and then upload the model to the DS VM using something like `scp`; navigate to Portal and copy DS VM's public IP from the Overview panel of your DS VM (you can search your DS VM by name in the search bar of the Portal) then use `scp local_model_location username@DS_VM_public_IP:./model/save/path` to upload
- alternatively, you can use the same public IP to open remote desktop over SSH to your Linux VM using [X2Go](https://wiki.x2go.org/doku.php/download:start): you can basically open the web browser on your VM this way and download the model to VM's disk


@ -19,7 +19,7 @@ Now you're all set to run training and testing experiments on the F3 Netherlands
### Monitoring progress with TensorBoard
- from the this directory, run `tensorboard --logdir='output'` (all runtime logging information is
written to the `output` folder
- open a web-browser and go to either vmpublicip:6006 if running remotely or localhost:6006 if running locally
- open a web-browser and go to either `<vm_public_ip>:6006` if running remotely or localhost:6006 if running locally
> **NOTE**:If running remotely remember that the port must be open and accessible
More information on Tensorboard can be found [here](https://www.tensorflow.org/get_started/summaries_and_tensorboard#launching_tensorboard).


@ -20,7 +20,7 @@ Also follow instructions for [downloading and preparing](../../../README.md#peno
### Monitoring progress with TensorBoard
- from the this directory, run `tensorboard --logdir='output'` (all runtime logging information is
written to the `output` folder
- open a web-browser and go to either vmpublicip:6006 if running remotely or localhost:6006 if running locally
- open a web-browser and go to either `<vm_public_ip>:6006` if running remotely or `localhost:6006` if running locally
> **NOTE**:If running remotely remember that the port must be open and accessible
More information on Tensorboard can be found [here](https://www.tensorflow.org/get_started/summaries_and_tensorboard#launching_tensorboard).


@ -39,7 +39,7 @@ nohup time python train.py \
# wait for python to pick up the runtime env before switching it
sleep 1
cd ../../dutchf3_patch/local
cd ../../dutchf3_patch
# patch based without skip connections
export CUDA_VISIBLE_DEVICES=2


@ -1,7 +1,11 @@
#!/bin/bash
# number of GPUs to train on
NGPU=8
NGPUS=$(nvidia-smi -L | wc -l)
if [ "$NGPUS" -lt "2" ]; then
echo "ERROR: cannot run distributed training without 2 or more GPUs."
exit 1
fi
# specify pretrained HRNet backbone
PRETRAINED_HRNET='/home/alfred/models/hrnetv2_w48_imagenet_pretrained.pth'
# DATA_F3='/home/alfred/data/dutch/data'
@ -15,9 +19,8 @@ unset CUDA_VISIBLE_DEVICES
# bug to fix conda not launching from a bash shell
source /data/anaconda/etc/profile.d/conda.sh
conda activate seismic-interpretation
export PYTHONPATH=/storage/repos/forks/seismic-deeplearning-1/interpretation:$PYTHONPATH
cd experiments/interpretation/dutchf3_patch/distributed/
cd experiments/interpretation/dutchf3_patch/
# patch based without skip connections
nohup time python -m torch.distributed.launch --nproc_per_node=${NGPU} train.py \


@ -59,7 +59,7 @@ nohup time python test.py \
--cfg "configs/${CONFIG_NAME}.yaml" > ${CONFIG_NAME}_test.log 2>&1 &
sleep 1
cd ../../dutchf3_patch/local
cd ../../dutchf3_patch
# patch based without skip connections
export CUDA_VISIBLE_DEVICES=2
@ -140,7 +140,7 @@ wait
# scoring scripts are in the local folder
# models are in the distributed folder
cd ../../dutchf3_patch/local
cd ../../dutchf3_patch
# patch based without skip connections
export CUDA_VISIBLE_DEVICES=2


@ -0,0 +1,110 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.
# Pull request against these branches will trigger this build
pr:
- master
- staging
- contrib
# Any commit to this branch will trigger the build.
trigger:
- master
- staging
- contrib
jobs:
# partially disable setup for now - done manually on build VM
- job: setup
timeoutInMinutes: 10
displayName: Setup
pool:
name: deepseismicagentpool
steps:
- bash: |
# terminate as soon as any internal script fails
set -e
echo "Running setup..."
pwd
ls
git branch
uname -ra
# TODO: uncomment in the next release to bring back AML
# # setup run environment
# ./scripts/env_reinstall.sh
#
# # use hardcoded root for now because not sure how env changes under ADO policy
# DATA_ROOT="/home/alfred/data_dynamic"
# ./tests/cicd/src/scripts/get_data_for_builds.sh ${DATA_ROOT}
#
# # upload pre-processed data to AML build WASB storage - overwrites by default and auto-creates container name
# azcopy --quiet --recursive \
# --source ${DATA_ROOT}/dutch_f3/data --destination https://${BLOB_ACCOUNT_NAME}.blob.core.windows.net/${BLOB_CONTAINER_NAME}/data \
# --dest-key ${BLOB_ACCOUNT_KEY}
# env:
# BLOB_ACCOUNT_NAME: $(amlbuildstore)
# BLOB_CONTAINER_NAME: "amlbuild"
# BLOB_ACCOUNT_KEY: $(amlbuildstorekey)
#
#
#- job: AML_pipeline_tests
# dependsOn: setup
# timeoutInMinutes: 20
# displayName: AML pipeline tests
# pool:
# name: deepseismicagentpool
# steps:
# - bash: |
# source activate seismic-interpretation
# # TODO: add code which launches your pytest files ("pytest sometest" OR "python test.py")
# # data is in $(amlbuildstore).blob.core.windows.net/amlbuild/data (container amlbuild, virtual folder data)
# # storage key is $(amlbuildstorekey)
# az --version
# az account show
# az login --service-principal -u $SPIDENTITY -p $SPECRET --tenant $SPTENANT
# az account set --subscription $SUB_ID
# mkdir .azureml
# cat <<EOF > .azureml/config.json
# {
# "subscription_id": "$SUB_ID",
# "resource_group": "$RESOURCE_GROUP",
# "workspace_name": "$WORKSPACE_NAME"
# }
# EOF
# pytest interpretation/tests/test_train_pipeline.py || EXITCODE=123
# exit $EXITCODE
# pytest
# env:
# SUB_ID: $(subscription_id)
# RESOURCE_GROUP: $(resource_group)
# WORKSPACE_NAME: $(workspace_name)
# BLOB_ACCOUNT_NAME: $(amlbuildstore)
# BLOB_CONTAINER_NAME: "amlbuild"
# BLOB_ACCOUNT_KEY: $(amlbuildstorekey)
# BLOB_SUB_ID: $(subscription_id)
# AML_COMPUTE_CLUSTER_NAME: "testcluster"
# AML_COMPUTE_CLUSTER_MIN_NODES: "1"
# AML_COMPUTE_CLUSTER_MAX_NODES: "8"
# AML_COMPUTE_CLUSTER_SKU: "STANDARD_NC6"
# SPIDENTITY: $(spidentity)
# SPECRET: $(spsecret)
# SPTENANT: $(sptenant)
# displayName: 'integration tests'
# - job: AML_short_pipeline_test
# dependsOn: setup
# timeoutInMinutes: 5
# displayName: AML short pipeline test
# pool:
# name: deepseismicagentpool
# steps:
# - bash: |
# source activate seismic-interpretation
# # TODO: OPTIONAL! Add a job which launches entire training pipeline for 1 epoch of training (train model for single epoch)
# # if you don't want this then delete the entire job from this file
# python interpretation/deepseismic_interpretation/azureml_pipelines/dev/kickoff_train_pipeline.py --experiment=DEV-train-pipeline-name --orchestrator_config=orchestrator_config="interpretation/deepseismic_interpretation/azureml_pipelines/pipeline_config.json"


@ -0,0 +1,2 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.


@ -31,7 +31,7 @@ class SnapshotHandler:
def __call__(self, engine, to_save):
self._checkpoint_handler(engine, to_save)
if self._snapshot_function():
files = glob.glob(os.path.join(self._model_save_location, self._running_model_prefix + "*"))
files = glob.glob(os.path.join(self._model_save_location, self._running_model_prefix + "*"))
name_postfix = os.path.basename(files[0]).lstrip(self._running_model_prefix)
copyfile(
files[0],


@ -0,0 +1,2 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.


@ -10,6 +10,7 @@ from toolz import curry
from cv_lib.segmentation.dutchf3.utils import np_to_tb
from cv_lib.utils import decode_segmap
def create_summary_writer(log_dir):
writer = SummaryWriter(logdir=log_dir)
return writer
@ -20,9 +21,9 @@ def _transform_image(output_tensor):
return torchvision.utils.make_grid(output_tensor, normalize=True, scale_each=True)
def _transform_pred(output_tensor):
def _transform_pred(output_tensor, n_classes):
output_tensor = output_tensor.squeeze().cpu().numpy()
decoded = decode_segmap(output_tensor)
decoded = decode_segmap(output_tensor, n_classes)
return torchvision.utils.make_grid(np_to_tb(decoded), normalize=False, scale_each=False)
@ -111,5 +112,5 @@ def log_results(engine, evaluator, summary_writer, n_classes, stage):
y_pred[mask == 255] = 255
summary_writer.add_image(f"{stage}/Image", _transform_image(image), epoch)
summary_writer.add_image(f"{stage}/Mask", _transform_pred(mask), epoch)
summary_writer.add_image(f"{stage}/Pred", _transform_pred(y_pred), epoch)
summary_writer.add_image(f"{stage}/Mask", _transform_pred(mask, n_classes), epoch)
summary_writer.add_image(f"{stage}/Pred", _transform_pred(y_pred, n_classes), epoch)


@ -0,0 +1,2 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.


@ -37,4 +37,3 @@ def git_branch():
def git_hash():
repo = Repo(search_parent_directories=True)
return repo.active_branch.commit.hexsha


@ -304,4 +304,5 @@ def get_seg_model(cfg, **kwargs):
cfg.MODEL.IN_CHANNELS == 1
), f"Patch deconvnet is not implemented to accept {cfg.MODEL.IN_CHANNELS} channels. Please only pass 1 for cfg.MODEL.IN_CHANNELS"
model = patch_deconvnet_skip(n_classes=cfg.DATASET.NUM_CLASSES)
return model


@ -1,11 +1,16 @@
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.
import logging
import os
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision
logger = logging.getLogger(__name__)
class FPAv2(nn.Module):
def __init__(self, input_dim, output_dim):


@ -304,4 +304,5 @@ def get_seg_model(cfg, **kwargs):
cfg.MODEL.IN_CHANNELS == 1
), f"Section deconvnet is not implemented to accept {cfg.MODEL.IN_CHANNELS} channels. Please only pass 1 for cfg.MODEL.IN_CHANNELS"
model = section_deconvnet(n_classes=cfg.DATASET.NUM_CLASSES)
return model


@ -304,4 +304,5 @@ def get_seg_model(cfg, **kwargs):
cfg.MODEL.IN_CHANNELS == 1
), f"Section deconvnet is not implemented to accept {cfg.MODEL.IN_CHANNELS} channels. Please only pass 1 for cfg.MODEL.IN_CHANNELS"
model = section_deconvnet_skip(n_classes=cfg.DATASET.NUM_CLASSES)
return model


@ -430,21 +430,20 @@ class HighResolutionNet(nn.Module):
if pretrained and not os.path.isfile(pretrained):
raise FileNotFoundError(f"The file {pretrained} was not found. Please supply correct path or leave empty")
if os.path.isfile(pretrained):
pretrained_dict = torch.load(pretrained)
logger.info("=> loading pretrained model {}".format(pretrained))
model_dict = self.state_dict()
pretrained_dict = {k: v for k, v in pretrained_dict.items() if k in model_dict.keys()}
for k, _ in pretrained_dict.items():
logger.info(
'=> loading {} pretrained model {}'.format(k, pretrained))
logger.info("=> loading {} pretrained model {}".format(k, pretrained))
model_dict.update(pretrained_dict)
self.load_state_dict(model_dict)
def get_seg_model(cfg, **kwargs):
model = HighResolutionNet(cfg, **kwargs)
model.init_weights(cfg.MODEL.PRETRAINED)
if "PRETRAINED" in cfg.MODEL.keys():
model.init_weights(cfg.MODEL.PRETRAINED)
return model


@ -113,4 +113,5 @@ class UNet(nn.Module):
def get_seg_model(cfg, **kwargs):
model = UNet(cfg.MODEL.IN_CHANNELS, cfg.DATASET.NUM_CLASSES)
return model


@ -3,7 +3,6 @@
import numpy as np
def _chw_to_hwc(image_array_numpy):
return np.moveaxis(image_array_numpy, 0, -1)


@ -8,13 +8,17 @@ import numpy as np
from matplotlib import pyplot as plt
def normalize(array):
def normalize(array, MIN, MAX):
"""
Normalizes a segmentation mask array to be in [0,1] range
for use with PIL.Image
Normalizes a segmentation image array by the global range of the data,
MIN and MAX, for use with PIL.Image
"""
min = array.min()
return (array - min) / (array.max() - min)
den = MAX - MIN
if den == 0:
den += np.finfo(float).eps
return (array - MIN) / den
def mask_to_disk(mask, fname, n_classes, cmap_name="rainbow"):
@ -30,15 +34,15 @@ def mask_to_disk(mask, fname, n_classes, cmap_name="rainbow"):
Image.fromarray(cmap(mask / n_classes, bytes=True)).save(fname)
def image_to_disk(mask, fname, cmap_name="seismic"):
def image_to_disk(image, fname, MIN, MAX, cmap_name="seismic"):
"""
write segmentation image to disk using a particular colormap
"""
cmap = plt.get_cmap(cmap_name)
Image.fromarray(cmap(normalize(mask), bytes=True)).save(fname)
Image.fromarray(cmap(normalize(image, MIN, MAX), bytes=True)).save(fname)
def decode_segmap(label_mask, colormap_name="rainbow"):
def decode_segmap(label_mask, n_classes, colormap_name="rainbow"):
"""
Decode segmentation class labels into a colour image
Args:
@ -51,7 +55,7 @@ def decode_segmap(label_mask, colormap_name="rainbow"):
cmap = plt.get_cmap(colormap_name)
# loop over the batch
for i in range(label_mask.shape[0]):
im = Image.fromarray(cmap(normalize(label_mask[i, :, :]), bytes=True)).convert("RGB")
im = Image.fromarray(cmap((label_mask[i, :, :] / n_classes), bytes=True)).convert("RGB")
out[i, :, :, :] = np.array(im).swapaxes(0, 2).swapaxes(1, 2)
return out


@ -1,3 +1,6 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.
import torch
import numpy as np
from pytest import approx


@ -2,7 +2,7 @@ This Docker image allows the user to run the notebooks in this repository on any
# Download the HRNet model:
To run the [`Dutch_F3_patch_model_training_and_evaluation.ipynb`](https://github.com/microsoft/seismic-deeplearning/blob/master/examples/interpretation/notebooks/Dutch_F3_patch_model_training_and_evaluation.ipynb), you will need to manually download the [HRNet-W48-C](https://1drv.ms/u/s!Aus8VCZ_C_33dKvqI6pBZlifgJk) pretrained model. You can follow the instructions [here.](../README.md#pretrained-models).
To run the [`Dutch_F3_patch_model_training_and_evaluation.ipynb`](https://github.com/microsoft/seismic-deeplearning/blob/master/examples/interpretation/notebooks/Dutch_F3_patch_model_training_and_evaluation.ipynb), you will need to manually download the [HRNet-W48-C](https://1drv.ms/u/s!Aus8VCZ_C_33dKvqI6pBZlifgJk) pretrained model. You can follow the instructions [here](../README.md#pretrained-models).
If you are using an Azure Virtual Machine to run this code, you can download the model to your local machine, and then copy it to your Azure VM through the command below. Please make sure you update the `<azureuser>` and `<azurehost>` fields.
```bash


@ -12,7 +12,7 @@ dependencies:
- torchvision>=0.5.0
- pandas==0.25.3
- scikit-learn==0.21.3
- tensorflow==2.0
- tensorflow==2.1.0
- opt-einsum>=2.3.2
- tqdm==4.39.0
- itkwidgets==0.23.1
@ -39,4 +39,3 @@ dependencies:
- jupytext==1.3.0
- validators
- pyyaml


@ -1,5 +1,5 @@
The folder contains notebook examples illustrating the use of segmentation algorithms on openly available datasets. Make sure you have followed the [set up instructions](../../README.md) before running these examples. We provide the following notebook examples
* [Dutch F3 dataset](notebooks/Dutch_F3_patch_model_training_and_evaluation.ipynb): This notebook illustrates section and patch based segmentation approaches on the [Dutch F3](https://terranubis.com/datainfo/Netherlands-Offshore-F3-Block-Complete) open dataset. This notebook uses denconvolution based segmentation algorithm on 2D patches. The notebook will guide you through visualization of the input volume, setting up model training and evaluation.
* [Dutch F3 dataset](notebooks/Dutch_F3_patch_model_training_and_evaluation.ipynb): This notebook illustrates section and patch based segmentation approaches on the [Dutch F3](https://terranubis.com/datainfo/Netherlands-Offshore-F3-Block-Complete) open dataset. This notebook uses deconvolution based segmentation algorithm on 2D patches. The notebook will guide you through visualization of the input volume, setting up model training and evaluation.
To understand the configuration files and the dafault parameters refer to this [section in the top level README](../../README.md#configuration-files)
To understand the configuration files and the default parameters refer to this [section in the top level README](../../README.md#configuration-files)


@ -59,7 +59,7 @@
"source": [
"# load an existing experiment configuration file\n",
"CONFIG_FILE = (\n",
" \"../../../experiments/interpretation/dutchf3_patch/local/configs/hrnet.yaml\"\n",
" \"../../../experiments/interpretation/dutchf3_patch/configs/seresnet_unet.yaml\"\n",
")\n",
"# number of images to score\n",
"N_EVALUATE = 20\n",
@ -239,7 +239,7 @@
"max_snapshots = config.TRAIN.SNAPSHOTS\n",
"papermill = False\n",
"dataset_root = config.DATASET.ROOT\n",
"model_pretrained = config.MODEL.PRETRAINED"
"model_pretrained = config.MODEL.PRETRAINED if \"PRETRAINED\" in config.MODEL.keys() else None"
]
},
{
@ -511,23 +511,17 @@
"TrainPatchLoader = get_patch_loader(config)\n",
"\n",
"train_set = TrainPatchLoader(\n",
" config.DATASET.ROOT,\n",
" config.DATASET.NUM_CLASSES,\n",
" config,\n",
" split=\"train\",\n",
" is_transform=True,\n",
" stride=config.TRAIN.STRIDE,\n",
" patch_size=config.TRAIN.PATCH_SIZE,\n",
" augmentations=train_aug,\n",
")\n",
"n_classes = train_set.n_classes\n",
"logger.info(train_set)\n",
"val_set = TrainPatchLoader(\n",
" config.DATASET.ROOT,\n",
" config.DATASET.NUM_CLASSES,\n",
" config,\n",
" split=\"val\",\n",
" is_transform=True,\n",
" stride=config.TRAIN.STRIDE,\n",
" patch_size=config.TRAIN.PATCH_SIZE,\n",
" augmentations=val_aug,\n",
")\n",
"\n",
@ -865,9 +859,17 @@
"outputs": [],
"source": [
"# use the model which we just fine-tuned\n",
"opts = [\"TEST.MODEL_PATH\", path.join(output_dir, f\"model_f3_nb_seg_hrnet_{train_len}.pth\")]\n",
"if \"hrnet\" in config.MODEL.NAME:\n",
" model_snapshot_name = f\"model_f3_nb_seg_hrnet_{train_len}.pth\"\n",
"elif \"resnet\" in config.MODEL.NAME: \n",
" model_snapshot_name = f\"model_f3_nb_resnet_unet_{train_len}.pth\"\n",
"else:\n",
" raise NotImplementedError(\"We don't support testing this model in this notebook yet\")\n",
" \n",
"opts = [\"TEST.MODEL_PATH\", path.join(output_dir, model_snapshot_name)]\n",
"# uncomment the line below to use the pre-trained model instead\n",
"# opts = [\"TEST.MODEL_PATH\", config.MODEL.PRETRAINED]\n",
"\n",
"config.merge_from_list(opts)"
]
},
@ -877,7 +879,9 @@
"metadata": {},
"outputs": [],
"source": [
"model.load_state_dict(torch.load(config.TEST.MODEL_PATH))\n",
"trained_model = torch.load(config.TEST.MODEL_PATH)\n",
"trained_model = {k.replace(\"module.\", \"\"): v for (k, v) in trained_model.items()}\n",
"model.load_state_dict(trained_model, strict=True)\n",
"model = model.to(device)"
]
},
@ -932,7 +936,7 @@
"# Load test data\n",
"TestSectionLoader = get_test_loader(config)\n",
"test_set = TestSectionLoader(\n",
" config.DATASET.ROOT, config.DATASET.NUM_CLASSES, split=split, is_transform=True, augmentations=section_aug\n",
" config, split=split, is_transform=True, augmentations=section_aug\n",
")\n",
"# needed to fix this bug in pytorch https://github.com/pytorch/pytorch/issues/973\n",
"# one of the workers will quit prematurely\n",


@ -23,9 +23,9 @@ class runningScore(object):
def _fast_hist(self, label_true, label_pred, n_class):
mask = (label_true >= 0) & (label_true < n_class)
hist = np.bincount(
n_class * label_true[mask].astype(int) + label_pred[mask], minlength=n_class ** 2,
).reshape(n_class, n_class)
hist = np.bincount(n_class * label_true[mask].astype(int) + label_pred[mask], minlength=n_class ** 2,).reshape(
n_class, n_class
)
return hist
def update(self, label_trues, label_preds):
@ -152,9 +152,7 @@ def compose_processing_pipeline(depth, aug=None):
def _generate_batches(h, w, ps, patch_size, stride, batch_size=64):
hdc_wdx_generator = itertools.product(
range(0, h - patch_size + ps, stride), range(0, w - patch_size + ps, stride)
)
hdc_wdx_generator = itertools.product(range(0, h - patch_size + ps, stride), range(0, w - patch_size + ps, stride))
for batch_indexes in itertoolz.partition_all(batch_size, hdc_wdx_generator):
yield batch_indexes
@ -166,9 +164,7 @@ def output_processing_pipeline(config, output):
_, _, h, w = output.shape
if config.TEST.POST_PROCESSING.SIZE != h or config.TEST.POST_PROCESSING.SIZE != w:
output = F.interpolate(
output,
size=(config.TEST.POST_PROCESSING.SIZE, config.TEST.POST_PROCESSING.SIZE),
mode="bilinear",
output, size=(config.TEST.POST_PROCESSING.SIZE, config.TEST.POST_PROCESSING.SIZE), mode="bilinear",
)
if config.TEST.POST_PROCESSING.CROP_PIXELS > 0:
@ -183,15 +179,7 @@ def output_processing_pipeline(config, output):
def patch_label_2d(
model,
img,
pre_processing,
output_processing,
patch_size,
stride,
batch_size,
device,
num_classes,
model, img, pre_processing, output_processing, patch_size, stride, batch_size, device, num_classes,
):
"""Processes a whole section"""
img = torch.squeeze(img)
@ -205,19 +193,14 @@ def patch_label_2d(
# generate output:
for batch_indexes in _generate_batches(h, w, ps, patch_size, stride, batch_size=batch_size):
batch = torch.stack(
[
pipe(img_p, _extract_patch(hdx, wdx, ps, patch_size), pre_processing)
for hdx, wdx in batch_indexes
],
[pipe(img_p, _extract_patch(hdx, wdx, ps, patch_size), pre_processing) for hdx, wdx in batch_indexes],
dim=0,
)
model_output = model(batch.to(device))
for (hdx, wdx), output in zip(batch_indexes, model_output.detach().cpu()):
output = output_processing(output)
output_p[
:, :, hdx + ps : hdx + ps + patch_size, wdx + ps : wdx + ps + patch_size
] += output
output_p[:, :, hdx + ps : hdx + ps + patch_size, wdx + ps : wdx + ps + patch_size] += output
# crop the output_p in the middle
output = output_p[:, :, ps:-ps, ps:-ps]
@ -325,26 +308,22 @@ def download_pretrained_model(config):
elif "penobscot" in config.DATASET.ROOT:
dataset = "penobscot"
else:
raise NameError(
"Unknown dataset name. Only dutch f3 and penobscot are currently supported."
)
raise NameError("Unknown dataset name. Only dutch f3 and penobscot are currently supported.")
if "hrnet" in config.MODEL.NAME:
model = "hrnet"
elif "deconvnet" in config.MODEL.NAME:
model = "deconvnet"
elif "unet" in config.MODEL.NAME:
model = "unet"
elif "resnet" in config.MODEL.NAME:
model = "seresnetunet"
else:
raise NameError(
"Unknown model name. Only hrnet, deconvnet, and unet are currently supported."
)
raise NameError("Unknown model name. Only hrnet, deconvnet, and seresnet_unet are currently supported.")
# check if the user already supplied a URL, otherwise figure out the URL
if validators.url(config.MODEL.PRETRAINED):
if "PRETRAINED" in config.MODEL.keys() and validators.url(config.MODEL.PRETRAINED):
url = config.MODEL.PRETRAINED
print(f"Will use user-supplied URL of '{url}'")
elif os.path.isfile(config.MODEL.PRETRAINED):
elif "PRETRAINED" in config.MODEL.keys() and os.path.isfile(config.MODEL.PRETRAINED):
url = None
print(f"Will use user-supplied file on local disk of '{config.MODEL.PRETRAINED}'")
else:
@ -365,38 +344,26 @@ def download_pretrained_model(config):
url = "https://deepseismicsharedstore.blob.core.windows.net/master-public-models/dutchf3_hrnet_patch_section_depth.pth"
elif model == "hrnet" and config.TRAIN.DEPTH == "patch":
url = "https://deepseismicsharedstore.blob.core.windows.net/master-public-models/dutchf3_hrnet_patch_patch_depth.pth"
elif (
model == "deconvnet"
and "skip" in config.MODEL.NAME
and config.TRAIN.DEPTH == "none"
):
elif model == "deconvnet" and "skip" in config.MODEL.NAME and config.TRAIN.DEPTH == "none":
url = "http://deepseismicsharedstore.blob.core.windows.net/master-public-models/dutchf3_deconvnetskip_patch_no_depth.pth"
elif (
model == "deconvnet"
and "skip" not in config.MODEL.NAME
and config.TRAIN.DEPTH == "none"
):
elif model == "deconvnet" and "skip" not in config.MODEL.NAME and config.TRAIN.DEPTH == "none":
url = "http://deepseismicsharedstore.blob.core.windows.net/master-public-models/dutchf3_deconvnet_patch_no_depth.pth"
elif model == "unet" and config.TRAIN.DEPTH == "section":
url = "http://deepseismicsharedstore.blob.core.windows.net/master-public-models/dutchf3_seresnetunet_patch_section_depth.pth"
elif model == "seresnetunet" and config.TRAIN.DEPTH == "section":
url = "https://deepseismicsharedstore.blob.core.windows.net/master-public-models/dutchf3_seresnetunet_patch_section_depth.pth"
else:
raise NotImplementedError(
"We don't store a pretrained model for Dutch F3 for this model combination yet."
)
else:
raise NotImplementedError(
"We don't store a pretrained model for this dataset/model combination yet."
)
raise NotImplementedError("We don't store a pretrained model for this dataset/model combination yet.")
print(f"Could not find a user-supplied URL, downloading from '{url}'")
# make sure the model_dir directory is writeable
model_dir = config.TRAIN.MODEL_DIR
if not os.path.isdir(os.path.dirname(model_dir)) or not os.access(
os.path.dirname(model_dir), os.W_OK
):
if not os.path.isdir(os.path.dirname(model_dir)) or not os.access(os.path.dirname(model_dir), os.W_OK):
print(f"Cannot write to TRAIN.MODEL_DIR={config.TRAIN.MODEL_DIR}")
home = str(pathlib.Path.home())
model_dir = os.path.join(home, "models")
@ -407,14 +374,10 @@ def download_pretrained_model(config):
if url:
# Download the pretrained model:
pretrained_model_path = os.path.join(
model_dir, "pretrained_" + dataset + "_" + model + ".pth"
)
pretrained_model_path = os.path.join(model_dir, "pretrained_" + dataset + "_" + model + ".pth")
# always redownload the model
print(
f"Downloading the pretrained model to '{pretrained_model_path}'. This will take a few mintues.. \n"
)
print(f"Downloading the pretrained model to '{pretrained_model_path}'. This will take a few mintues.. \n")
urllib.request.urlretrieve(url, pretrained_model_path)
print("Model successfully downloaded.. \n")
else:
@ -424,6 +387,11 @@ def download_pretrained_model(config):
# Update config MODEL.PRETRAINED
# TODO: Only HRNet uses a pretrained model currently.
# issue https://github.com/microsoft/seismic-deeplearning/issues/267
# now that we have a pre-trained model, we can set it
if "PRETRAINED" not in config.MODEL.keys():
config.MODEL["PRETRAINED"] = "dummy"
opts = [
"MODEL.PRETRAINED",
pretrained_model_path,
@ -432,6 +400,7 @@ def download_pretrained_model(config):
"TEST.MODEL_PATH",
pretrained_model_path,
]
config.merge_from_list(opts)
return config


@ -0,0 +1,155 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Copyright (c) Microsoft Corporation.\n",
"\n",
"Licensed under the MIT License."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Generate Synthetic SEG-Y files for testing\n",
"\n",
"This notebook builds the test data used by the convert_segy unit tests. It covers just a few of the SEG-Y files that could be encountered if you bring your own SEG-Y files for training. This is not a comprehensive set of files, so there may still be situations where segyio or the convert_segy.py utility would fail to load the SEG-Y data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import deepseismic_interpretation.segyconverter.utils.create_segy as utils\n",
"import segyio"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create sample SEG-Y files for testing\n",
"\n",
"1. Control file that represents perfect data, with no missing traces.\n",
"2. Missing traces on the top-left and bottom right of the geographic field w/ inline sorting\n",
"3. Missing traces on the top-left and bottom right of the geographic field w/ crossline sorting\n",
"4. Missing trace in the center of the geographic field w/ inline sorting"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Control File\n",
"\n",
"Create a file that has a cuboid shape with traces at all inline/crosslines"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"controlfile = './normalsegy.segy'\n",
"utils.create_segy_file(lambda il, xl: True, controlfile)\n",
"utils.show_segy_details(controlfile)\n",
"utils.load_segy_with_geometry(controlfile)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Inline Error File\n",
"\n",
"inlineerror.segy will throw an error that inlines are not unique because it assumes the same number of inlines per crossline"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"inlinefile = './inlineerror.segy'\n",
"utils.create_segy_file(lambda il, xl: not ((il < 20 and xl < 125) or (il > 40 and xl > 250)),\n",
" inlinefile, segyio.TraceSortingFormat.INLINE_SORTING)\n",
"utils.show_segy_details(inlinefile)\n",
"# Cannot load this file with inferred geometry; segyio will fail\n",
"# utils.load_segy_with_geometry(inlinefile)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Crossline Error File\n",
"\n",
"xlineerror.segy will throw an error that crosslines are not unique because it assumes the same number of crosslines per inline"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"xlineerrorfile = './xlineerror.segy'\n",
"utils.create_segy_file(lambda il, xl: not ((il < 20 and xl < 125) or (il > 40 and xl > 250)),\n",
" xlineerrorfile, segyio.TraceSortingFormat.CROSSLINE_SORTING)\n",
"utils.show_segy_details(xlineerrorfile)\n",
"# Cannot load this file with inferred geometry; segyio will fail\n",
"# utils.load_segy_with_geometry(xlineerrorfile)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Cube hole SEG-Y file\n",
"\n",
"When collecting seismic data, unless in an area of open ocean, it is rare to be able to collect all trace data from a rectangular field, so a complete, uniform set of traces is uncommon.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"cubehole_segyfile = './cubehole.segy'\n",
"utils.create_segy_file(lambda il, xl: not ((20 < il < 30) and (150 < xl < 250)),\n",
" cubehole_segyfile, segyio.TraceSortingFormat.INLINE_SORTING)\n",
"utils.show_segy_details(cubehole_segyfile)\n",
"# Cannot load this file with inferred geometry; segyio will fail\n",
"# utils.load_segy_with_geometry(cubehole_segyfile)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "seismic-interpretation",
"language": "python",
"name": "seismic-interpretation"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.7"
}
},
"nbformat": 4,
"nbformat_minor": 2
}


@ -0,0 +1,158 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Copyright (c) Microsoft Corporation.\n",
"\n",
"Licensed under the MIT License."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Converting SEG-Y files for training or validation\n",
"\n",
"This notebook describes how to prepare your own SEG-Y files for training.\n",
"\n",
"If you dont have your owns SEG-Y file, you can run *01_segy_sample_files.jpynb* notebook for generating synthetics files.\n",
"\n",
"To use your own SEG-Y volumes to train models in the DeepSeismic repo, you need to bring at least one pair of ground truth and label data SEG-Y files where the files have an identical shape. The seismic data file contains typical SEG-Y post stack data traces and the label data file should contain an integer class label at every sample in each trace.\n",
"\n",
"For each SEG-Y file, run the convert_segy.py script to create a npy file. Optionally, you can normalize and/or clip the data in the SEG-Y file as it is converted to npy.\n",
"\n",
"Once you have a pair of ground truth and related label npy files, you can edit one of the training scripts in the repo to use these files. One example is the [dutchf3 train.py](../../experiments/interpretation/dutchf3_patch/train.py) script.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from itkwidgets import view\n",
"import numpy as np\n",
"import os\n",
"\n",
"SEGYFILE= './normalsegy.segy'\n",
"PREFIX='normalsegy'\n",
"OUTPUTDIR='data'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## convert_segy.py usage"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!python ./convert_segy.py --help"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Example run\n",
"\n",
"Convert the SEG-Y file to a single output npy file in the local directory, clipping the data but not normalizing it"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!python ./convert_segy.py --prefix {PREFIX} --input_file {SEGYFILE} --output_dir {OUTPUTDIR} --clip"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Post processing instructions\n",
"\n",
"There should now be one npy file in the local directory named normalsegy_10_100_00000.npy. The numbers relate to the anchor point\n",
"of the array. In this case, inline 10, crossline 100, and depth 0 is the origin [0,0,0] of the array.\n",
"\n",
"Rerun the convert_segy script for the related label file"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"npydata = np.load(f\"./{OUTPUTDIR}/{PREFIX}_10_100_00000.npy\")\n",
"view(npydata, slicing_planes=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Prepare train/test splits file\n",
"\n",
"Once the data and label SEG-Y files are converted to npy, use the `prepare_dutchf3.py` script on the resulting npy file to generate the list of patches as input to the train script.\n",
"\n",
"In the next cell is an example of how to run this script. Note that we are using the same npy file (normalsegy_10_100_00000.npy) as both seismic and labels because it is only for illustration purposes.\n",
"\n",
"Also, once you've prepared the data set, you'll find your files in the following directory tree: \n",
"\n",
"data_dir \n",
"├── output_dir \n",
"├── split \n",
"│&emsp; ├── section_train.txt \n",
"│&emsp; ├── section_train_val.txt \n",
"│&emsp; ├── section_val.txt "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!python ../../../scripts/prepare_dutchf3.py split_train_val section --data_dir={OUTPUTDIR} --label_file={PREFIX}_10_100_00000.npy --output_dir=splits --section_stride=2 --log_config=None --split_direction=both"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "seismic-interpretation",
"language": "python",
"name": "seismic-interpretation"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.7"
}
},
"nbformat": 4,
"nbformat_minor": 2
}


@ -0,0 +1,67 @@
# SEG-Y Data Utilities
SEG-Y files can have a lot of variability, which makes it difficult to infer the geometry when converting to npy. The segyio module attempts to do so but fails if there are missing traces in the file (which happens regularly). This utility reads traces using segyio with geometry inference turned off to avoid data loading errors and uses its own logic to place traces into a numpy array. If traces are missing, the values of the npy array at those locations are set to zero.
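To make the trace-placement idea concrete, here is a minimal sketch (not the utility's actual implementation) that reads traces with segyio's geometry inference turned off and drops them into a zero-initialized cube. The standard inline/crossline header words and the helper name are assumptions for illustration only:
```python
import numpy as np
import segyio

def load_cube_ignore_geometry(path):
    """Read every trace and place it into a zero-filled cube keyed by inline/crossline."""
    with segyio.open(path, ignore_geometry=True) as f:
        # assume the standard SEG-Y header words for inline/crossline numbers
        ilines = f.attributes(segyio.TraceField.INLINE_3D)[:]
        xlines = f.attributes(segyio.TraceField.CROSSLINE_3D)[:]
        il_index = {il: i for i, il in enumerate(sorted(set(ilines)))}
        xl_index = {xl: i for i, xl in enumerate(sorted(set(xlines)))}
        # missing traces simply stay at zero
        cube = np.zeros((len(il_index), len(xl_index), f.samples.size), dtype=np.float32)
        for t in range(f.tracecount):
            cube[il_index[ilines[t]], xl_index[xlines[t]], :] = f.trace[t]
    return cube
```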
## convert_segy.py script
The `convert_segy.py` script can work with SEG-Y files and output data on local disk. This script will process SEG-Y files regardless of their structure and output npy files for use in training/scoring. In addition to the npy files, it will write a json file that includes the standard deviation and mean of the original data. The script can additionally use those statistics to normalize and clip the data if indicated by the command line parameters.
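As a rough sketch of what clip-and-normalize with the stored mean and standard deviation could look like (the exact clip range and scaling used by the script are assumptions, not its verified behavior):
```python
import numpy as np

def clip_and_normalize(cube, mean, stddev, k=3.0):
    """Clip to mean +/- k standard deviations, then scale to roughly [-1, 1]."""
    clipped = np.clip(cube, mean - k * stddev, mean + k * stddev)
    # tiny epsilon guards against division by zero for constant data
    return (clipped - mean) / (k * stddev + np.finfo(np.float32).eps)
```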
The resulting npy files will use the following naming convention:
```<prefix>_<inline id>_<xline id>_<depth>.npy```
These inline and xline ids are the upper-left location of the data contained in the file and can later be used to identify where the npy file is located in the SEG-Y data.
This script uses [segyio](https://github.com/equinor/segyio) for interaction with SEG-Y.
To use this script, first activate the `seismic-interpretation` environment defined in this repository's setup instructions in the main [README](../../../README.md) file:
`conda activate seismic-interpretation`
Then follow these examples:
1) Convert a SEG-Y file to a single npy file of the same dimensions:
```
python ./convert_segy.py --input_file {SEGYFILE} --prefix {PREFIX} --output_dir .
```
2) Convert a SEG-Y file to a single npy file of the same dimensions, clip and normalize the results:
```
python ./convert_segy.py --input_file {SEGYFILE} --prefix {PREFIX} --output_dir . --normalize
```
3) Convert a SEG-Y file to a single npy file of the same dimensions, clip but do not normalize the results:
```
python ./convert_segy.py --input_file {SEGYFILE} --prefix {PREFIX} --output_dir . --clip
```
4) Split a single SEG-Y file into a set of npy files, each npy array with dimension (100,100,100)
```
python ./convert_segy.py --input_file {SEGYFILE} --prefix {PREFIX} --output_dir . --cube_size 100
```
There are several additional command line arguments that may be needed to load specific SEG-Y files (e.g. the byte locations of the data headers may differ). Run `--help` to review the additional arguments if needed.
Documentation about the SEG-Y format can be found [here](https://seg.org/Portals/0/SEG/News%20and%20Resources/Technical%20Standards/seg_y_rev2_0-mar2017.pdf).
Regarding data headers, we've found that, in practice, the inline and crossline header location standards aren't always followed.
As a result, you will need to print out the text header of the SEG-Y file and read the comments to determine what location was used.
As far as we know, there is no way to programmatically extract this info from the file.
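One quick way to print the textual header with segyio is shown below (a small illustrative sketch; the file name is a placeholder):
```python
import segyio

# "mydata.segy" is a placeholder for your own file
with segyio.open("mydata.segy", ignore_geometry=True) as f:
    # print the (first) textual header wrapped into 80-character lines
    print(segyio.tools.wrap(f.text[0]))
```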
NOTE: Missing traces will be filled in with zero values. A future enhancement to this script should allow for specific values to be used that can be ignored during training.
## Testing
Run [pytest](https://docs.pytest.org/en/latest/getting-started.html) from the segyconverter directory to run the local unit tests.
To run all scripts available in the [test folder](../../../interpretation/deepseismic_interpretation/segyconverter/test):
```
pytest test
```
To run a specific script:
```
pytest test/<script_name.py>
```


@ -0,0 +1 @@
../../../interpretation/deepseismic_interpretation/segyconverter/convert_segy.py


@ -1,29 +1,30 @@
## F3 Netherlands Patch Experiments
## Dutch F3 Patch Experiments
In this folder are training and testing scripts that work on the F3 Netherlands dataset.
You can run five different models on this dataset:
* [HRNet](local/configs/hrnet.yaml)
* [SEResNet](local/configs/seresnet_unet.yaml)
* [UNet](local/configs/unet.yaml)
* [PatchDeconvNet](local/configs/patch_deconvnet.yaml)
* [PatchDeconvNet-Skip](local/configs/patch_deconvnet_skip.yaml)
* [HRNet](configs/hrnet.yaml)
* [SEResNet](configs/seresnet_unet.yaml)
* [UNet](configs/unet.yaml)
* [PatchDeconvNet](configs/patch_deconvnet.yaml)
* [PatchDeconvNet-Skip](configs/patch_deconvnet_skip.yaml)
All these models take 2D patches of the dataset as input and provide predictions for those patches. The patches need to be stitched together to form a whole inline or crossline.
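For intuition, below is a minimal numpy sketch of how overlapping per-patch class scores can be stitched back into a full section. The repo's own stitching lives in its utilities (e.g. `patch_label_2d`), so the function here is illustrative only and its shapes and name are assumptions:
```python
import numpy as np

def stitch_patches(patch_scores, coords, section_shape, n_classes, patch_size):
    """Accumulate (n_classes, patch_size, patch_size) scores into a full section and take argmax."""
    scores = np.zeros((n_classes,) + section_shape, dtype=np.float32)
    counts = np.zeros(section_shape, dtype=np.float32)
    for p, (h, w) in zip(patch_scores, coords):
        scores[:, h : h + patch_size, w : w + patch_size] += p
        counts[h : h + patch_size, w : w + patch_size] += 1.0
    # average overlapping contributions, then pick the most likely class per pixel
    return (scores / np.maximum(counts, 1.0)).argmax(axis=0)
```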
To understand the configuration files and the dafault parameters refer to this [section in the top level README](../../../README.md#configuration-files)
To understand the configuration files and the default parameters refer to this [section in the top level README](../../../README.md#configuration-files)
### Setup
Please set up a conda environment following the instructions in the top-level [README.md](../../../README.md#setting-up-environment) file.
Also follow instructions for [downloading and preparing](../../../README.md#f3-Netherlands) the data.
Please set up a conda environment following the instructions in the top-level [README.md](../../../README.md#setting-up-environment) file. Also follow instructions for [downloading and preparing](../../../README.md#f3-Netherlands) the data.
### Running experiments
Now you're all set to run training and testing experiments on the F3 Netherlands dataset. Please start from the `train.sh` and `test.sh` scripts under the `local/` directory, which invoke the corresponding python scripts. Take a look at the project configurations in (e.g in `default.py`) for experiment options and modify if necessary.
Now you're all set to run training and testing experiments on the Dutch F3 dataset. Please start from the `train.sh` and `test.sh` scripts, which invoke the corresponding Python scripts. If you have a multi-GPU machine, you can also train the model in a distributed fashion by running `train_distributed.sh`. Take a look at the project configuration (e.g. in `default.py`) for experiment options and modify if necessary.
Please note that we use [NVIDIA's NCCL](https://docs.nvidia.com/deeplearning/nccl/install-guide/index.html) library to enable distributed training. Please follow the installation instructions [here](https://docs.nvidia.com/deeplearning/nccl/install-guide/index.html#down) to install NCCL on your system.
### Monitoring progress with TensorBoard
- from the this directory, run `tensorboard --logdir='output'` (all runtime logging information is
written to the `output` folder
- open a web-browser and go to either vmpublicip:6006 if running remotely or localhost:6006 if running locally
- from this directory, run `tensorboard --logdir='output'` (all runtime logging information is written to the `output` folder)
- open a web-browser and go to either `<vm_public_ip>:6006` if running remotely or `localhost:6006` if running locally
> **NOTE**: If running remotely, remember that the port must be open and accessible.
More information on TensorBoard can be found [here](https://www.tensorflow.org/get_started/summaries_and_tensorboard#launching_tensorboard).


@ -0,0 +1,13 @@
git+https://github.com/microsoft/seismic-deeplearning.git@contrib#egg=cv_lib&subdirectory=cv_lib
git+https://github.com/microsoft/seismic-deeplearning.git#egg=deepseismic-interpretation&subdirectory=interpretation
opencv-python==4.1.2.30
numpy>=1.17.0
torch==1.4.0
pytorch-ignite==0.3.0.dev20191105 # pre-release until stable available
fire==0.2.1
albumentations==0.4.3
toolz==0.10.0
segyio==1.8.8
scipy==1.1.0
gitpython==3.0.5
yacs==0.1.6


@ -16,6 +16,8 @@ DATASET:
NUM_CLASSES: 6
ROOT: "/home/username/data/dutch/data"
CLASS_WEIGHTS: [0.7151, 0.8811, 0.5156, 0.9346, 0.9683, 0.9852]
MIN: -1
MAX: 1
MODEL:


@ -14,6 +14,8 @@ DATASET:
NUM_CLASSES: 6
ROOT: /home/username/data/dutch/data
CLASS_WEIGHTS: [0.7151, 0.8811, 0.5156, 0.9346, 0.9683, 0.9852]
MIN: -1
MAX: 1
MODEL:
NAME: patch_deconvnet


@ -14,6 +14,8 @@ DATASET:
NUM_CLASSES: 6
ROOT: /home/username/data/dutch/data
CLASS_WEIGHTS: [0.7151, 0.8811, 0.5156, 0.9346, 0.9683, 0.9852]
MIN: -1
MAX: 1
MODEL:
NAME: patch_deconvnet_skip


@ -9,12 +9,15 @@ WORKERS: 4
PRINT_FREQ: 10
LOG_CONFIG: logging.conf
SEED: 2019
OPENCV_BORDER_CONSTANT: 0
DATASET:
NUM_CLASSES: 6
ROOT: /home/username/data/dutch/data
ROOT: "/home/username/data/dutch/data"
CLASS_WEIGHTS: [0.7151, 0.8811, 0.5156, 0.9346, 0.9683, 0.9852]
MIN: -1
MAX: 1
MODEL:
NAME: resnet_unet


@ -17,6 +17,8 @@ DATASET:
NUM_CLASSES: 6
ROOT: '/home/username/data/dutch/data'
CLASS_WEIGHTS: [0.7151, 0.8811, 0.5156, 0.9346, 0.9683, 0.9852]
MIN: -1
MAX: 1
MODEL:
NAME: resnet_unet


@ -37,6 +37,8 @@ _C.DATASET = CN()
_C.DATASET.ROOT = ""
_C.DATASET.NUM_CLASSES = 6
_C.DATASET.CLASS_WEIGHTS = [0.7151, 0.8811, 0.5156, 0.9346, 0.9683, 0.9852]
_C.DATASET.MIN = -1
_C.DATASET.MAX = 1
# common params for NETWORK
_C.MODEL = CN()


@ -1,2 +0,0 @@
#!/bin/bash
python test.py --cfg "configs/seresnet_unet.yaml"


@ -1,2 +0,0 @@
#!/bin/bash
python train.py --cfg "configs/seresnet_unet.yaml"


@ -201,10 +201,23 @@ def _output_processing_pipeline(config, output):
def _patch_label_2d(
model, img, pre_processing, output_processing, patch_size, stride, batch_size, device, num_classes, split, debug
model,
img,
pre_processing,
output_processing,
patch_size,
stride,
batch_size,
device,
num_classes,
split,
debug,
MIN,
MAX,
):
"""Processes a whole section
"""
img = torch.squeeze(img)
h, w = img.shape[-2], img.shape[-1] # height and width
@ -228,19 +241,19 @@ def _patch_label_2d(
# dump the data right before it's being put into the model and after scoring
if debug:
outdir = f"debug/batch_{split}"
outdir = f"debug/test/batch_{split}"
generate_path(outdir)
for i in range(batch.shape[0]):
path_prefix = f"{outdir}/{batch_indexes[i][0]}_{batch_indexes[i][1]}"
model_output = model_output.detach().cpu()
# save image:
image_to_disk(np.array(batch[i, 0, :, :]), path_prefix + "_img.png")
image_to_disk(np.array(batch[i, 0, :, :]), path_prefix + "_img.png", MIN, MAX)
# dump model prediction:
mask_to_disk(model_output[i, :, :, :].argmax(dim=0).numpy(), path_prefix + "_pred.png", num_classes)
# dump model confidence values
for nclass in range(num_classes):
image_to_disk(
model_output[i, nclass, :, :].numpy(), path_prefix + f"_class_{nclass}_conf.png",
model_output[i, nclass, :, :].numpy(), path_prefix + f"_class_{nclass}_conf.png", MIN, MAX
)
# crop the output_p in the middle
@ -249,46 +262,56 @@ def _patch_label_2d(
def _evaluate_split(
split, section_aug, model, pre_processing, output_processing, device, running_metrics_overall, config, debug=False,
split,
section_aug,
model,
pre_processing,
output_processing,
device,
running_metrics_overall,
config,
data_flow,
debug=False,
):
logger = logging.getLogger(__name__)
TestSectionLoader = get_test_loader(config)
test_set = TestSectionLoader(
config.DATASET.ROOT,
config.DATASET.NUM_CLASSES,
split=split,
is_transform=True,
augmentations=section_aug,
debug=debug,
)
test_set = TestSectionLoader(config, split=split, is_transform=True, augmentations=section_aug, debug=debug,)
n_classes = test_set.n_classes
if debug:
data_flow[split] = dict()
data_flow[split]["test_section_loader_length"] = len(test_set)
data_flow[split]["test_input_shape"] = test_set.seismic.shape
data_flow[split]["test_label_shape"] = test_set.labels.shape
data_flow[split]["n_classes"] = n_classes
test_loader = data.DataLoader(test_set, batch_size=1, num_workers=config.WORKERS, shuffle=False)
if debug:
data_flow[split]["test_loader_length"] = len(test_loader)
logger.info("Running in Debug/Test mode")
test_loader = take(2, test_loader)
take_n = 2
test_loader = take(take_n, test_loader)
data_flow[split]["take_n_sections"] = take_n
pred_list, gt_list, img_list = [], [], []
try:
output_dir = generate_path(
f"debug/{config.OUTPUT_DIR}_test_{split}", git_branch(), git_hash(), config.MODEL.NAME, current_datetime(),
f"{config.OUTPUT_DIR}/test/{split}", git_branch(), git_hash(), config.MODEL.NAME, current_datetime(),
)
except:
output_dir = generate_path(f"debug/{config.OUTPUT_DIR}_test_{split}", config.MODEL.NAME, current_datetime(),)
output_dir = generate_path(f"{config.OUTPUT_DIR}/test/{split}", config.MODEL.NAME, current_datetime(),)
running_metrics_split = runningScore(n_classes)
# evaluation mode:
with torch.no_grad(): # operations inside don't track history
model.eval()
total_iteration = 0
for i, (images, labels) in enumerate(test_loader):
logger.info(f"split: {split}, section: {i}")
total_iteration = total_iteration + 1
outputs = _patch_label_2d(
model,
images,
@ -301,10 +324,17 @@ def _evaluate_split(
n_classes,
split,
debug,
config.DATASET.MIN,
config.DATASET.MAX,
)
pred = outputs.detach().max(1)[1].numpy()
gt = labels.numpy()
if debug:
pred_list.append((pred.shape, len(np.unique(pred))))
gt_list.append((gt.shape, len(np.unique(gt))))
img_list.append(images.numpy().shape)
running_metrics_split.update(gt, pred)
running_metrics_overall.update(gt, pred)
@ -312,6 +342,11 @@ def _evaluate_split(
mask_to_disk(pred.squeeze(), os.path.join(output_dir, f"{i}_pred.png"), n_classes)
mask_to_disk(gt.squeeze(), os.path.join(output_dir, f"{i}_gt.png"), n_classes)
if debug:
data_flow[split]["pred_shape"] = pred_list
data_flow[split]["gt_shape"] = gt_list
data_flow[split]["img_shape"] = img_list
# get scores
score, class_iou = running_metrics_split.get_scores()
@ -362,12 +397,14 @@ def test(*options, cfg=None, debug=False):
load_log_configuration(config.LOG_CONFIG)
logger = logging.getLogger(__name__)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
log_dir, model_name = os.path.split(config.TEST.MODEL_PATH)
log_dir, _ = os.path.split(config.TEST.MODEL_PATH)
# load model:
model = getattr(models, config.MODEL.NAME).get_seg_model(config)
model.load_state_dict(torch.load(config.TEST.MODEL_PATH), strict=False)
model = model.to(device) # Send to GPU if available
trained_model = torch.load(config.TEST.MODEL_PATH)
trained_model = {k.replace("module.", ""): v for (k, v) in trained_model.items()}
model.load_state_dict(trained_model, strict=True)
model = model.to(device)
running_metrics_overall = runningScore(n_classes)
@ -395,6 +432,7 @@ def test(*options, cfg=None, debug=False):
output_processing = _output_processing_pipeline(config)
splits = ["test1", "test2"] if "Both" in config.TEST.SPLIT else [config.TEST.SPLIT]
data_flow = dict()
for sdx, split in enumerate(splits):
labels = np.load(path.join(config.DATASET.ROOT, "test_once", split + "_labels.npy"))
section_file = path.join(config.DATASET.ROOT, "splits", "section_" + split + ".txt")
@ -408,9 +446,17 @@ def test(*options, cfg=None, debug=False):
device,
running_metrics_overall,
config,
data_flow,
debug=debug,
)
if debug:
config_file_name = "default_config" if not cfg else cfg.split("/")[-1].split(".")[0]
fname = f"data_flow_test_{config_file_name}_{config.TRAIN.MODEL_DIR}.json"
with open(fname, "w") as f:
json.dump(data_flow, f, indent=1)
# FINAL TEST RESULTS:
score, class_iou = running_metrics_overall.get_scores()
@ -433,7 +479,6 @@ def test(*options, cfg=None, debug=False):
np.savetxt(path.join(log_dir, "confusion.csv"), confusion, delimiter=" ")
if debug:
config_file_name = "default_config" if not cfg else cfg.split("/")[-1].split(".")[0]
fname = f"metrics_test_{config_file_name}_{config.TRAIN.MODEL_DIR}.json"
with open(fname, "w") as fid:
json.dump(


@ -0,0 +1,4 @@
#!/bin/bash
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.
python test.py --cfg "configs/seresnet_unet.yaml"


@ -15,17 +15,20 @@ Time to run on single V100 for 300 epochs: 4.5 days
import json
import logging
import logging.config
import os
from os import path
import fire
import numpy as np
import torch
from torch.utils import data
from albumentations import Compose, HorizontalFlip, Normalize, PadIfNeeded, Resize
from ignite.contrib.handlers import CosineAnnealingScheduler
from ignite.contrib.handlers import ConcatScheduler, CosineAnnealingScheduler, LinearCyclicalScheduler
from ignite.engine import Events
from ignite.metrics import Loss
from ignite.utils import convert_tensor
from toolz import curry
from torch.utils import data
from cv_lib.event_handlers import SnapshotHandler, logging_handlers, tensorboard_handlers
from cv_lib.event_handlers.tensorboard_handlers import create_summary_writer
@ -33,7 +36,7 @@ from cv_lib.segmentation import extract_metric_from, models
from cv_lib.segmentation.dutchf3.engine import create_supervised_evaluator, create_supervised_trainer
from cv_lib.segmentation.dutchf3.utils import current_datetime, git_branch, git_hash
from cv_lib.segmentation.metrics import class_accuracy, class_iou, mean_class_accuracy, mean_iou, pixelwise_accuracy
from cv_lib.utils import load_log_configuration, generate_path
from cv_lib.utils import generate_path, load_log_configuration
from deepseismic_interpretation.dutchf3.data import get_patch_loader
from default import _C as config
from default import update_config
@ -47,7 +50,12 @@ def prepare_batch(batch, device=None, non_blocking=False):
)
def run(*options, cfg=None, debug=False):
@curry
def update_sampler_epoch(data_loader, engine):
data_loader.sampler.epoch = engine.state.epoch
def run(*options, cfg=None, local_rank=0, debug=False, input=None, distributed=False):
"""Run training and validation of model
Notes:
@ -62,30 +70,43 @@ def run(*options, cfg=None, debug=False):
default.py
cfg (str, optional): Location of config file to load. Defaults to None.
debug (bool): Places scripts in debug/test mode and only executes a few iterations
input (str, optional): Location of data if Azure ML run,
for local runs input is config.DATASET.ROOT
distributed (bool): This flag tells the training script to run in distributed mode
if more than one GPU exists.
"""
# Configuration:
update_config(config, options=options, config_file=cfg)
# The model will be saved under: outputs/<config_file_name>/<model_dir>
config_file_name = "default_config" if not cfg else cfg.split("/")[-1].split(".")[0]
try:
output_dir = generate_path(
config.OUTPUT_DIR, git_branch(), git_hash(), config_file_name, config.TRAIN.MODEL_DIR, current_datetime(),
)
except:
output_dir = generate_path(config.OUTPUT_DIR, config_file_name, config.TRAIN.MODEL_DIR, current_datetime(),)
# if AML training pipeline supplies us with input
if input is not None:
data_dir = input
output_dir = data_dir + config.OUTPUT_DIR
# Logging:
# Start logging
load_log_configuration(config.LOG_CONFIG)
logger = logging.getLogger(__name__)
logger.debug(config.WORKERS)
# Configuration:
update_config(config, options=options, config_file=cfg)
silence_other_ranks = True
world_size = int(os.environ.get("WORLD_SIZE", 1))
distributed = world_size > 1
if distributed:
# FOR DISTRIBUTED: Set the device according to local_rank.
torch.cuda.set_device(local_rank)
# FOR DISTRIBUTED: Initialize the backend. torch.distributed.launch will
# provide environment variables, and requires that you use init_method=`env://`.
torch.distributed.init_process_group(backend="nccl", init_method="env://")
logging.info(f"Started train.py using distributed mode.")
else:
logging.info(f"Started train.py using local mode.")
# Set CUDNN benchmark mode:
torch.backends.cudnn.benchmark = config.CUDNN.BENCHMARK
# We will write the model under outputs / config_file_name / model_dir
config_file_name = "default_config" if not cfg else cfg.split("/")[-1].split(".")[0]
# Fix random seeds:
torch.manual_seed(config.SEED)
if torch.cuda.is_available():
@ -125,41 +146,51 @@ def run(*options, cfg=None, debug=False):
# Training and Validation Loaders:
TrainPatchLoader = get_patch_loader(config)
logging.info(f"Using {TrainPatchLoader}")
train_set = TrainPatchLoader(
config.DATASET.ROOT,
config.DATASET.NUM_CLASSES,
split="train",
is_transform=True,
stride=config.TRAIN.STRIDE,
patch_size=config.TRAIN.PATCH_SIZE,
augmentations=train_aug,
debug=debug,
)
train_set = TrainPatchLoader(config, split="train", is_transform=True, augmentations=train_aug, debug=debug,)
logger.info(train_set)
n_classes = train_set.n_classes
val_set = TrainPatchLoader(
config.DATASET.ROOT,
config.DATASET.NUM_CLASSES,
split="val",
is_transform=True,
stride=config.TRAIN.STRIDE,
patch_size=config.TRAIN.PATCH_SIZE,
augmentations=val_aug,
debug=debug,
)
val_set = TrainPatchLoader(config, split="val", is_transform=True, augmentations=val_aug, debug=debug,)
logger.info(val_set)
if debug:
data_flow_dict = dict()
data_flow_dict["train_patch_loader_length"] = len(train_set)
data_flow_dict["validation_patch_loader_length"] = len(val_set)
data_flow_dict["train_input_shape"] = train_set.seismic.shape
data_flow_dict["train_label_shape"] = train_set.labels.shape
data_flow_dict["n_classes"] = n_classes
logger.info("Running in debug mode..")
train_set = data.Subset(train_set, range(config.TRAIN.BATCH_SIZE_PER_GPU * config.NUM_DEBUG_BATCHES))
val_set = data.Subset(val_set, range(config.VALIDATION.BATCH_SIZE_PER_GPU))
train_range = min(config.TRAIN.BATCH_SIZE_PER_GPU * config.NUM_DEBUG_BATCHES, len(train_set))
logging.info(f"train range in debug mode {train_range}")
train_set = data.Subset(train_set, range(train_range))
valid_range = min(config.VALIDATION.BATCH_SIZE_PER_GPU, len(val_set))
val_set = data.Subset(val_set, range(valid_range))
data_flow_dict["train_length_subset"] = len(train_set)
data_flow_dict["validation_length_subset"] = len(val_set)
train_sampler = torch.utils.data.distributed.DistributedSampler(train_set, num_replicas=world_size, rank=local_rank)
val_sampler = torch.utils.data.distributed.DistributedSampler(val_set, num_replicas=world_size, rank=local_rank)
train_loader = data.DataLoader(
train_set, batch_size=config.TRAIN.BATCH_SIZE_PER_GPU, num_workers=config.WORKERS, shuffle=True
train_set, batch_size=config.TRAIN.BATCH_SIZE_PER_GPU, num_workers=config.WORKERS, sampler=train_sampler,
)
val_loader = data.DataLoader(
val_set, batch_size=config.VALIDATION.BATCH_SIZE_PER_GPU, num_workers=1
) # config.WORKERS)
val_set, batch_size=config.VALIDATION.BATCH_SIZE_PER_GPU, num_workers=config.WORKERS, sampler=val_sampler
)
if debug:
data_flow_dict["train_loader_length"] = len(train_loader)
data_flow_dict["validation_loader_length"] = len(val_loader)
config_file_name = "default_config" if not cfg else cfg.split("/")[-1].split(".")[0]
fname = f"data_flow_train_{config_file_name}_{config.TRAIN.MODEL_DIR}.json"
with open(fname, "w") as f:
json.dump(data_flow_dict, f, indent=2)
# Model:
model = getattr(models, config.MODEL.NAME).get_seg_model(config)
@ -176,12 +207,26 @@ def run(*options, cfg=None, debug=False):
epochs_per_cycle = config.TRAIN.END_EPOCH // config.TRAIN.SNAPSHOTS
snapshot_duration = epochs_per_cycle * len(train_loader) if not debug else 2 * len(train_loader)
scheduler = CosineAnnealingScheduler(
optimizer, "lr", config.TRAIN.MAX_LR, config.TRAIN.MIN_LR, cycle_size=snapshot_duration
cosine_scheduler = CosineAnnealingScheduler(
optimizer,
"lr",
config.TRAIN.MAX_LR * world_size,
config.TRAIN.MIN_LR * world_size,
cycle_size=snapshot_duration,
)
# Tensorboard writer:
summary_writer = create_summary_writer(log_dir=path.join(output_dir, "logs"))
if distributed:
warmup_duration = 5 * len(train_loader)
warmup_scheduler = LinearCyclicalScheduler(
optimizer,
"lr",
start_value=config.TRAIN.MAX_LR,
end_value=config.TRAIN.MAX_LR * world_size,
cycle_size=10 * len(train_loader),
)
scheduler = ConcatScheduler(schedulers=[warmup_scheduler, cosine_scheduler], durations=[warmup_duration])
else:
scheduler = cosine_scheduler
# class weights are inversely proportional to the frequency of the classes in the training set
class_weights = torch.tensor(config.DATASET.CLASS_WEIGHTS, device=device, requires_grad=False)
@ -189,70 +234,97 @@ def run(*options, cfg=None, debug=False):
# Loss:
criterion = torch.nn.CrossEntropyLoss(weight=class_weights, ignore_index=255, reduction="mean")
# Model:
if distributed:
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[device], find_unused_parameters=True)
if silence_other_ranks & local_rank != 0:
logging.getLogger("ignite.engine.engine.Engine").setLevel(logging.WARNING)
# Ignite trainer and evaluator:
trainer = create_supervised_trainer(model, optimizer, criterion, prepare_batch, device=device)
trainer.add_event_handler(Events.ITERATION_STARTED, scheduler)
# Set to update the epoch parameter of our distributed data sampler so that we get
# different shuffles
trainer.add_event_handler(Events.EPOCH_STARTED, update_sampler_epoch(train_loader))
transform_fn = lambda output_dict: (output_dict["y_pred"].squeeze(), output_dict["mask"].squeeze())
evaluator = create_supervised_evaluator(
model,
prepare_batch,
metrics={
"nll": Loss(criterion, output_transform=transform_fn),
"nll": Loss(criterion, output_transform=transform_fn, device=device),
"pixacc": pixelwise_accuracy(n_classes, output_transform=transform_fn, device=device),
"cacc": class_accuracy(n_classes, output_transform=transform_fn),
"mca": mean_class_accuracy(n_classes, output_transform=transform_fn),
"ciou": class_iou(n_classes, output_transform=transform_fn),
"mIoU": mean_iou(n_classes, output_transform=transform_fn),
"cacc": class_accuracy(n_classes, output_transform=transform_fn, device=device),
"mca": mean_class_accuracy(n_classes, output_transform=transform_fn, device=device),
"ciou": class_iou(n_classes, output_transform=transform_fn, device=device),
"mIoU": mean_iou(n_classes, output_transform=transform_fn, device=device),
},
device=device,
)
trainer.add_event_handler(Events.ITERATION_STARTED, scheduler)
# Logging:
trainer.add_event_handler(
Events.ITERATION_COMPLETED, logging_handlers.log_training_output(log_interval=config.PRINT_FREQ),
)
trainer.add_event_handler(Events.EPOCH_COMPLETED, logging_handlers.log_lr(optimizer))
# The model will be saved under: outputs/<config_file_name>/<model_dir>
config_file_name = "default_config" if not cfg else cfg.split("/")[-1].split(".")[0]
try:
output_dir = generate_path(
config.OUTPUT_DIR, git_branch(), git_hash(), config_file_name, config.TRAIN.MODEL_DIR, current_datetime(),
)
except:
output_dir = generate_path(config.OUTPUT_DIR, config_file_name, config.TRAIN.MODEL_DIR, current_datetime(),)
# Tensorboard and Logging:
trainer.add_event_handler(Events.ITERATION_COMPLETED, tensorboard_handlers.log_training_output(summary_writer))
trainer.add_event_handler(Events.ITERATION_COMPLETED, tensorboard_handlers.log_validation_output(summary_writer))
if local_rank == 0: # Run only on master process
# Logging:
trainer.add_event_handler(
Events.ITERATION_COMPLETED, logging_handlers.log_training_output(log_interval=config.PRINT_FREQ),
)
trainer.add_event_handler(Events.EPOCH_STARTED, logging_handlers.log_lr(optimizer))
# Checkpointing: snapshotting trained models to disk
checkpoint_handler = SnapshotHandler(
output_dir,
config.MODEL.NAME,
extract_metric_from("mIoU"),
lambda: (trainer.state.iteration % snapshot_duration) == 0,
)
evaluator.add_event_handler(Events.EPOCH_COMPLETED, checkpoint_handler, {"model": model})
# Tensorboard and Logging:
summary_writer = create_summary_writer(log_dir=path.join(output_dir, "logs"))
trainer.add_event_handler(Events.EPOCH_STARTED, tensorboard_handlers.log_lr(summary_writer, optimizer, "epoch"))
trainer.add_event_handler(Events.ITERATION_COMPLETED, tensorboard_handlers.log_training_output(summary_writer))
trainer.add_event_handler(
Events.ITERATION_COMPLETED, tensorboard_handlers.log_validation_output(summary_writer)
)
# add specific logger which also triggers printed metrics on training set
@trainer.on(Events.EPOCH_COMPLETED)
def log_training_results(engine):
evaluator.run(train_loader)
tensorboard_handlers.log_results(engine, evaluator, summary_writer, n_classes, stage="Training")
logging_handlers.log_metrics(engine, evaluator, stage="Training")
if local_rank == 0: # Run only on master process
tensorboard_handlers.log_results(engine, evaluator, summary_writer, n_classes, stage="Training")
logging_handlers.log_metrics(engine, evaluator, stage="Training")
logger.info("Logging training results..")
# add specific logger which also triggers printed metrics on validation set
@trainer.on(Events.EPOCH_COMPLETED)
def log_validation_results(engine):
evaluator.run(val_loader)
tensorboard_handlers.log_results(engine, evaluator, summary_writer, n_classes, stage="Validation")
logging_handlers.log_metrics(engine, evaluator, stage="Validation")
# dump validation set metrics at the very end for debugging purposes
if engine.state.epoch == config.TRAIN.END_EPOCH and debug:
fname = f"metrics_{config_file_name}_{config.TRAIN.MODEL_DIR}.json"
metrics = evaluator.state.metrics
out_dict = {x: metrics[x] for x in ["nll", "pixacc", "mca", "mIoU"]}
with open(fname, "w") as fid:
json.dump(out_dict, fid)
log_msg = " ".join(f"{k}: {out_dict[k]}" for k in out_dict.keys())
logging.info(log_msg)
# Checkpointing: snapshotting trained models to disk
checkpoint_handler = SnapshotHandler(
output_dir,
config.MODEL.NAME,
extract_metric_from("mIoU"),
lambda: (trainer.state.iteration % snapshot_duration) == 0,
)
evaluator.add_event_handler(Events.EPOCH_COMPLETED, checkpoint_handler, {"model": model})
if local_rank == 0: # Run only on master process
tensorboard_handlers.log_results(engine, evaluator, summary_writer, n_classes, stage="Validation")
logging_handlers.log_metrics(engine, evaluator, stage="Validation")
logger.info("Logging validation results..")
# dump validation set metrics at the very end for debugging purposes
if engine.state.epoch == config.TRAIN.END_EPOCH and debug:
fname = f"metrics_{config_file_name}_{config.TRAIN.MODEL_DIR}.json"
metrics = evaluator.state.metrics
out_dict = {x: metrics[x] for x in ["nll", "pixacc", "mca", "mIoU"]}
with open(fname, "w") as fid:
json.dump(out_dict, fid)
log_msg = " ".join(f"{k}: {out_dict[k]}" for k in out_dict.keys())
logging.info(log_msg)
logger.info("Starting training")
trainer.run(train_loader, max_epochs=config.TRAIN.END_EPOCH, epoch_length=len(train_loader), seed=config.SEED)
summary_writer.close()
if local_rank == 0:
summary_writer.close()
if __name__ == "__main__":


@ -0,0 +1,4 @@
#!/bin/bash
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.
nohup python train.py --cfg "configs/seresnet_unet.yaml" > train.log 2>&1


@ -0,0 +1,10 @@
#!/bin/bash
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.
NGPUS=$(nvidia-smi -L | wc -l)
if [ "$NGPUS" -lt "2" ]; then
echo "ERROR: cannot run distributed training without 2 or more GPUs."
exit 1
fi
nohup python -m torch.distributed.launch --nproc_per_node=$NGPUS train.py \
--distributed --cfg "configs/seresnet_unet.yaml" > train_distributed.log 2>&1 &


@ -0,0 +1,192 @@
# Integrating with AzureML
## AzureML Pipeline Background
Azure Machine Learning is a cloud-based environment you can use to train, deploy, automate, manage, and track ML models.
An Azure Machine Learning pipeline is an independently executable workflow of a complete machine learning task. Subtasks are encapsulated as a series of steps within the pipeline. An Azure Machine Learning pipeline can be as simple as one that calls a Python script, so may do just about anything. Pipelines should focus on machine learning tasks such as:
- Data preparation including importing, validating and cleaning, munging and transformation, normalization, and staging
- Training configuration including parameterizing arguments, filepaths, and logging / reporting configurations
- Training and validating efficiently and repeatedly. Efficiency might come from specifying specific data subsets, different hardware compute resources, distributed processing, and progress monitoring
- Deployment, including versioning, scaling, provisioning, and access control
An Azure ML pipeline performs a complete logical workflow with an ordered sequence of steps. Each step is a discrete processing action. Pipelines run in the context of an Azure Machine Learning Experiment.
In the early stages of an ML project, it's fine to have a single Jupyter notebook or Python script that does all the work of Azure workspace and resource configuration, data preparation, run configuration, training, and validation. But just as functions and classes quickly become preferable to a single imperative block of code, ML workflows quickly become preferable to a monolithic notebook or script.
By modularizing ML tasks, pipelines support the Computer Science imperative that a component should "do (only) one thing well." Modularity is clearly vital to project success when programming in teams, but even when working alone, even a small ML project involves separate tasks, each with a good amount of complexity. Tasks include: workspace configuration and data access, data preparation, model definition and configuration, and deployment. While the outputs of one or more tasks form the inputs to another, the exact implementation details of any one task are, at best, irrelevant distractions in the next. At worst, the computational state of one task can cause a bug in another.
There are many ways to leverage AzureML. Currently, DeepSeismic integrates with AzureML through a training pipeline, which creates an experiment titled "DEV-train-pipeline" containing all training runs, associated logs, and the ability to navigate seamlessly through this information. AzureML will take data from a blob storage account, and the associated models will be saved to this account upon completion of the run.
Please refer to the Microsoft docs for additional information on AzureML pipelines and related capabilities: ['What are Azure Machine Learning pipelines?'](https://docs.microsoft.com/en-us/azure/machine-learning/concept-ml-pipelines)
## Files needed for this AzureML run
You will need the following files to complete a run in AzureML:
- [.azureml/config.json](../../../.azureml.example/config.json) This is used to import your subscription, resource group, and AzureML workspace
- [.env](../../../.env.example) This is used to import your environment variables including blob storage information and AzureML compute cluster specs
- [kickoff_train_pipeline.py](dev/kickoff_train_pipeline.py) This script shows how to run an AzureML train pipeline
- [cancel_run.py](dev/cancel_run.py) This script is used to cancel an AzureML train pipeline run
- [base_pipeline.py](base_pipeline.py) This script is used as a base class and train_pipeline.py inherits from it. This is intended to be a helpful abstraction that a future addition of an inference pipeline can leverage
- [train_pipeline.py](train_pipeline.py) This script inherits from base_pipeline.py and is used to construct the pipeline and its steps. The script kickoff_train_pipeline.py will call the function defined here along with the pipeline_config
- [pipeline_config.json](pipeline_config.json) This pipeline configuration specifies the steps of the pipeline, the location of the data, and any specific arguments. This is consumed once kickoff_train_pipeline.py is run
- [train.py](../../../experiments/interpretation/dutchf3_patch/train.py) This is the training script that is used to train the model
- [unet.yaml](../../../experiments/interpretation/dutchf3_patch/configs/unet.yaml) This config specifies the model configuration to be used in train.py and is referenced in the pipeline_config.json
- [azureml_requirements.txt](../../../experiments/interpretation/dutchf3_patch/azureml_requirements.txt) This file holds all dependencies for train.py so they can be installed on the compute in Azure ML
- [logging.config](../../../experiments/interpretation/dutchf3_patch/logging.config) This logging config is used to set up logging
- local environment with cv_lib and interpretation set up using guidance [here](../../../README.md)
## Running a Pipeline in AzureML
Go into the [Azure Portal](https://portal.azure.com) and create a blob storage account. Once you have created a [blob storage](https://azure.microsoft.com/en-us/services/storage/blobs/) account, you may use [Azure Storage Explorer](https://docs.microsoft.com/en-us/azure/vs-azure-tools-storage-manage-with-storage-explorer?tabs=windows) to manage your blob instance. You can either manually upload data through Azure Storage Explorer, or you can use [AzCopy](https://docs.microsoft.com/en-us/azure/storage/common/storage-use-azcopy-v10) to migrate the data to your blob storage. Once your blob storage is set up and the data migrated, you may begin to fill in the environment variables below. There is an example [.env file](../../../.env.example) that you may leverage. More information on how to activate these environment variables is below.
With your run you will need to specify the compute below. Once you populate these variables, AzureML will use run-based compute creation, which means that the compute will be created by AzureML at run time specifically for your run. The compute is deleted automatically once the run completes. With AzureML you also have the option of creating and attaching your own compute. For more information on run-based compute creation and persistent compute, please refer to the [Azure Machine Learning Compute](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-set-up-training-targets) section in the Microsoft docs.
`AML_COMPUTE_CLUSTER_SKU` refers to the VM family of the nodes created by Azure Machine Learning Compute. If not specified, it defaults to Standard_NC6. For compute options see [HardwareProfile object values](https://docs.microsoft.com/en-us/azure/templates/Microsoft.Compute/2019-07-01/virtualMachines?toc=%2Fen-us%2Fazure%2Fazure-resource-manager%2Ftoc.json&bc=%2Fen-us%2Fazure%2Fbread%2Ftoc.json#hardwareprofile-object).
`AML_COMPUTE_CLUSTER_MAX_NODES` refers to the max number of nodes to autoscale up to when you run a job on Azure Machine Learning Compute. This is not the max number of nodes for multi-node training; instead, it is the number of nodes available to process single-node jobs.
If you would like additional information regarding the AzureML compute provisioning class, please refer to the Microsoft docs on the [AzureML compute provisioning class](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.compute.amlcompute.amlcomputeprovisioningconfiguration?view=azure-ml-py).
Set the following environment variables:
```
BLOB_ACCOUNT_NAME
BLOB_CONTAINER_NAME
BLOB_ACCOUNT_KEY
BLOB_SUB_ID
AML_COMPUTE_CLUSTER_NAME
AML_COMPUTE_CLUSTER_MIN_NODES
AML_COMPUTE_CLUSTER_MAX_NODES
AML_COMPUTE_CLUSTER_SKU
```
On Linux:
`export VARIABLE=value`
Our code can pick up the environment variables from the .env file; alternatively, you can `source .env` to activate these variables in your environment. An example .env file is found at the ROOT of this repo [here](../../../.env.example). You can rename this to .env and use it as your .env file, but be sure to add it to your .gitignore to ensure you do not commit any secrets.
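For example, here is a minimal Python sketch of picking these variables up from a `.env` file using the `python-dotenv` package (whether the repo uses exactly this mechanism is an assumption; the fallback SKU mirrors the Standard_NC6 default mentioned above):
```python
import os
from dotenv import load_dotenv  # pip install python-dotenv

# load KEY=value pairs from a .env file in the current directory into os.environ
load_dotenv()

blob_account = os.environ["BLOB_ACCOUNT_NAME"]
cluster_sku = os.environ.get("AML_COMPUTE_CLUSTER_SKU", "Standard_NC6")
print(f"Using blob account '{blob_account}' and cluster SKU '{cluster_sku}'")
```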
You will be able to download a config.json that will already have your subscription id, resource group, and workspace name directly in the [Azure Portal](https://portal.azure.com). You will want to navigate to your AzureML workspace and then you can click the `Download config.json` option towards the top left of the browser. Once you do this you can rename the .azureml.example folder to .azureml and replace the config.json with your downloaded config.json. If you would prefer to migrate the information manually refer to the guidance below.
Create a .azureml/config.json file in the project's root directory that looks like so:
```json
{
"subscription_id": "<subscription id>",
"resource_group": "<resource group>",
"workspace_name": "<workspace name>"
}
```
At the ROOT of this repo you will find an example [here](../../../.azureml.example/config.json). Please rename the file to .azureml/config.json, enter your account information, and add it to your .gitignore.
## Training Pipeline
Here's an example of a possible pipeline configuration file:
```json
{
"step1":
{
"type": "MpiStep",
"name": "train step",
"script": "train.py",
"input_datareference_path": "normalized_data/",
"input_datareference_name": "normalized_data_conditioned",
"input_dataset_name": "normalizeddataconditioned",
"source_directory": "train/",
"arguments": ["--splits", "splits",
"--train_data_paths", "normalized_data/file.npy",
"--label_paths", "label.npy"],
"requirements": "train/requirements.txt",
"node_count": 1,
"processes_per_node": 1,
"base_image": "pytorch/pytorch"
}
}
```
If you want to create a train pipeline, make sure that:
1) All of your steps are isolated
- Your scripts will need to conform to the interface you define in the pipeline configuration file, i.e. if step1 is expected to output X and step2 is expecting X as an input, your scripts need to reflect that
- If one of your steps has pip package dependencies, make sure they are specified in a requirements.txt file
- If your script has local dependencies (i.e. it imports from another script), make sure that all dependencies fall underneath the source_directory
2) You have configured your pipeline configuration file to specify the steps needed (see the "Configuring a Pipeline" section below for guidance)
Note: the following arguments are automatically added to any script steps by AzureML:
```--input_data``` and ```--output``` (if output is specified in the pipeline_config.json)
Make sure to add these arguments in your scripts like so:
```python
parser.add_argument('--input_data', type=str, help='path to preprocessed data')
parser.add_argument('--output', type=str, help='output from training')
```
```input_data``` is the absolute path to the input_datareference_path on the blob you specified.
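As a rough sketch of how a step script might consume these arguments (the argument names match the snippet above; the file name file.npy is taken from the example configuration and is illustrative only):
```python
import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument('--input_data', type=str, help='path to preprocessed data')
parser.add_argument('--output', type=str, help='output from training')
args, _ = parser.parse_known_args()   # tolerate any extra arguments AzureML appends

# read inputs from the mounted data reference
train_data = os.path.join(args.input_data, 'file.npy')

# write artifacts under the AzureML-provided output location, if one was configured
if args.output is not None:
    os.makedirs(args.output, exist_ok=True)
```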
# Configuring a Pipeline
## Train Pipeline
Define parameters for the run in a pipeline configuration file. See an example in this repo [here](pipeline_config.json). For additional guidance on [pipeline steps](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-create-your-first-pipeline#steps) please refer to Microsoft docs.
```json
{
"step1":
{
"type": "<type of step. Supported types include PythonScriptStep and MpiStep>",
"name": "<name in AzureML for this step>",
"script": "<path to script for this step>",
"output": "<name of the output in AzureML for this step - optional>",
"input_datareference_path": "<path on the data reference for the input data - optional>",
"input_datareference_name": "<name of the data reference in AzureML where the input data lives - optional>",
"input_dataset_name": "<name of the datastore in AzureML - optional>",
"source_directory": "<source directory containing the files for this step>",
"arguments": "<arguments to pass to the script - optional>",
"requirements": "<path to the requirements.txt file for the step - optional>",
"node_count": "<number of nodes to run the script on - optional>",
"processes_per_node": "<number of processes to run on each node - optional>",
"base_image": "<name of an image registered on dockerhub that you want to use as your base image"
},
"step2":
{
.
.
.
}
}
```
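As an illustration of how steps can be chained (all names and paths below are hypothetical): if a step defines an `output` and the following step omits `input_datareference_name` and `input_datareference_path`, the pipeline wires the previous step's output in as the next step's input.
```json
{
    "step1":
    {
        "type": "PythonScriptStep",
        "name": "preprocess step",
        "script": "preprocess.py",
        "output": "preprocessed_data",
        "input_datareference_path": "raw_data/",
        "input_datareference_name": "raw_data",
        "input_dataset_name": "rawdatastore",
        "source_directory": "preprocess/",
        "requirements": "preprocess/requirements.txt"
    },
    "step2":
    {
        "type": "MpiStep",
        "name": "train step",
        "script": "train.py",
        "source_directory": "train/",
        "requirements": "train/requirements.txt",
        "node_count": 1,
        "processes_per_node": 1
    }
}
```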
## Kicking off a Pipeline
In order to kick off a pipeline, you will need to use the Azure CLI to log in to the subscription where your workspace resides. Once you successfully log in, all of the subscriptions you have access to are printed out. You can either get your subscription id this way, or you can go directly to the Azure portal, navigate to your subscriptions, and locate the right subscription id to pass into `az account set -s`:
```bash
az login
az account set -s <subscription id>
```
Kick off the training pipeline defined in your config from your Python environment of choice. First activate a local environment that has cv_lib and interpretation set up, using the guidance [here](../../../README.md). You will run the kickoff for the training pipeline from the ROOT directory. The code will look like this:
```python
from src.azml.train_pipeline.train_pipeline import TrainPipeline
orchestrator = TrainPipeline("<path to your pipeline configuration file>")
orchestrator.construct_pipeline()
run = orchestrator.run_pipeline(experiment_name="DEV-train-pipeline")
```
See an example in [dev/kickoff_train_pipeline.py](dev/kickoff_train_pipeline.py).
If you run into a subscription access error, you may find a workaround in the [Troubleshooting](#troubleshooting) section.
## Cancelling a Pipeline Run
If you kicked off a pipeline and want to cancel it, run the [cancel_run.py](dev/cancel_run.py) script with the corresponding run_id and step_id. Both ids are printed when the pipeline run is kicked off, and you can also find them when viewing your run in the [Azure Portal](https://portal.azure.com/). If you prefer, you can cancel your run directly in the portal instead.
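For example, assuming you run it from the directory containing this README, cancellation from the command line might look like this (both ids are placeholders):
```bash
python dev/cancel_run.py --run_id <run id> --step_id <step id>
```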
## Troubleshooting
If you run into issues gaining access to the Azure ML subscription, you may be able to connect by using a workaround:
Go to [base_pipeline.py](../base_pipeline.py) and add the following import:
```python
from azureml.core.authentication import AzureCliAuthentication
```
Then find the code where we connect to the workspace which looks like this:
```python
self.ws = Workspace.from_config(path=ws_config)
```
and replace it with this:
```python
cli_auth = AzureCliAuthentication()
self.ws = Workspace(subscription_id="<subscription id>", resource_group="<resource group>", workspace_name="<workspace name>", auth=cli_auth)
```
To get this to run, you will also need to `pip install azure-cli-core`.
Then you can go back and follow the instructions above, including `az login` and setting the subscription, and kick off the pipeline.


@ -0,0 +1,2 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.


@ -0,0 +1,388 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.
"""
base class for constructing and running an azureml pipeline and some of the
accompanying resources.
"""
from azureml.core import Datastore, Workspace, RunConfiguration
from azureml.core.model import Model
from azureml.core.compute import AmlCompute, ComputeTarget
from azureml.core.dataset import Dataset
from azureml.core.experiment import Experiment
from azureml.pipeline.steps import PythonScriptStep, MpiStep
from azureml.pipeline.core import Pipeline, PipelineData, StepSequence
from azureml.contrib.pipeline.steps import ParallelRunStep, ParallelRunConfig
from azureml.core.runconfig import DEFAULT_GPU_IMAGE
from azureml.core.conda_dependencies import CondaDependencies
from msrest.exceptions import HttpOperationError
from azureml.data.data_reference import DataReference
from azureml.core import Environment
from dotenv import load_dotenv
import os
import re
from abc import ABC, abstractmethod
import json
class DeepSeismicAzMLPipeline(ABC):
"""
Abstract base class for pipelines in AzureML
"""
def __init__(self, pipeline_config, ws_config=None):
"""
constructor for DeepSeismicAzMLPipeline class
:param str pipeline_config: [required] path to the pipeline config file
:param str ws_config: [optional] if not specified, will look for
.azureml/config.json. If you have multiple config files, you
can specify which workspace you want to use by passing the
relative path to the config file in this constructor.
"""
self.ws = Workspace.from_config(path=ws_config)
self._load_environment()
self._load_config(pipeline_config)
self.steps = []
self.pipeline_tags = None
self.last_output_data = None
def _load_config(self, config_path):
"""
helper function for loading in pipeline config file.
:param str config_path: path to the pipeline config file
"""
try:
with open(config_path, "r") as f:
self.config = json.load(f)
except Exception as e:
raise Exception("Was unable to load pipeline config file. {}".format(e))
@abstractmethod
def construct_pipeline(self):
"""
abstract method for constructing a pipeline. Must be implemented by classes
that inherit from this base class.
"""
raise NotImplementedError("construct_pipeline is not implemented")
@abstractmethod
def _setup_steps(self):
"""
abstract method for setting up pipeline steps. Must be implemented by classes
that inherit from this base class.
"""
raise NotImplementedError("setup_steps is not implemented")
def _load_environment(self):
"""
loads environment variables needed for the pipeline.
"""
load_dotenv()
self.account_name = os.getenv("BLOB_ACCOUNT_NAME")
self.container_name = os.getenv("BLOB_CONTAINER_NAME")
self.account_key = os.getenv("BLOB_ACCOUNT_KEY")
self.blob_sub_id = os.getenv("BLOB_SUB_ID")
self.comp_name = os.environ.get("AML_COMPUTE_CLUSTER_NAME")
self.comp_min_nodes = os.environ.get("AML_COMPUTE_CLUSTER_MIN_NODES")
self.comp_max_nodes = os.environ.get("AML_COMPUTE_CLUSTER_MAX_NODES")
self.comp_vm_size = os.environ.get("AML_COMPUTE_CLUSTER_SKU")
def _setup_model(self, model_name, model_path=None):
"""
sets up the model in azureml. Either retrieves an already registered model
or registers a local model.
:param str model_name: [required] name of the model that you want to retrieve
from the workspace or the name you want to give the local
model when you register it.
:param str model_path: [optional] If you do not have a model registered, pass
the relative path to the model locally and it will be
registered.
"""
self.model = None
models = Model.list(self.ws, name=model_name)
for model in models:
if model.name == model_name:
self.model = model
print("Found model: " + self.model.name)
break
if model_path is not None:
self.model = Model.register(model_path=model_path, model_name=model_name, workspace=self.ws)
if self.model is None:
raise Exception(
"""no model was found or registered. Ensure that you
have a model registered in this workspace or that
you passed the path of a local model"""
)
def _setup_datastore(self, blob_dataset_name, output_path=None):
"""
sets up the datastore in azureml. Either retrieves a pre-existing datastore
or registers a new one in the workspace.
:param str blob_dataset_name: [required] name of the datastore registered with the
workspace. If the datastore does not yet exist, the
name it will be registered under.
:param str output_path: [optional] if registering a datastore for inferencing,
the output path for writing back predictions.
"""
try:
self.blob_ds = Datastore.get(self.ws, blob_dataset_name)
print("Found Blob Datastore with name: %s" % blob_dataset_name)
except HttpOperationError:
self.blob_ds = Datastore.register_azure_blob_container(
workspace=self.ws,
datastore_name=blob_dataset_name,
account_name=self.account_name,
container_name=self.container_name,
account_key=self.account_key,
subscription_id=self.blob_sub_id,
)
print("Registered blob datastore with name: %s" % blob_dataset_name)
if output_path is not None:
self.output_dir = PipelineData(
name="output", datastore=self.ws.get_default_datastore(), output_path_on_compute=output_path
)
def _setup_dataset(self, ds_name, data_paths):
"""
registers datasets with azureml workspace
:param str ds_name: [required] name to give the dataset in azureml.
:param str data_paths: [required] list of paths to your data on the datastore.
"""
self.named_ds = []
count = 1
for data_path in data_paths:
curr_name = ds_name + str(count)
path_on_datastore = self.blob_ds.path(data_path)
input_ds = Dataset.File.from_files(path=path_on_datastore, validate=False)
try:
registered_ds = input_ds.register(workspace=self.ws, name=curr_name, create_new_version=True)
except Exception as e:
n, v = self._parse_exception(e)
registered_ds = Dataset.get_by_name(self.ws, name=n, version=v)
self.named_ds.append(registered_ds.as_named_input(curr_name))
count = count + 1
def _setup_datareference(self, name, path):
"""
helper function to setup a datareference object in AzureML.
:param str name: [required] name of the data reference
:param str path: [required] path on the datastore where the data lives.
:returns: input_data
:rtype: DataReference
"""
input_data = DataReference(datastore=self.blob_ds, data_reference_name=name, path_on_datastore=path)
return input_data
def _setup_pipelinedata(self, name, output_path=None):
"""
helper function to setup a PipelineData object in AzureML
:param str name: [required] name of the data object in AzureML
:param str output_path: path on output datastore to write data to
:returns: output_data
:rtype: PipelineData
"""
if output_path is not None:
output_data = PipelineData(
name=name,
datastore=self.blob_ds,
output_name=name,
output_mode="mount",
output_path_on_compute=output_path,
is_directory=True,
)
else:
output_data = PipelineData(name=name, datastore=self.ws.get_default_datastore(), output_name=name)
return output_data
def _setup_compute(self):
"""
sets up the compute in the azureml workspace. Either retrieves a
pre-existing compute target or creates one (uses environment variables).
:returns: compute_target
:rtype: ComputeTarget
"""
if self.comp_name in self.ws.compute_targets:
self.compute_target = self.ws.compute_targets[self.comp_name]
if self.compute_target and type(self.compute_target) is AmlCompute:
print("Found compute target: " + self.comp_name)
else:
print("creating a new compute target...")
p_cfg = AmlCompute.provisioning_configuration(
vm_size=self.comp_vm_size, min_nodes=self.comp_min_nodes, max_nodes=self.comp_max_nodes
)
self.compute_target = ComputeTarget.create(self.ws, self.comp_name, p_cfg)
self.compute_target.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)
print(self.compute_target.get_status().serialize())
return self.compute_target
def _get_conda_deps(self, step):
"""
converts requirements.txt from user into conda dependencies for AzML
:param dict step: step defined by user that we are currently building
:returns: conda_dependencies
:rtype: CondaDependencies
"""
with open(step["requirements"], "r") as f:
packages = [line.strip() for line in f]
return CondaDependencies.create(pip_packages=packages)
def _setup_env(self, step):
"""
sets up AzML env given requirements defined by the user
:param dict step: step defined by user that we are currently building
:returns: env
:rtype: Environment
"""
conda_deps = self._get_conda_deps(step)
env = Environment(name=step["name"] + "_environment")
env.docker.enabled = True
env.docker.base_image = DEFAULT_GPU_IMAGE
env.spark.precache_packages = False
env.python.conda_dependencies = conda_deps
env.python.conda_dependencies.add_conda_package("pip==20.0.2")
return env
def _generate_run_config(self, step):
"""
generates an AzML run config if the user gives specifics about requirements
:param dict step: step defined by user that we are currently building
:returns: run_config
:rtype: RunConfiguration
"""
try:
conda_deps = self._get_conda_deps(step)
conda_deps.add_conda_package("pip==20.0.2")
return RunConfiguration(script=step["script"], conda_dependencies=conda_deps)
except KeyError:
return None
def _generate_parallel_run_config(self, step):
"""
generates an AzML parallel run config if the user gives specifics about requirements
:param dict step: step defined by user that we are currently building
:returns: parallel_run_config
:rtype: ParallelRunConfig
"""
return ParallelRunConfig(
source_directory=step["source_directory"],
entry_script=step["script"],
mini_batch_size=str(step["mini_batch_size"]),
error_threshold=10,
output_action="summary_only",
environment=self._setup_env(step),
compute_target=self.compute_target,
node_count=step.get("node_count", 1),
process_count_per_node=step.get("processes_per_node", 1),
run_invocation_timeout=60,
)
def _create_pipeline_step(self, step, arguments, input_data, output=None, run_config=None):
"""
function to create an AzureML pipeline step and append it to the list of
steps that will make up the pipeline.
:param dict step: [required] dictionary containing the config parameters for this step.
:param list arguments: [required] list of arguments to be passed to the step.
:param DataReference input_data: [required] the input_data in AzureML for this step.
:param DataReference output: [optional] output location in AzureML
:param ParallelRunConfig run_config: [optional] the run configuration for a MpiStep
"""
if step["type"] == "PythonScriptStep":
run_config = self._generate_run_config(step)
pipeline_step = PythonScriptStep(
script_name=step["script"],
arguments=arguments,
inputs=[input_data],
outputs=output,
name=step["name"],
compute_target=self.compute_target,
source_directory=step["source_directory"],
allow_reuse=True,
runconfig=run_config,
)
elif step["type"] == "MpiStep":
pipeline_step = MpiStep(
name=step["name"],
source_directory=step["source_directory"],
arguments=arguments,
inputs=[input_data],
node_count=step.get("node_count", 1),
process_count_per_node=step.get("processes_per_node", 1),
compute_target=self.compute_target,
script_name=step["script"],
environment_definition=self._setup_env(step),
)
elif step["type"] == "ParallelRunStep":
run_config = self._generate_parallel_run_config(step)
pipeline_step = ParallelRunStep(
name=step["name"],
models=[self.model],
parallel_run_config=run_config,
inputs=input_data,
output=output,
arguments=arguments,
allow_reuse=False,
)
else:
raise Exception("Pipeline step type {} not supported".format(step["type"]))
self.steps.append(pipeline_step)
def run_pipeline(self, experiment_name, tags=None):
"""
submits the pipeline as an experiment run
:param str experiment_name: [required] name of the experiment in azureml
:param dict tags: [optional] dictionary of tags
:returns: run
:rtype: Run
"""
if tags is None:
tags = self.pipeline_tags
step_sequence = StepSequence(steps=self.steps)
pipeline = Pipeline(workspace=self.ws, steps=step_sequence)
run = Experiment(self.ws, experiment_name).submit(pipeline, tags=tags, continue_on_step_failure=False)
return run
def _parse_exception(self, e):
"""
helper function to parse exception thrown by azureml
:param Exception e: [required] the exception to be parsed
:returns: name, version
:rtype: str, str
"""
s = str(e)
result = re.search('name="(.*)"', s)
name = result.group(1)
version = s[s.find("version=") + 8 : s.find(")")]
return name, version


@ -0,0 +1,21 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.
"""
Cancel pipeline run
"""
from azureml.core.run import Run
from azureml.core import Workspace, Experiment
import argparse
parser = argparse.ArgumentParser()
parser.add_argument("--run_id", type=str, help="run id value", required=True)
parser.add_argument("--step_id", type=str, help="step id value", required=True)
args = parser.parse_args()
ws = Workspace.from_config()
experiment = Experiment(workspace=ws, name="DEV-train-pipeline", _id=args.run_id)
fetched_run = Run(experiment=experiment, run_id=args.step_id)
fetched_run.cancel()


@ -0,0 +1,32 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.
"""
Create pipeline and kickoff run
"""
from deepseismic_interpretation.azureml_pipelines.train_pipeline import TrainPipeline
import fire
def kickoff_pipeline(
experiment="DEV-train-pipeline",
orchestrator_config="interpretation/deepseismic_interpretation/azureml_pipelines/pipeline_config.json",
):
"""Kicks off pipeline run
Args:
experiment (str): name of experiment
orchestrator_config (str): path to pipeline configuration
"""
orchestrator = TrainPipeline(orchestrator_config)
orchestrator.construct_pipeline()
run = orchestrator.run_pipeline(experiment_name=experiment)
if __name__ == "__main__":
"""Example:
python interpretation/deepseismic_interpretation/azureml_pipelines/dev/kickoff_train_pipeline.py --experiment=DEV-train-pipeline-name --orchestrator_config="interpretation/deepseismic_interpretation/azureml_pipelines/pipeline_config.json"
or
python interpretation/deepseismic_interpretation/azureml_pipelines/dev/kickoff_train_pipeline.py
"""
fire.Fire(kickoff_pipeline)


@ -0,0 +1,25 @@
{
"step1": {
"type": "MpiStep",
"name": "train step",
"script": "train.py",
"input_datareference_path": "data/",
"input_datareference_name": "ds_test",
"input_dataset_name": "deepseismic_test_dataset",
"source_directory": "experiments/interpretation/dutchf3_patch",
"arguments": [
"--cfg",
"configs/unet.yaml",
"TRAIN.END_EPOCH",
"1",
"TRAIN.SNAPSHOTS",
"1",
"DATASET.ROOT",
"data"
],
"requirements": "experiments/interpretation/dutchf3_patch/azureml_requirements.txt",
"node_count": 1,
"processes_per_node": 1,
"base_image": "pytorch/pytorch"
}
}


@ -0,0 +1,55 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.
"""
TrainPipeline class for setting up a training pipeline in AzureML.
Inherits from DeepSeismicAzMLPipeline
"""
from deepseismic_interpretation.azureml_pipelines.base_pipeline import DeepSeismicAzMLPipeline
class TrainPipeline(DeepSeismicAzMLPipeline):
def construct_pipeline(self):
"""
implemented function from ABC. Sets up the pre-requisites for a pipeline.
"""
self._setup_compute()
self._setup_datastore(blob_dataset_name=self.config["step1"]["input_dataset_name"])
self._setup_steps()
def _setup_steps(self):
"""
iterates over all the steps in the config file and sets each one up along
with its accompanying objects.
"""
for _, step in self.config.items():
try:
input_data = self._setup_datareference(
name=step["input_datareference_name"], path=step["input_datareference_path"]
)
except KeyError:
# grab the last step's output as input for this step
if self.last_output_data is None:
raise KeyError(
"input_datareference_name and input_datareference_path can only be"
"omitted if there is a previous step in the pipeline"
)
else:
input_data = self.last_output_data
try:
self.last_output_data = self._setup_pipelinedata(
name=step["output"], output_path=step.get("output_path", None)
)
except KeyError:
self.last_output_data = None
script_params = step["arguments"] + ["--input", input_data]
if self.last_output_data is not None:
script_params = script_params + ["--output", self.last_output_data]
self._create_pipeline_step(
step=step, arguments=script_params, input_data=input_data, output=self.last_output_data
)


@ -0,0 +1,2 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.


@ -117,20 +117,21 @@ def read_labels(fname, data_info):
class SectionLoader(data.Dataset):
"""
Base class for section data loader
:param str data_dir: Root directory for training/test data
:param str n_classes: number of segmentation mask classes
:param config: configuration object to define other attributes in loaders
:param str split: split file to use for loading patches
:param bool is_transform: Transform patch to dimensions expected by PyTorch
:param list augmentations: Data augmentations to apply to patches
:param bool debug: enable debugging output
"""
def __init__(self, data_dir, n_classes, split="train", is_transform=True, augmentations=None, debug=False):
def __init__(self, config, split="train", is_transform=True, augmentations=None, debug=False):
self.data_dir = config.DATASET.ROOT
self.n_classes = config.DATASET.NUM_CLASSES
self.MIN = config.DATASET.MIN
self.MAX = config.DATASET.MAX
self.split = split
self.data_dir = data_dir
self.is_transform = is_transform
self.augmentations = augmentations
self.n_classes = n_classes
self.sections = list()
self.debug = debug
@ -152,10 +153,10 @@ class SectionLoader(data.Dataset):
im, lbl = _transform_WH_to_HW(im), _transform_WH_to_HW(lbl)
if self.debug and "test" in self.split:
outdir = f"debug/sectionLoader_{self.split}_raw"
outdir = f"debug/test/sectionLoader_{self.split}_raw"
generate_path(outdir)
path_prefix = f"{outdir}/index_{index}_section_{section_name}"
image_to_disk(im, path_prefix + "_img.png")
image_to_disk(im, path_prefix + "_img.png", self.MIN, self.MAX)
mask_to_disk(lbl, path_prefix + "_lbl.png", self.n_classes)
if self.augmentations is not None:
@ -166,10 +167,10 @@ class SectionLoader(data.Dataset):
im, lbl = self.transform(im, lbl)
if self.debug and "test" in self.split:
outdir = f"debug/sectionLoader_{self.split}_{'aug' if self.augmentations is not None else 'noaug'}"
outdir = f"debug/test/sectionLoader_{self.split}_{'aug' if self.augmentations is not None else 'noaug'}"
generate_path(outdir)
path_prefix = f"{outdir}/index_{index}_section_{section_name}"
image_to_disk(np.array(im[0]), path_prefix + "_img.png")
image_to_disk(np.array(im[0]), path_prefix + "_img.png", self.MIN, self.MAX)
mask_to_disk(np.array(lbl[0]), path_prefix + "_lbl.png", self.n_classes)
return im, lbl
@ -185,8 +186,7 @@ class SectionLoader(data.Dataset):
class TrainSectionLoader(SectionLoader):
"""
Training data loader for sections
:param str data_dir: Root directory for training/test data
:param str n_classes: number of segmentation mask classes
:param config: configuration object to define other attributes in loaders
:param str split: split file to use for loading patches
:param bool is_transform: Transform patch to dimensions expected by PyTorch
:param list augmentations: Data augmentations to apply to patches
@ -197,8 +197,7 @@ class TrainSectionLoader(SectionLoader):
def __init__(
self,
data_dir,
n_classes,
config,
split="train",
is_transform=True,
augmentations=None,
@ -207,8 +206,7 @@ class TrainSectionLoader(SectionLoader):
debug=False,
):
super(TrainSectionLoader, self).__init__(
data_dir,
n_classes,
config,
split=split,
is_transform=is_transform,
augmentations=augmentations,
@ -240,8 +238,7 @@ class TrainSectionLoader(SectionLoader):
class TrainSectionLoaderWithDepth(TrainSectionLoader):
"""
Section data loader that includes additional channel for depth
:param str data_dir: Root directory for training/test data
:param str n_classes: number of segmentation mask classes
:param config: configuration object to define other attributes in loaders
:param str split: split file to use for loading patches
:param bool is_transform: Transform patch to dimensions expected by PyTorch
:param list augmentations: Data augmentations to apply to patches
@ -252,8 +249,7 @@ class TrainSectionLoaderWithDepth(TrainSectionLoader):
def __init__(
self,
data_dir,
n_classes,
config,
split="train",
is_transform=True,
augmentations=None,
@ -262,8 +258,7 @@ class TrainSectionLoaderWithDepth(TrainSectionLoader):
debug=False,
):
super(TrainSectionLoaderWithDepth, self).__init__(
data_dir,
n_classes,
config,
split=split,
is_transform=is_transform,
augmentations=augmentations,
@ -304,8 +299,7 @@ class TrainSectionLoaderWithDepth(TrainSectionLoader):
class TestSectionLoader(SectionLoader):
"""
Test data loader for sections
:param str data_dir: Root directory for training/test data
:param str n_classes: number of segmentation mask classes
:param config: configuration object to define other attributes in loaders
:param str split: split file to use for loading patches
:param bool is_transform: Transform patch to dimensions expected by PyTorch
:param list augmentations: Data augmentations to apply to patches
@ -316,8 +310,7 @@ class TestSectionLoader(SectionLoader):
def __init__(
self,
data_dir,
n_classes,
config,
split="test1",
is_transform=True,
augmentations=None,
@ -326,7 +319,7 @@ class TestSectionLoader(SectionLoader):
debug=False,
):
super(TestSectionLoader, self).__init__(
data_dir, n_classes, split=split, is_transform=is_transform, augmentations=augmentations, debug=debug,
config, split=split, is_transform=is_transform, augmentations=augmentations, debug=debug,
)
if "test1" in self.split:
@ -356,8 +349,7 @@ class TestSectionLoader(SectionLoader):
class TestSectionLoaderWithDepth(TestSectionLoader):
"""
Test data loader for sections that includes additional channel for depth
:param str data_dir: Root directory for training/test data
:param str n_classes: number of segmentation mask classes
:param config: configuration object to define other attributes in loaders
:param str split: split file to use for loading patches
:param bool is_transform: Transform patch to dimensions expected by PyTorch
:param list augmentations: Data augmentations to apply to patches
@ -368,8 +360,7 @@ class TestSectionLoaderWithDepth(TestSectionLoader):
def __init__(
self,
data_dir,
n_classes,
config,
split="test1",
is_transform=True,
augmentations=None,
@ -378,8 +369,7 @@ class TestSectionLoaderWithDepth(TestSectionLoader):
debug=False,
):
super(TestSectionLoaderWithDepth, self).__init__(
data_dir,
n_classes,
config,
split=split,
is_transform=is_transform,
augmentations=augmentations,
@ -407,11 +397,11 @@ class TestSectionLoaderWithDepth(TestSectionLoader):
# dump images before augmentation
if self.debug:
outdir = f"debug/testSectionLoaderWithDepth_{self.split}_raw"
outdir = f"debug/test/testSectionLoaderWithDepth_{self.split}_raw"
generate_path(outdir)
# this needs to take the first dimension of image (no depth) but lbl only has 1 dim
path_prefix = f"{outdir}/index_{index}_section_{section_name}"
image_to_disk(im[0, :, :], path_prefix + "_img.png")
image_to_disk(im[0, :, :], path_prefix + "_img.png", self.MIN, self.MAX)
mask_to_disk(lbl, path_prefix + "_lbl.png", self.n_classes)
if self.augmentations is not None:
@ -425,12 +415,10 @@ class TestSectionLoaderWithDepth(TestSectionLoader):
# dump images and labels to disk after augmentation
if self.debug:
outdir = (
f"debug/testSectionLoaderWithDepth_{self.split}_{'aug' if self.augmentations is not None else 'noaug'}"
)
outdir = f"debug/test/testSectionLoaderWithDepth_{self.split}_{'aug' if self.augmentations is not None else 'noaug'}"
generate_path(outdir)
path_prefix = f"{outdir}/index_{index}_section_{section_name}"
image_to_disk(np.array(im[0, :, :]), path_prefix + "_img.png")
image_to_disk(np.array(im[0, :, :]), path_prefix + "_img.png", self.MIN, self.MAX)
mask_to_disk(np.array(lbl[0, :, :]), path_prefix + "_lbl.png", self.n_classes)
return im, lbl
@ -444,33 +432,41 @@ def _transform_WH_to_HW(numpy_array):
class PatchLoader(data.Dataset):
"""
Base Data loader for the patch-based deconvnet
:param str data_dir: Root directory for training/test data
:param str n_classes: number of segmentation mask classes
:param int stride: training data stride
:param int patch_size: Size of patch for training
:param config: configuration object to define other attributes in loaders
:param str split: split file to use for loading patches
:param bool is_transform: Transform patch to dimensions expected by PyTorch
:param list augmentations: Data augmentations to apply to patches
:param bool debug: enable debugging output
"""
def __init__(
self, data_dir, n_classes, stride=30, patch_size=99, is_transform=True, augmentations=None, debug=False,
):
self.data_dir = data_dir
def __init__(self, config, split="train", is_transform=True, augmentations=None, debug=False):
self.data_dir = config.DATASET.ROOT
self.n_classes = config.DATASET.NUM_CLASSES
self.split = split
self.MIN = config.DATASET.MIN
self.MAX = config.DATASET.MAX
self.patch_size = config.TRAIN.PATCH_SIZE
self.stride = config.TRAIN.STRIDE
self.is_transform = is_transform
self.augmentations = augmentations
self.n_classes = n_classes
self.patches = list()
self.patch_size = patch_size
self.stride = stride
self.debug = debug
def pad_volume(self, volume):
def pad_volume(self, volume, value):
"""
Only used for train/val!! Not test.
Pads a 3D numpy array with a constant value along the depth direction only.
Args:
volume (numpy ndarray): numpy array containing the seismic amplitude or labels.
value (int): value to pad the array with.
"""
return np.pad(volume, pad_width=self.patch_size, mode="constant", constant_values=255)
return np.pad(
volume,
pad_width=[(0, 0), (0, 0), (self.patch_size, self.patch_size)],
mode="constant",
constant_values=value,
)
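# Illustrative example of the padding above: with pad_width
# [(0, 0), (0, 0), (patch_size, patch_size)] only the last (depth) axis grows,
# e.g. a hypothetical volume of shape (401, 701, 255) with patch_size=99
# pads to (401, 701, 255 + 2 * 99) = (401, 701, 453).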
def __len__(self):
return len(self.patches)
@ -479,12 +475,7 @@ class PatchLoader(data.Dataset):
patch_name = self.patches[index]
direction, idx, xdx, ddx = patch_name.split(sep="_")
# Shift offsets the padding that is added in training
# shift = self.patch_size if "test" not in self.split else 0
# Remember we are cancelling the shift since we no longer pad
shift = 0
idx, xdx, ddx = int(idx) + shift, int(xdx) + shift, int(ddx) + shift
idx, xdx, ddx = int(idx), int(xdx), int(ddx)
if direction == "i":
im = self.seismic[idx, xdx : xdx + self.patch_size, ddx : ddx + self.patch_size]
@ -500,7 +491,7 @@ class PatchLoader(data.Dataset):
outdir = f"debug/patchLoader_{self.split}_raw"
generate_path(outdir)
path_prefix = f"{outdir}/index_{index}_section_{patch_name}"
image_to_disk(im, path_prefix + "_img.png")
image_to_disk(im, path_prefix + "_img.png", self.MIN, self.MAX)
mask_to_disk(lbl, path_prefix + "_lbl.png", self.n_classes)
if self.augmentations is not None:
@ -512,7 +503,7 @@ class PatchLoader(data.Dataset):
outdir = f"patchLoader_{self.split}_{'aug' if self.augmentations is not None else 'noaug'}"
generate_path(outdir)
path_prefix = f"{outdir}/{index}"
image_to_disk(im, path_prefix + "_img.png")
image_to_disk(im, path_prefix + "_img.png", self.MIN, self.MAX)
mask_to_disk(lbl, path_prefix + "_lbl.png", self.n_classes)
if self.is_transform:
@ -523,7 +514,7 @@ class PatchLoader(data.Dataset):
outdir = f"debug/patchLoader_{self.split}_{'aug' if self.augmentations is not None else 'noaug'}"
generate_path(outdir)
path_prefix = f"{outdir}/index_{index}_section_{patch_name}"
image_to_disk(np.array(im[0, :, :]), path_prefix + "_img.png")
image_to_disk(np.array(im[0, :, :]), path_prefix + "_img.png", self.MIN, self.MAX)
mask_to_disk(np.array(lbl[0, :, :]), path_prefix + "_lbl.png", self.n_classes)
return im, lbl
@ -536,46 +527,10 @@ class PatchLoader(data.Dataset):
return torch.from_numpy(img).float(), torch.from_numpy(lbl).long()
class TestPatchLoader(PatchLoader):
"""
Test Data loader for the patch-based deconvnet
:param str data_dir: Root directory for training/test data
:param str n_classes: number of segmentation mask classes
:param int stride: training data stride
:param int patch_size: Size of patch for training
:param bool is_transform: Transform patch to dimensions expected by PyTorch
:param list augmentations: Data augmentations to apply to patches
:param bool debug: enable debugging output
"""
def __init__(
self, data_dir, n_classes, stride=30, patch_size=99, is_transform=True, augmentations=None, debug=False
):
super(TestPatchLoader, self).__init__(
data_dir,
n_classes,
stride=stride,
patch_size=patch_size,
is_transform=is_transform,
augmentations=augmentations,
debug=debug,
)
## Warning: this is not used or tested
raise NotImplementedError("This class is not correctly implemented.")
self.seismic = np.load(_train_data_for(self.data_dir))
self.labels = np.load(_train_labels_for(self.data_dir))
patch_list = tuple(open(txt_path, "r"))
patch_list = [id_.rstrip() for id_ in patch_list]
self.patches = patch_list
class TrainPatchLoader(PatchLoader):
"""
Train data loader for the patch-based deconvnet
:param str data_dir: Root directory for training/test data
:param int stride: training data stride
:param int patch_size: Size of patch for training
:param config: configuration object to define other attributes in loaders
:param str split: split file to use for loading patches
:param bool is_transform: Transform patch to dimensions expected by PyTorch
:param list augmentations: Data augmentations to apply to patches
@ -584,11 +539,8 @@ class TrainPatchLoader(PatchLoader):
def __init__(
self,
data_dir,
n_classes,
config,
split="train",
stride=30,
patch_size=99,
is_transform=True,
augmentations=None,
seismic_path=None,
@ -596,16 +548,9 @@ class TrainPatchLoader(PatchLoader):
debug=False,
):
super(TrainPatchLoader, self).__init__(
data_dir,
n_classes,
stride=stride,
patch_size=patch_size,
is_transform=is_transform,
augmentations=augmentations,
debug=debug,
config, is_transform=is_transform, augmentations=augmentations, debug=debug,
)
warnings.warn("This no longer pads the volume")
if seismic_path is not None and label_path is not None:
# Load npy files (seismc and corresponding labels) from provided
# location (path)
@ -618,8 +563,11 @@ class TrainPatchLoader(PatchLoader):
else:
self.seismic = np.load(_train_data_for(self.data_dir))
self.labels = np.load(_train_labels_for(self.data_dir))
# We are in train/val mode. Most likely the test splits are not saved yet,
# so don't attempt to load them.
# pad the data:
self.seismic = self.pad_volume(self.seismic, value=0)
self.labels = self.pad_volume(self.labels, value=255)
self.split = split
# reading the file names for split
txt_path = path.join(self.data_dir, "splits", "patch_" + split + ".txt")
@ -631,9 +579,7 @@ class TrainPatchLoader(PatchLoader):
class TrainPatchLoaderWithDepth(TrainPatchLoader):
"""
Train data loader for the patch-based deconvnet with patch depth channel
:param str data_dir: Root directory for training/test data
:param int stride: training data stride
:param int patch_size: Size of patch for training
:param config: configuration object to define other attributes in loaders
:param str split: split file to use for loading patches
:param bool is_transform: Transform patch to dimensions expected by PyTorch
:param list augmentations: Data augmentations to apply to patches
@ -642,10 +588,8 @@ class TrainPatchLoaderWithDepth(TrainPatchLoader):
def __init__(
self,
data_dir,
config,
split="train",
stride=30,
patch_size=99,
is_transform=True,
augmentations=None,
seismic_path=None,
@ -653,10 +597,8 @@ class TrainPatchLoaderWithDepth(TrainPatchLoader):
debug=False,
):
super(TrainPatchLoaderWithDepth, self).__init__(
data_dir,
config,
split=split,
stride=stride,
patch_size=patch_size,
is_transform=is_transform,
augmentations=augmentations,
seismic_path=seismic_path,
@ -668,12 +610,7 @@ class TrainPatchLoaderWithDepth(TrainPatchLoader):
patch_name = self.patches[index]
direction, idx, xdx, ddx = patch_name.split(sep="_")
# Shift offsets the padding that is added in training
# shift = self.patch_size if "test" not in self.split else 0
# Remember we are cancelling the shift since we no longer pad
shift = 0
idx, xdx, ddx = int(idx) + shift, int(xdx) + shift, int(ddx) + shift
idx, xdx, ddx = int(idx), int(xdx), int(ddx)
if direction == "i":
im = self.seismic[idx, xdx : xdx + self.patch_size, ddx : ddx + self.patch_size]
@ -705,9 +642,7 @@ def _transform_HWC_to_CHW(numpy_array):
class TrainPatchLoaderWithSectionDepth(TrainPatchLoader):
"""
Train data loader for the patch-based deconvnet section depth channel
:param str data_dir: Root directory for training/test data
:param int stride: training data stride
:param int patch_size: Size of patch for training
:param config: configuration object to define other attributes in loaders
:param str split: split file to use for loading patches
:param bool is_transform: Transform patch to dimensions expected by PyTorch
:param list augmentations: Data augmentations to apply to patches
@ -718,11 +653,8 @@ class TrainPatchLoaderWithSectionDepth(TrainPatchLoader):
def __init__(
self,
data_dir,
n_classes,
config,
split="train",
stride=30,
patch_size=99,
is_transform=True,
augmentations=None,
seismic_path=None,
@ -730,11 +662,8 @@ class TrainPatchLoaderWithSectionDepth(TrainPatchLoader):
debug=False,
):
super(TrainPatchLoaderWithSectionDepth, self).__init__(
data_dir,
n_classes,
config,
split=split,
stride=stride,
patch_size=patch_size,
is_transform=is_transform,
augmentations=augmentations,
seismic_path=seismic_path,
@ -747,12 +676,7 @@ class TrainPatchLoaderWithSectionDepth(TrainPatchLoader):
patch_name = self.patches[index]
direction, idx, xdx, ddx = patch_name.split(sep="_")
# Shift offsets the padding that is added in training
# shift = self.patch_size if "test" not in self.split else 0
# Remember we are cancelling the shift since we no longer pad
shift = 0
idx, xdx, ddx = int(idx) + shift, int(xdx) + shift, int(ddx) + shift
idx, xdx, ddx = int(idx), int(xdx), int(ddx)
if direction == "i":
im = self.seismic[idx, :, xdx : xdx + self.patch_size, ddx : ddx + self.patch_size]
@ -769,7 +693,7 @@ class TrainPatchLoaderWithSectionDepth(TrainPatchLoader):
outdir = f"debug/patchLoaderWithSectionDepth_{self.split}_raw"
generate_path(outdir)
path_prefix = f"{outdir}/index_{index}_section_{patch_name}"
image_to_disk(im[0, :, :], path_prefix + "_img.png")
image_to_disk(im[0, :, :], path_prefix + "_img.png", self.MIN, self.MAX)
mask_to_disk(lbl, path_prefix + "_lbl.png", self.n_classes)
if self.augmentations is not None:
@ -783,7 +707,7 @@ class TrainPatchLoaderWithSectionDepth(TrainPatchLoader):
outdir = f"patchLoaderWithSectionDepth_{self.split}_{'aug' if self.augmentations is not None else 'noaug'}"
generate_path(outdir)
path_prefix = f"{outdir}/{index}"
image_to_disk(im[0, :, :], path_prefix + "_img.png")
image_to_disk(im[0, :, :], path_prefix + "_img.png", self.MIN, self.MAX)
mask_to_disk(lbl, path_prefix + "_lbl.png", self.n_classes)
if self.is_transform:
@ -796,7 +720,7 @@ class TrainPatchLoaderWithSectionDepth(TrainPatchLoader):
)
generate_path(outdir)
path_prefix = f"{outdir}/index_{index}_section_{patch_name}"
image_to_disk(np.array(im[0, :, :]), path_prefix + "_img.png")
image_to_disk(np.array(im[0, :, :]), path_prefix + "_img.png", self.MIN, self.MAX)
mask_to_disk(np.array(lbl[0, :, :]), path_prefix + "_lbl.png", self.n_classes)
return im, lbl
@ -812,8 +736,6 @@ _TRAIN_PATCH_LOADERS = {
"patch": TrainPatchLoaderWithDepth,
}
_TRAIN_SECTION_LOADERS = {"section": TrainSectionLoaderWithDepth}
def get_patch_loader(cfg):
assert str(cfg.TRAIN.DEPTH).lower() in [
@ -825,6 +747,9 @@ def get_patch_loader(cfg):
return _TRAIN_PATCH_LOADERS.get(cfg.TRAIN.DEPTH, TrainPatchLoader)
_TRAIN_SECTION_LOADERS = {"section": TrainSectionLoaderWithDepth}
def get_section_loader(cfg):
assert str(cfg.TRAIN.DEPTH).lower() in [
"section",


@ -6,7 +6,11 @@ Tests for TrainLoader and TestLoader classes when overriding the file names of t
import tempfile
import numpy as np
from interpretation.deepseismic_interpretation.dutchf3.data import get_test_loader, TrainPatchLoaderWithDepth, TrainSectionLoaderWithDepth
from deepseismic_interpretation.dutchf3.data import (
get_test_loader,
TrainPatchLoaderWithDepth,
TrainSectionLoaderWithDepth,
)
import pytest
import yacs.config
import os
@ -15,8 +19,10 @@ import os
IL = 5
XL = 10
D = 8
N_CLASSES = 2
CONFIG_FILE = "./experiments/interpretation/dutchf3_patch/configs/unet.yaml"
CONFIG_FILE = "./examples/interpretation/notebooks/configs/unet.yaml"
with open(CONFIG_FILE, "rt") as f_read:
config = yacs.config.load_cfg(f_read)
@ -52,10 +58,11 @@ def test_TestSectionLoader_should_load_data_from_test1_set():
generate_npy_files(os.path.join(data_dir, "test_once", "test1_labels.npy"), labels)
txt_path = os.path.join(data_dir, "splits", "section_test1.txt")
open(txt_path, 'a').close()
open(txt_path, "a").close()
TestSectionLoader = get_test_loader(config)
test_set = TestSectionLoader(data_dir = data_dir, split = 'test1')
config.merge_from_list(["DATASET.ROOT", data_dir])
test_set = TestSectionLoader(config, split="test1")
assert_dimensions(test_set)
@ -74,10 +81,11 @@ def test_TestSectionLoader_should_load_data_from_test2_set():
generate_npy_files(os.path.join(data_dir, "test_once", "test2_labels.npy"), labels)
txt_path = os.path.join(data_dir, "splits", "section_test2.txt")
open(txt_path, 'a').close()
open(txt_path, "a").close()
TestSectionLoader = get_test_loader(config)
test_set = TestSectionLoader(data_dir = data_dir, split = 'test2')
config.merge_from_list(["DATASET.ROOT", data_dir])
test_set = TestSectionLoader(config, split="test2")
assert_dimensions(test_set)
@ -94,143 +102,21 @@ def test_TestSectionLoader_should_load_data_from_path_override_data():
generate_npy_files(os.path.join(data_dir, "volume_name", "labels.npy"), labels)
txt_path = os.path.join(data_dir, "splits", "section_volume_name.txt")
open(txt_path, 'a').close()
open(txt_path, "a").close()
TestSectionLoader = get_test_loader(config)
test_set = TestSectionLoader(data_dir = data_dir,
split = "volume_name",
is_transform = True,
augmentations = None,
seismic_path = os.path.join(data_dir, "volume_name", "seismic.npy"),
label_path = os.path.join(data_dir, "volume_name", "labels.npy"))
config.merge_from_list(["DATASET.ROOT", data_dir])
test_set = TestSectionLoader(
config,
split="volume_name",
is_transform=True,
augmentations=None,
seismic_path=os.path.join(data_dir, "volume_name", "seismic.npy"),
label_path=os.path.join(data_dir, "volume_name", "labels.npy"),
)
assert_dimensions(test_set)
def test_TrainSectionLoaderWithDepth_should_fail_on_empty_file_names(tmpdir):
"""
Check for exception when files do not exist
"""
# Test
with pytest.raises(Exception) as excinfo:
_ = TrainSectionLoaderWithDepth(
data_dir = tmpdir,
split = "volume_name",
is_transform=True,
augmentations=None,
seismic_path = "",
label_path = ""
)
assert "does not exist" in str(excinfo.value)
def test_TrainSectionLoaderWithDepth_should_fail_on_missing_seismic_file(tmpdir):
"""
Check for exception when training param is empty
"""
# Setup
os.makedirs(os.path.join(tmpdir, "volume_name"))
os.makedirs(os.path.join(tmpdir, "splits"))
labels = np.ones([IL, XL, D])
generate_npy_files(os.path.join(tmpdir, "volume_name", "labels.npy"), labels)
txt_path = os.path.join(tmpdir, "splits", "patch_volume_name.txt")
open(txt_path, 'a').close()
# Test
with pytest.raises(Exception) as excinfo:
_ = TrainSectionLoaderWithDepth(
data_dir = tmpdir,
split = "volume_name",
is_transform=True,
augmentations=None,
seismic_path=os.path.join(tmpdir, "volume_name", "seismic.npy"),
label_path=os.path.join(tmpdir, "volume_name", "labels.npy")
)
assert "does not exist" in str(excinfo.value)
def test_TrainSectionLoaderWithDepth_should_fail_on_missing_label_file(tmpdir):
"""
Check for exception when training param is empty
"""
# Setup
os.makedirs(os.path.join(tmpdir, "volume_name"))
os.makedirs(os.path.join(tmpdir, "splits"))
labels = np.ones([IL, XL, D])
generate_npy_files(os.path.join(tmpdir, "volume_name", "labels.npy"), labels)
txt_path = os.path.join(tmpdir, "splits", "patch_volume_name.txt")
open(txt_path, 'a').close()
# Test
with pytest.raises(Exception) as excinfo:
_ = TrainSectionLoaderWithDepth(
data_dir = tmpdir,
split = "volume_name",
is_transform=True,
augmentations=None,
seismic_path=os.path.join(tmpdir, "volume_name", "seismic.npy"),
label_path=os.path.join(tmpdir, "volume_name", "labels.npy")
)
assert "does not exist" in str(excinfo.value)
def test_TrainSectionLoaderWithDepth_should_load_with_one_train_and_label_file(tmpdir):
"""
Check for successful class instantiation w/ single npy file for train & label
"""
# Setup
os.makedirs(os.path.join(tmpdir, "volume_name"))
os.makedirs(os.path.join(tmpdir, "splits"))
seimic = np.zeros([IL, XL, D])
generate_npy_files(os.path.join(tmpdir, "volume_name", "seismic.npy"), seimic)
labels = np.ones([IL, XL, D])
generate_npy_files(os.path.join(tmpdir, "volume_name", "labels.npy"), labels)
txt_path = os.path.join(tmpdir, "splits", "section_volume_name.txt")
open(txt_path, 'a').close()
# Test
train_set = TrainSectionLoaderWithDepth(
data_dir = tmpdir,
split = "volume_name",
is_transform=True,
augmentations=None,
seismic_path=os.path.join(tmpdir, "volume_name", "seismic.npy"),
label_path=os.path.join(tmpdir, "volume_name", "labels.npy")
)
assert train_set.labels.shape == (IL, XL, D)
assert train_set.seismic.shape == (IL, 3, XL, D)
def test_TrainPatchLoaderWithDepth_should_fail_on_empty_file_names(tmpdir):
"""
Check for exception when files do not exist
"""
# Test
with pytest.raises(Exception) as excinfo:
_ = TrainPatchLoaderWithDepth(
data_dir = tmpdir,
split = "volume_name",
is_transform=True,
stride=25,
patch_size=100,
augmentations=None,
seismic_path = "",
label_path = ""
)
assert "does not exist" in str(excinfo.value)
def test_TrainPatchLoaderWithDepth_should_fail_on_missing_seismic_file(tmpdir):
"""
@ -244,20 +130,20 @@ def test_TrainPatchLoaderWithDepth_should_fail_on_missing_seismic_file(tmpdir):
generate_npy_files(os.path.join(tmpdir, "volume_name", "labels.npy"), labels)
txt_path = os.path.join(tmpdir, "splits", "patch_volume_name.txt")
open(txt_path, 'a').close()
open(txt_path, "a").close()
config.merge_from_list(["DATASET.ROOT", str(tmpdir)])
# Test
with pytest.raises(Exception) as excinfo:
_ = TrainPatchLoaderWithDepth(
data_dir = tmpdir,
split = "volume_name",
config,
split="volume_name",
is_transform=True,
stride=25,
patch_size=100,
augmentations=None,
seismic_path=os.path.join(tmpdir, "volume_name", "seismic.npy"),
label_path=os.path.join(tmpdir, "volume_name", "labels.npy")
label_path=os.path.join(tmpdir, "volume_name", "labels.npy"),
)
assert "does not exist" in str(excinfo.value)
@ -274,20 +160,20 @@ def test_TrainPatchLoaderWithDepth_should_fail_on_missing_label_file(tmpdir):
generate_npy_files(os.path.join(tmpdir, "volume_name", "seismic.npy"), seimic)
txt_path = os.path.join(tmpdir, "splits", "patch_volume_name.txt")
open(txt_path, 'a').close()
open(txt_path, "a").close()
config.merge_from_list(["DATASET.ROOT", str(tmpdir)])
# Test
with pytest.raises(Exception) as excinfo:
_ = TrainPatchLoaderWithDepth(
data_dir = tmpdir,
split = "volume_name",
config,
split="volume_name",
is_transform=True,
stride=25,
patch_size=100,
augmentations=None,
seismic_path=os.path.join(tmpdir, "volume_name", "seismic.npy"),
label_path=os.path.join(tmpdir, "volume_name", "labels.npy")
label_path=os.path.join(tmpdir, "volume_name", "labels.npy"),
)
assert "does not exist" in str(excinfo.value)
@ -306,20 +192,21 @@ def test_TrainPatchLoaderWithDepth_should_load_with_one_train_and_label_file(tmp
labels = np.ones([IL, XL, D])
generate_npy_files(os.path.join(tmpdir, "volume_name", "labels.npy"), labels)
txt_path = os.path.join(tmpdir, "splits", "patch_volume_name.txt")
open(txt_path, 'a').close()
txt_dir = os.path.join(tmpdir, "splits")
txt_path = os.path.join(txt_dir, "patch_volume_name.txt")
open(txt_path, "a").close()
config.merge_from_list(["DATASET.ROOT", str(tmpdir)])
# Test
train_set = TrainPatchLoaderWithDepth(
data_dir = tmpdir,
split = "volume_name",
config,
split="volume_name",
is_transform=True,
stride=25,
patch_size=100,
augmentations=None,
seismic_path=os.path.join(tmpdir, "volume_name", "seismic.npy"),
label_path=os.path.join(tmpdir, "volume_name", "labels.npy")
label_path=os.path.join(tmpdir, "volume_name", "labels.npy"),
)
assert train_set.labels.shape == (IL, XL, D)
assert train_set.seismic.shape == (IL, XL, D)
assert train_set.labels.shape == (IL, XL, D + 2 * config.TRAIN.PATCH_SIZE)
assert train_set.seismic.shape == (IL, XL, D + 2 * config.TRAIN.PATCH_SIZE)


@ -0,0 +1,2 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.


@ -0,0 +1,2 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.


@ -0,0 +1,2 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.


@ -18,7 +18,7 @@ def _torch_hist(label_true, label_pred, n_class):
Returns:
[type]: [description]
"""
assert len(label_true.shape) == 1, "Labels need to be 1D"
assert len(label_pred.shape) == 1, "Predictions need to be 1D"
mask = (label_true >= 0) & (label_true < n_class)


@ -0,0 +1,148 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.
"""
Utility Script to convert segy files to blocks of numpy arrays and save to individual npy files
"""
import os
import timeit
import argparse
import numpy as np
from deepseismic_interpretation.segyconverter.utils import segyextract, dataprep
import json
K = 12
MIN_VAL = 0
MAX_VAL = 1
def filter_data(output_dir, stddev_file, k, min_range, max_range, clip, normalize):
"""
Normalization step on all files in output_dir. This function overwrites the existing
data file
:param str output_dir: Directory path of all npy files to normalize
:param str stddev_file: txt file containing standard deviation result
:param int k: number of standard deviation to be used in normalization
:param float min_range: minimum range value
:param float max_range: maximum range value
:param clip: flag to turn on/off clip
:param normalize: flag to turn on/off normalization.
"""
txt_file = os.path.join(output_dir, stddev_file)
if not os.path.isfile(txt_file):
raise Exception("Std Deviation file could not be found")
with open(os.path.join(txt_file), "r") as f:
metadatastr = f.read()
try:
metadata = json.loads(metadatastr)
stddev = float(metadata["stddev"])
mean = float(metadata["mean"])
except ValueError:
raise Exception("stddev value not valid: {}".format(metadatastr))
npy_files = list(f for f in os.listdir(output_dir) if f.endswith(".npy"))
for local_filename in npy_files:
cube = np.load(os.path.join(output_dir, local_filename))
if normalize or clip:
cube = dataprep.apply(cube, stddev, mean, k, min_range, max_range, clip=clip, normalize=normalize)
np.save(os.path.join(output_dir, local_filename), cube)
def main(
input_file,
output_dir,
prefix,
iline=189,
xline=193,
metadata_only=False,
stride=128,
cube_size=-1,
normalize=True,
clip=True,
):
"""
Select a single column out of the segy file and generate all cubes in the z(time)
direction. The column is indexed by the inline and xline. To use this command, you
should have already run the metadata extract to determine the
ranges of the inlines and xlines. It will error out if the range is incorrect
Sample call: python3 convert_segy.py --input_file
seismic_data.segy --prefix seismic --output_dir ./seismic
:param str input_file: input segy file path
:param str output_dir: output directory to save npy files
:param str prefix: file prefix for npy files
:param int iline: byte location for inlines
:param int xline: byte location for crosslines
:param bool metadata_only: Only return the metadata of the segy file
:param int stride: overlap between cubes; stride == cube_size means no overlap
:param int cube_size: size of cubes to generate
"""
if not os.path.exists(output_dir):
os.makedirs(output_dir)
fast_indexes, slow_indexes, trace_headers, sample_size = segyextract.get_segy_metadata(input_file, iline, xline)
print("\tFast Lines: {} to {} ({} lines)".format(np.min(fast_indexes), np.max(fast_indexes), len(fast_indexes)))
print("\tSlow Lines: {} to {} ({} lines)".format(np.min(slow_indexes), np.max(slow_indexes), len(slow_indexes)))
print("\tSample Size: {}".format(sample_size))
print("\tTrace Count: {}".format(len(trace_headers)))
print("\tFirst five distinct Fast Line Indexes: {}".format(fast_indexes[0:5]))
print("\tFirst five distinct Slow Line Indexes: {}".format(slow_indexes[0:5]))
print("\tFirst five fast trace ids: {}".format(trace_headers["fast"][0:5].values))
print("\tFirst five slow trace ids: {}".format(trace_headers["slow"][0:5].values))
if not metadata_only:
process_time_segy = 0
if cube_size == -1:
# only generate one npy file
wrapped_processor_segy = segyextract.timewrapper(
segyextract.process_segy_data_into_single_array, input_file, output_dir, prefix, iline, xline
)
process_time_segy = timeit.timeit(wrapped_processor_segy, number=1)
else:
wrapped_processor_segy = segyextract.timewrapper(
segyextract.process_segy_data, input_file, output_dir, prefix, stride=stride, n_points=cube_size
)
process_time_segy = timeit.timeit(wrapped_processor_segy, number=1)
print(f"Completed SEG-Y converstion in: {process_time_segy}")
# At this point, there should be npy files in the output directory + one file containing the std deviation found in the segy
print("Preparing File")
timed_filter_data = segyextract.timewrapper(
filter_data, output_dir, f"{prefix}_stats.json", K, MIN_VAL, MAX_VAL, clip=clip, normalize=normalize
)
process_time_normalize = timeit.timeit(timed_filter_data, number=1)
print(f"Completed file preparation in {process_time_normalize} seconds")
if __name__ == "__main__":
parser = argparse.ArgumentParser("train")
parser.add_argument("--prefix", type=str, help="prefix label for output files", required=True)
parser.add_argument("--input_file", type=str, help="segy file path", required=True)
parser.add_argument("--output_dir", type=str, help="Output files are written to this directory", default=".")
parser.add_argument("--metadata_only", action="store_true", help="Only produce inline,xline metadata")
parser.add_argument("--iline", type=int, default=189, help="segy file path")
parser.add_argument("--xline", type=int, default=193, help="segy file path")
parser.add_argument("--cube_size", type=int, default=-1, help="cube dimensions")
parser.add_argument("--stride", type=int, default=128, help="stride")
parser.add_argument("--normalize", action="store_true", help="Normalization flag - clip and normalize the data")
parser.add_argument("--clip", action="store_true", help="Clipping flag - only clip the data")
args = parser.parse_args()
localfile = args.input_file
main(
args.input_file,
args.output_dir,
args.prefix,
args.iline,
args.xline,
args.metadata_only,
args.stride,
args.cube_size,
args.normalize,
args.clip,
)


@ -0,0 +1,212 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.
"""
Test that the current scripts can run from the command line
"""
import os
import numpy as np
from deepseismic_interpretation.segyconverter import convert_segy
from deepseismic_interpretation.segyconverter.test import test_util
import pytest
import segyio
MAX_RANGE = 1
MIN_RANGE = 0
ERROR_EXIT_CODE = 99
@pytest.fixture(scope="class")
def segy_single_file(request):
# setup code
# create segy file
inlinefile = "./inlinesortsample.segy"
test_util.create_segy_file(
lambda il, xl: not ((il < 20 and xl < 125) or (il > 40 and xl > 250)),
inlinefile,
segyio.TraceSortingFormat.INLINE_SORTING,
)
# inject class variables
request.cls.testfile = inlinefile
yield
# teardown code
os.remove(inlinefile)
@pytest.mark.usefixtures("segy_single_file")
class TestConvertSEGY:
testfile = None # Set by segy_file fixture
def test_convert_segy_generates_single_npy(self, tmpdir):
# Setup
prefix = "volume1"
input_file = self.testfile
output_dir = tmpdir.strpath
metadata_only = False
iline = 189
xline = 193
cube_size = -1
stride = 128
normalize = True
clip = True
inputpath = ""
# Test
convert_segy.main(
input_file, output_dir, prefix, iline, xline, metadata_only, stride, cube_size, normalize, clip
)
# Validate
npy_files = test_util.get_npy_files(tmpdir.strpath)
assert len(npy_files) == 1
min_val, max_val = _get_min_max(tmpdir.strpath)
assert min_val >= MIN_RANGE
assert max_val <= MAX_RANGE
def test_convert_segy_generates_multiple_npy_files(self, tmpdir):
"""
Run convert_segy.main with a positive cube_size and check that multiple npy files are written
:param tmpdir: pytest fixture providing a temporary output directory
"""
# Setup
prefix = "volume1"
input_file = self.testfile
output_dir = tmpdir.strpath
metadata_only = False
iline = 189
xline = 193
cube_size = 128
stride = 128
normalize = True
inputpath = ""
clip = True
# Test
convert_segy.main(
input_file, output_dir, prefix, iline, xline, metadata_only, stride, cube_size, normalize, clip
)
# Validate
npy_files = test_util.get_npy_files(tmpdir.strpath)
assert len(npy_files) == 2
def test_convert_segy_normalizes_data(self, tmpdir):
"""
Run convert_segy.main with normalization enabled and check that output values fall within [MIN_RANGE, MAX_RANGE]
:param tmpdir: pytest fixture providing a temporary output directory
"""
# Setup
prefix = "volume1"
input_file = self.testfile
output_dir = tmpdir.strpath
metadata_only = False
iline = 189
xline = 193
cube_size = 128
stride = 128
normalize = True
inputpath = ""
clip = True
# Test
convert_segy.main(
input_file, output_dir, prefix, iline, xline, metadata_only, stride, cube_size, normalize, clip
)
# Validate
npy_files = test_util.get_npy_files(tmpdir.strpath)
assert len(npy_files) == 2
min_val, max_val = _get_min_max(tmpdir.strpath)
assert min_val >= MIN_RANGE
assert max_val <= MAX_RANGE
def test_convert_segy_clips_data(self, tmpdir):
"""
Run convert_segy.main with clipping only and check that output values are clipped to the expected range
:param tmpdir: pytest fixture providing a temporary output directory
"""
# Setup
prefix = "volume1"
input_file = self.testfile
output_dir = tmpdir.strpath
metadata_only = False
iline = 189
xline = 193
cube_size = 128
stride = 128
normalize = False
inputpath = ""
clip = True
# Test
convert_segy.main(
input_file, output_dir, prefix, iline, xline, metadata_only, stride, cube_size, normalize, clip
)
# Validate
expected_max = 35.59
expected_min = -35.59
npy_files = test_util.get_npy_files(tmpdir.strpath)
assert len(npy_files) == 2
min_val, max_val = _get_min_max(tmpdir.strpath)
assert expected_min == pytest.approx(min_val, rel=1e-3)
assert expected_max == pytest.approx(max_val, rel=1e-3)
def test_convert_segy_copies_exact_data_with_no_normalization(self, tmpdir):
"""
Run convert_segy.main without clipping or normalization and check that output values match the original data range
:param tmpdir: pytest fixture providing a temporary output directory
"""
# Setup
prefix = "volume1"
input_file = self.testfile
output_dir = tmpdir.strpath
metadata_only = False
iline = 189
xline = 193
cube_size = 128
stride = 128
normalize = False
inputpath = ""
clip = False
# Test
convert_segy.main(
input_file, output_dir, prefix, iline, xline, metadata_only, stride, cube_size, normalize, clip
)
# Validate
expected_max = 1039.8
expected_min = -1039.8
npy_files = test_util.get_npy_files(tmpdir.strpath)
assert len(npy_files) == 2
min_val, max_val = _get_min_max(tmpdir.strpath)
assert expected_min == pytest.approx(min_val, rel=1e-3)
assert expected_max == pytest.approx(max_val, rel=1e-3)
def _get_min_max(outputdir):
"""
Find the minimum and maximum values across all npy files in a directory
:param str outputdir: directory to check for npy files
:returns: min_val, max_val of values in npy files
:rtype: float, float
"""
min_val = 0
max_val = 0
npy_files = test_util.get_npy_files(outputdir)
for file in npy_files:
data = np.load(os.path.join(outputdir, file))
this_min = np.amin(data)
this_max = np.amax(data)
if this_min < min_val:
min_val = this_min
if this_max > max_val:
max_val = this_max
return min_val, max_val


@ -0,0 +1,166 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.
"""
Test data normalization
"""
import numpy as np
from deepseismic_interpretation.segyconverter.utils import dataprep
import pytest
INPUT_FOLDER = "./contrib/segyconverter/test/test_data"
MAX_RANGE = 1
MIN_RANGE = 0
K = 12
class TestNormalizeCube:
testcube = None # Set by npy_files fixture
def test_normalize_cube_returns_normalized_values(self):
"""
Test method that normalize one cube by checking if normalized
values are within [min, max] range.
"""
trace = np.linspace(-1, 1, 100, True, dtype=np.single)
cube = np.ones((100, 50, 100)) * trace * 500
# Add values to clip
cube[40, 25, 50] = 700
cube[70, 30, 70] = -700
mean = np.mean(cube)
variance = np.var(cube)
stddev = np.sqrt(variance)
min_clip, max_clip, scale = dataprep.compute_statistics(stddev, mean, MAX_RANGE, K)
norm_block = dataprep.normalize_cube(cube, min_clip, max_clip, scale, MIN_RANGE, MAX_RANGE)
assert np.amax(norm_block) <= MAX_RANGE
assert np.amin(norm_block) >= MIN_RANGE
def test_clip_cube_returns_clipped_values(self):
"""
Test method that clip one cube by checking if clipped
values are within [min_clip, max_clip] range.
"""
trace = np.linspace(-1, 1, 100, True, dtype=np.single)
cube = np.ones((100, 50, 100)) * trace * 500
# Add values to clip
cube[40, 25, 50] = 700
cube[70, 30, 70] = -700
mean = np.mean(cube)
variance = np.var(cube)
stddev = np.sqrt(variance)
min_clip, max_clip, scale = dataprep.compute_statistics(stddev, mean, MAX_RANGE, K)
clipped_block = dataprep.clip_cube(cube, min_clip, max_clip)
assert np.amax(clipped_block) <= max_clip
assert np.amin(clipped_block) >= min_clip
def test_norm_value_is_correct(self):
# Check if normalized value is calculated correctly
min_clip = -18469.875210304104
max_clip = 18469.875210304104
scale = 2.707110872741882e-05
input_value = 2019
expected_norm_value = 0.5546565685206586
norm_v = dataprep.norm_value(input_value, min_clip, max_clip, MIN_RANGE, MAX_RANGE, scale)
assert norm_v == pytest.approx(expected_norm_value, rel=1e-3)
def test_clip_value_is_correct(self):
# Check if normalized value is calculated correctly
min_clip = -18469.875210304104
max_clip = 18469.875210304104
input_value = 2019
expected_clipped_value = 2019
clipped_v = dataprep.clip_value(input_value, min_clip, max_clip)
assert clipped_v == pytest.approx(expected_clipped_value, rel=1e-3)
def test_norm_value_on_cube_is_within_range(self):
# Check if normalized value is within [MIN_RANGE, MAX_RANGE]
trace = np.linspace(-1, 1, 100, True, dtype=np.single)
cube = np.ones((100, 50, 100)) * trace * 500
cube[40, 25, 50] = 7000
cube[70, 30, 70] = -7000
variance = np.var(cube)
stddev = np.sqrt(variance)
mean = np.mean(cube)
v = cube[10, 40, 5]
min_clip, max_clip, scale = dataprep.compute_statistics(stddev, mean, MAX_RANGE, K)
norm_v = dataprep.norm_value(v, min_clip, max_clip, MIN_RANGE, MAX_RANGE, scale)
assert norm_v <= MAX_RANGE
assert norm_v >= MIN_RANGE
pytest.raises(Exception, dataprep.norm_value, v, min_clip * 10, max_clip * 10, MIN_RANGE, MAX_RANGE, scale * 10)
def test_clipped_value_on_cube_is_within_range(self):
# Check if clipped value is within [min_clip, max_clip]
trace = np.linspace(-1, 1, 100, True, dtype=np.single)
cube = np.ones((100, 50, 100)) * trace * 500
cube[40, 25, 50] = 7000
cube[70, 30, 70] = -7000
variance = np.var(cube)
mean = np.mean(cube)
stddev = np.sqrt(variance)
v = cube[10, 40, 5]
min_clip, max_clip, scale = dataprep.compute_statistics(stddev, mean, MAX_RANGE, K)
clipped_v = dataprep.clip_value(v, min_clip, max_clip)
assert clipped_v <= max_clip
assert clipped_v >= min_clip
def test_compute_statistics(self):
# Check if statistics are calculated correctly for provided stddev, max_range and k values
expected_min_clip = -138.693888
expected_max_clip = 138.693888
expected_scale = 0.003605061529459755
mean = 0
stddev = 11.557824
min_clip, max_clip, scale = dataprep.compute_statistics(stddev, mean, MAX_RANGE, K)
assert expected_min_clip == pytest.approx(min_clip, rel=1e-3)
assert expected_max_clip == pytest.approx(max_clip, rel=1e-3)
assert expected_scale == pytest.approx(scale, rel=1e-3)
# Testing division by zero
pytest.raises(Exception, dataprep.compute_statistics, stddev, MAX_RANGE, 0)
pytest.raises(Exception, dataprep.compute_statistics, 0, MAX_RANGE, 0)
def test_apply_should_clip_and_normalize_data(self):
# Check that apply method will clip and normalize the data
trace = np.linspace(-1, 1, 100, True, dtype=np.single)
cube = np.ones((100, 50, 100)) * trace * 500
cube[40, 25, 50] = 7000
cube[70, 30, 70] = -7000
variance = np.var(cube)
stddev = np.sqrt(variance)
mean = np.mean(cube)
norm_block = dataprep.apply(cube, stddev, mean, K, MIN_RANGE, MAX_RANGE)
assert np.amax(norm_block) <= MAX_RANGE
assert np.amin(norm_block) >= MIN_RANGE
norm_block = dataprep.apply(cube, stddev, mean, K, MIN_RANGE, MAX_RANGE, clip=False)
assert np.amax(norm_block) <= MAX_RANGE
assert np.amin(norm_block) >= MIN_RANGE
pytest.raises(Exception, dataprep.apply, cube, stddev, 0, MIN_RANGE, MAX_RANGE)
pytest.raises(Exception, dataprep.apply, cube, 0, K, MIN_RANGE, MAX_RANGE)
invalid_cube = np.empty_like(cube)
invalid_cube[:] = np.nan
pytest.raises(Exception, dataprep.apply, invalid_cube, stddev, 0, MIN_RANGE, MAX_RANGE)
def test_apply_should_clip_data(self):
# Check that apply method will clip the data
trace = np.linspace(-1, 1, 100, True, dtype=np.single)
cube = np.ones((100, 50, 100)) * trace * 500
cube[40, 25, 50] = 7000
cube[70, 30, 70] = -7000
variance = np.var(cube)
stddev = np.sqrt(variance)
mean = np.mean(cube)
min_clip, max_clip, _ = dataprep.compute_statistics(stddev, mean, MAX_RANGE, K)
norm_block = dataprep.apply(cube, stddev, mean, K, MIN_RANGE, MAX_RANGE, clip=True, normalize=False)
assert np.amax(norm_block) <= max_clip
assert np.amin(norm_block) >= min_clip
invalid_cube = np.empty_like(cube)
invalid_cube[:] = np.nan
pytest.raises(
Exception, dataprep.apply, invalid_cube, stddev, 0, MIN_RANGE, MAX_RANGE, clip=True, normalize=False
)
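As a quick sanity check of the numbers used in test_compute_statistics above, the expected clip bounds and scale follow directly from the formulas in dataprep.compute_statistics; the values are reproduced here for illustration only:
stddev, mean, k, max_range = 11.557824, 0.0, 12, 1.0
min_clip = mean - k * stddev               # -138.693888
max_clip = mean + k * stddev               #  138.693888
scale = max_range / (max_clip - min_clip)  # ~0.003605
print(min_clip, max_clip, scale)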


@ -0,0 +1,317 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.
"""
Test the extract functions against a variety of SEGY files and trace_header scenarios
"""
import os
import pytest
import numpy as np
import pandas as pd
from deepseismic_interpretation.segyconverter.utils import segyextract
from deepseismic_interpretation.segyconverter.test import test_util
import segyio
import json
FILENAME = "./normalsegy.segy"
PREFIX = "normal"
@pytest.fixture(scope="class")
def segy_all_files(request):
# setup code
# create segy file
normal_filename = "./normalsegy.segy"
test_util.create_segy_file(lambda il, xl: True, normal_filename)
request.cls.control_file = normal_filename # Set by segy_file fixture
inline_filename = "./inlineerror.segy"
test_util.create_segy_file(
lambda il, xl: not ((il < 20 and xl < 125) or (il > 40 and xl > 250)),
inline_filename,
segyio.TraceSortingFormat.INLINE_SORTING,
)
request.cls.inline_sort_file = inline_filename # Set by segy_file fixture
xline_filename = "./xlineerror.segy"
test_util.create_segy_file(
lambda il, xl: not ((il < 20 and xl < 125) or (il > 40 and xl > 250)),
xline_filename,
segyio.TraceSortingFormat.CROSSLINE_SORTING,
)
request.cls.crossline_sort_file = xline_filename # Set by segy_file fixture
hole_filename = "./hole.segy"
test_util.create_segy_file(
lambda il, xl: not ((20 < il < 30) and (150 < xl < 250)),
hole_filename,
segyio.TraceSortingFormat.INLINE_SORTING,
)
request.cls.hole_file = hole_filename # Set by segy_file fixture
yield
# teardown code
os.remove(normal_filename)
os.remove(inline_filename)
os.remove(xline_filename)
os.remove(hole_filename)
@pytest.mark.usefixtures("segy_all_files")
class TestSEGYExtract:
control_file = None # Set by segy_file fixture
inline_sort_file = None # Set by segy_file fixture
crossline_sort_file = None # Set by segy_file fixture
hole_file = None # Set by segy_file fixture
@pytest.mark.parametrize(
"filename, trace_count, first_inline, inline_count, first_xline, xline_count, depth",
[
("./normalsegy.segy", 8000, 10, 40, 100, 200, 10),
("./inlineerror.segy", 7309, 10, 40, 125, 200, 10),
("./xlineerror.segy", 7309, 10, 40, 125, 200, 10),
("./hole.segy", 7109, 10, 40, 100, 200, 10),
],
)
def test_get_segy_metadata_should_return_correct_metadata(
self, filename, trace_count, first_inline, inline_count, first_xline, xline_count, depth
):
"""
Check that get_segy_metadata can correctly identify the sorting from the trace headers
:param dict tmpdir: pytest fixture for local test directory cleanup
:param str filename: SEG-Y filename
:param int inline: byte location for inline
:param int xline: byte location for crossline
:param int depth: number of samples
"""
# setup
inline_byte_loc = 189
xline_byte_loc = 193
# test
fast_indexes, slow_indexes, trace_headers, sample_size = segyextract.get_segy_metadata(
filename, inline_byte_loc, xline_byte_loc
)
# validate
assert sample_size == depth
assert len(trace_headers) == trace_count
assert len(fast_indexes) == inline_count
assert len(slow_indexes) == xline_count
# Check fast direction
assert trace_headers["slow"][0] == first_xline
assert trace_headers["fast"][0] == first_inline
@pytest.mark.parametrize(
"filename,inline,xline,depth",
[
("./normalsegy.segy", 40, 200, 10),
("./inlineerror.segy", 40, 200, 10),
("./xlineerror.segy", 40, 200, 10),
("./hole.segy", 40, 200, 10),
],
)
def test_process_segy_data_should_create_cube_size_equal_to_segy(self, tmpdir, filename, inline, xline, depth):
"""
Create single npy file for segy and validate size
:param dict tmpdir: pytest fixture for local test directory cleanup
:param str filename: SEG-Y filename
:param int inline: byte location for inline
:param int xline: byte location for crossline
:param int depth: number of samples
"""
segyextract.process_segy_data_into_single_array(filename, tmpdir.strpath, PREFIX)
npy_files = test_util.get_npy_files(tmpdir.strpath)
assert len(npy_files) == 1
data = np.load(os.path.join(tmpdir.strpath, npy_files[0]))
assert len(data.shape) == 3
assert data.shape[0] == inline
assert data.shape[1] == xline
assert data.shape[2] == depth
def test_process_segy_data_should_write_npy_files_for_n_equals_128_stride_64(self, tmpdir):
"""
Break data up into size n=128 size blocks and validate against original segy
file. This size of block causes the code to write 1 x 4 npy files
:param function tmpdir: pytest fixture for local test directory cleanup
"""
# setup
n_points = 128
stride = 64
# test
segyextract.process_segy_data(FILENAME, tmpdir.strpath, PREFIX, n_points=n_points, stride=stride)
# validate
_output_npy_files_are_correct_for_cube_size(4, 128, tmpdir.strpath)
def test_process_segy_data_should_write_npy_files_for_n_equals_128(self, tmpdir):
"""
Break data up into size n=128 size blocks and validate against original segy
file. This size of block causes the code to write 1 x 4 npy files
:param function tmpdir: pytest fixture for local test directory cleanup
"""
# setup
n_points = 128
# test
segyextract.process_segy_data(FILENAME, tmpdir.strpath, PREFIX)
# validate
npy_files = _output_npy_files_are_correct_for_cube_size(2, 128, tmpdir.strpath)
full_volume_from_file = test_util.build_volume(n_points, npy_files, tmpdir.strpath)
# Validate contents of volume
_compare_variance(FILENAME, PREFIX, full_volume_from_file, tmpdir.strpath)
def test_process_segy_data_should_write_npy_files_for_n_equals_64(self, tmpdir):
"""
Break data up into size n=64 size blocks and validate against original segy
file. This size of block causes the code to write 1 x 8 npy files
:param function tmpdir: pytest fixture for local test directory cleanup
"""
# setup
n_points = 64
expected_file_count = 4
# test
segyextract.process_segy_data(FILENAME, tmpdir.strpath, PREFIX, n_points=n_points, stride=n_points)
# validate
npy_files = _output_npy_files_are_correct_for_cube_size(expected_file_count, n_points, tmpdir.strpath)
full_volume_from_file = test_util.build_volume(n_points, npy_files, tmpdir.strpath)
# Validate contents of volume
_compare_variance(FILENAME, PREFIX, full_volume_from_file, tmpdir.strpath)
def test_process_segy_data_should_write_npy_files_for_n_equals_16(self, tmpdir):
"""
Break data up into size n=16 size blocks and validate against original segy
file. This size of block causes the code to write 2 x 4 x 32 npy files.
:param function tmpdir: pytest fixture for local test directory cleanup
"""
# setup
n_points = 16
# test
segyextract.process_segy_data(FILENAME, tmpdir.strpath, PREFIX, n_points=n_points, stride=n_points)
# validate
npy_files = _output_npy_files_are_correct_for_cube_size(39, 16, tmpdir.strpath)
full_volume_from_file = test_util.build_volume(n_points, npy_files, tmpdir.strpath)
_compare_variance(FILENAME, PREFIX, full_volume_from_file, tmpdir.strpath)
def test_process_npy_file_should_have_same_content_as_segy(self, tmpdir):
"""
Check the actual content of a npy file generated from the segy
:param function tmpdir: pytest fixture for local test directory cleanup
"""
segyextract.process_segy_data_into_single_array(FILENAME, tmpdir.strpath, PREFIX)
npy_files = test_util.get_npy_files(tmpdir.strpath)
assert len(npy_files) == 1
data = np.load(os.path.join(tmpdir.strpath, npy_files[0]))
_compare_output_to_segy(FILENAME, data, 40, 200, 10)
def test_remove_duplicates_should_keep_order(self):
# setup
list_with_dups = [1, 2, 3, 3, 5, 8, 4, 2]
# test
result = segyextract._remove_duplicates(list_with_dups)
# validate
expected_result = [1, 2, 3, 5, 8, 4]
assert all([a == b for a, b in zip(result, expected_result)])
def test_identify_fast_direction_should_handle_xline_sequence_1(self):
# setup
df = pd.DataFrame({"i": [101, 102, 102, 102, 103, 103], "j": [301, 301, 302, 303, 301, 302]})
# test
segyextract._identify_fast_direction(df, "fast", "slow")
# validate
assert df.keys()[0] == "fast"
assert df.keys()[1] == "slow"
def test_identify_fast_direction_should_handle_xline_sequence_2(self):
# setup
df = pd.DataFrame({"i": [101, 102, 102, 102, 102, 102], "j": [301, 301, 302, 303, 304, 305]})
# test
segyextract._identify_fast_direction(df, "fast", "slow")
# validate
assert df.keys()[0] == "fast"
assert df.keys()[1] == "slow"
def _output_npy_files_are_correct_for_cube_size(expected_count, cube_size, outputdir):
"""
Check # of npy files in directory
:param int expected_count: expected # of npy files
:param str outputdir: directory to check for npy files
:param int cube_size: size of cube array
:returns: npy_files in outputdir
:rtype: list
"""
npy_files = test_util.get_npy_files(outputdir)
assert len(npy_files) == expected_count
data = np.load(os.path.join(outputdir, npy_files[0]))
assert len(data.shape) == 3
assert data.shape.count(cube_size) == 3
return npy_files
def _compare_output_to_segy(filename, data, fast_size, slow_size, depth):
"""
Compares each trace in the segy file to the data volume that
was generated from the npy file. This only works when a single npy
is created from a cuboid SEGY; if the dimensions are not aligned, the trace-by-trace comparison will not match.
:param str filename: path to segy file
:param nparray data: data read in from npy files
"""
with segyio.open(filename, ignore_geometry=True) as segy_file:
segy_file.mmap()
segy_sum = np.float32(0.0)
npy_sum = np.float32(0.0)
# Validate that each trace in the segy file is represented in the npy files
# Sum traces in segy and npy to ensure they are correct
for j in range(0, fast_size): # Fast
for i in range(0, slow_size): # Slow
trace = segy_file.trace[i + (j * slow_size)]
data_trace = data[j, i, :]
assert all([a == b for a, b in zip(trace, data_trace)]), f"Unmatched trace at {j}:{i}"
segy_sum += np.sum(trace, dtype=np.float32)
npy_sum += np.sum(data_trace, dtype=np.float32)
assert segy_sum == npy_sum
def _compare_variance(filename, prefix, data, outputdir):
"""
Compares the standard deviation calculated from the full volume to the
standard deviation calculated while creating the npy files
:param str filename: path to segy file
:param str prefix: prefix used to find files
:param nparray data: data read in from npy files
:param str outputdir: location of npy files
"""
with segyio.open(filename, ignore_geometry=True) as segy_file:
segy_file.mmap()
segy_stddev = np.sqrt(np.var(data))
# Check statistics file generated from segy
with open(os.path.join(outputdir, prefix + "_stats.json"), "r") as f:
metadatastr = f.read()
metadata = json.loads(metadatastr)
stddev = float(metadata["stddev"])
assert round(stddev) == round(segy_stddev)


@ -0,0 +1,110 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.
"""
Utility functions for pytest
"""
import numpy as np
import os
import segyio
def is_npy(s):
"""
Filter check for npy files
:param str s: file path
:returns: True if npy
:rtype: bool
"""
if s.find(".npy") == -1:
return False
else:
return True
def get_npy_files(outputdir):
"""
List npy files
:param str outputdir: location of npy files
:returns: npy_files
:rtype: list
"""
npy_files = os.listdir(outputdir)
npy_files = list(filter(is_npy, npy_files))
npy_files.sort()
return npy_files
def build_volume(n_points, npy_files, file_location):
"""
Rebuild volume from npy files. This only works for a vertical column of
npy files. If there is a cube of files, then a new algorithm will be required to
stitch them back together
:param int n_points: size of cube expected in npy_files
:param list npy_files: list of files to load into vertical volume
:param str file_location: directory for npy files to add to array
:returns: numpy array created by stacking the npy_file arrays vertically (third axis)
:rtype: numpy.array
"""
full_volume_from_file = np.zeros((n_points, n_points, n_points * len(npy_files)), dtype=np.float32)
for i, file in enumerate(npy_files):
data = np.load(os.path.join(file_location, file))
full_volume_from_file[:, :, n_points * i : n_points * (i + 1)] = data
return full_volume_from_file
def create_segy_file(
masklambda, filename, sorting=segyio.TraceSortingFormat.INLINE_SORTING, ilinerange=[10, 50], xlinerange=[100, 300]
):
# segyio.spec is the minimum set of values for a valid segy file.
spec = segyio.spec()
spec.sorting = 2
spec.format = 1
spec.samples = range(int(10))
spec.ilines = range(*map(int, ilinerange))
spec.xlines = range(*map(int, xlinerange))
print(f"Written to {filename}")
print(f"\tinlines: {len(spec.ilines)}")
print(f"\tcrosslines: {len(spec.xlines)}")
with segyio.create(filename, spec) as f:
# each inline consists of len(spec.xlines) traces,
# each of which has len(spec.samples) samples
step = 0.00001
start = step * len(spec.samples)
# fill a trace with predictable values: left-of-comma is the inline
# number. Immediately right of comma is the crossline number
# the rightmost digits is the index of the sample in that trace meaning
# looking up an inline's i's jth crosslines' k should be roughly equal
# to i.j0k
trace = np.linspace(-1, 1, len(spec.samples), True, dtype=np.single)
if sorting == segyio.TraceSortingFormat.INLINE_SORTING:
# Write the file trace-by-trace and update headers with iline, xline
# and offset
tr = 0
for il in spec.ilines:
for xl in spec.xlines:
if masklambda(il, xl):
f.header[tr] = {segyio.su.offset: 1, segyio.su.iline: il, segyio.su.xline: xl}
f.trace[tr] = trace * ((xl / 100.0) + il)
tr += 1
f.bin.update(tsort=segyio.TraceSortingFormat.INLINE_SORTING)
else:
# Write the file trace-by-trace and update headers with iline, xline
# and offset
tr = 0
for il in spec.ilines:
for xl in spec.xlines:
if masklambda(il, xl):
f.header[tr] = {segyio.su.offset: 1, segyio.su.iline: il, segyio.su.xline: xl}
f.trace[tr] = trace * (xl / 100.0) + il
tr += 1
f.bin.update(tsort=segyio.TraceSortingFormat.CROSSLINE_SORTING)
# Add some noise for clipping and normalization tests
f.trace[tr // 2] = trace * ((max(spec.xlines) / 100.0) + max(spec.ilines)) * 20
f.trace[tr // 3] = trace * ((min(spec.xlines) / 100.0) + min(spec.ilines)) * 20
print(f"\ttraces: {tr}")


@ -0,0 +1,130 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.
import segyio
import numpy as np
from glob import glob
from os import listdir
import os
import pandas as pd
import re
import matplotlib.pyplot as pyplot
def parse_trace_headers(segyfile, n_traces):
"""
Parse the segy file trace headers into a pandas dataframe.
Column names are defined from segyio internal tracefield
One row per trace
"""
# Get all header keys
headers = segyio.tracefield.keys
# Initialize dataframe with trace id as index and headers as columns
df = pd.DataFrame(index=range(1, n_traces + 1), columns=headers.keys())
# Fill dataframe with all header values
for k, v in headers.items():
df[k] = segyfile.attributes(v)[:]
return df
def parse_text_header(segyfile):
"""
Format segy text header into a readable, clean dict
"""
raw_header = segyio.tools.wrap(segyfile.text[0])
# Cut on C*int pattern
cut_header = re.split(r"C ", raw_header)[1::]
# Remove end of line return
text_header = [x.replace("\n", " ") for x in cut_header]
text_header[-1] = text_header[-1][:-2]
# Format in dict
clean_header = {}
i = 1
for item in text_header:
key = "C" + str(i).rjust(2, "0")
i += 1
clean_header[key] = item
return clean_header
def show_segy_details(segyfile):
with segyio.open(segyfile, ignore_geometry=True) as segy:
segydf = parse_trace_headers(segy, segy.tracecount)
print(f"Loaded from file {segyfile}")
print(f"\tTracecount: {segy.tracecount}")
print(f"\tData Shape: {segydf.shape}")
print(f"\tSample length: {len(segy.samples)}")
pyplot.figure(figsize=(10, 6))
pyplot.scatter(segydf[["INLINE_3D"]], segydf[["CROSSLINE_3D"]], marker=",")
pyplot.xlabel("inline")
pyplot.ylabel("crossline")
pyplot.show()
def load_segy_with_geometry(segyfile):
try:
segy = segyio.open(segyfile, ignore_geometry=False)
segy.mmap()
print(f"Loaded with geometry: {segyfile} :")
print(f"\tNum samples per trace: {len(segy.samples)}")
print(f"\tNum traces in file: {segy.tracecount}")
except ValueError as ex:
print(f"Load failed with geometry: {segyfile} :")
print(ex)
def create_segy_file(
masklambda, filename, sorting=segyio.TraceSortingFormat.INLINE_SORTING, ilinerange=[10, 50], xlinerange=[100, 300]
):
spec = segyio.spec()
# to create a file from nothing, we need to tell segyio about the structure of
# the file, i.e. its inline numbers, crossline numbers, etc. You can also add
# more structural information, but offsets etc. have sensible defaults. This is
# the absolute minimal specification for an N-by-M volume
spec.sorting = 2
spec.format = 1
spec.samples = range(int(10))
spec.ilines = range(*map(int, ilinerange))
spec.xlines = range(*map(int, xlinerange))
print(f"Written to {filename}")
print(f"\tinlines: {len(spec.ilines)}")
print(f"\tcrosslines: {len(spec.xlines)}")
with segyio.create(filename, spec) as f:
# each inline consists of len(spec.xlines) traces,
# each of which has len(spec.samples) samples
step = 0.00001
start = step * len(spec.samples)
# fill a trace with predictable values: left-of-comma is the inline
# number. Immediately right of comma is the crossline number
# the rightmost digits is the index of the sample in that trace meaning
# looking up an inline's i's jth crosslines' k should be roughly equal
# to i.j0k
trace = np.linspace(-1, 1, len(spec.samples), True, dtype=np.single)
if sorting == segyio.TraceSortingFormat.INLINE_SORTING:
# Write the file trace-by-trace and update headers with iline, xline
# and offset
tr = 0
for il in spec.ilines:
for xl in spec.xlines:
if masklambda(il, xl):
f.header[tr] = {segyio.su.offset: 1, segyio.su.iline: il, segyio.su.xline: xl}
f.trace[tr] = trace * ((xl / 100.0) + il)
tr += 1
f.bin.update(tsort=segyio.TraceSortingFormat.CROSSLINE_SORTING)
else:
# Write the file trace-by-trace and update headers with iline, xline
# and offset
tr = 0
for il in spec.ilines:
for xl in spec.xlines:
if masklambda(il, xl):
f.header[tr] = {segyio.su.offset: 1, segyio.su.iline: il, segyio.su.xline: xl}
f.trace[tr] = trace + (xl / 100.0) + il
tr += 1
f.bin.update(tsort=segyio.TraceSortingFormat.INLINE_SORTING)
print(f"\ttraces: {tr}")


@ -0,0 +1,140 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.
"""
Utility Script to normalize one cube
"""
import numpy as np
def compute_statistics(stddev: float, mean: float, max_range: float, k: int):
"""
Compute min_clip, max_clip and scale values based on the provided stddev, mean, max_range and k values
:param stddev: standard deviation value
:param mean: mean value of the data
:param max_range: maximum value range
:param k: number of standard deviations to be used in normalization
:returns: min_clip, max_clip, scale: computed values
:rtype: float, float, float
"""
min_clip = mean - k * stddev
max_clip = mean + k * stddev
scale = max_range / (max_clip - min_clip)
return min_clip, max_clip, scale
def clip_value(v: float, min_clip: float, max_clip: float):
"""
Clip seismic voxel value
:param min_clip: minimum value used for clipping
:param max_clip: maximum value used for clipping
:returns: clipped value, must be within [min_clip, max_clip]
:rtype: float
"""
# Clip value
if v > max_clip:
v = max_clip
if v < min_clip:
v = min_clip
return v
def norm_value(v: float, min_clip: float, max_clip: float, min_range: float, max_range: float, scale: float):
"""
Normalize seismic voxel value to be within [min_range, max_range] according to
statistics computed previously
:param v: value to be normalized
:param min_clip: minimum value used for clipping
:param max_clip: maximum value used for clipping
:param min_range: minimum range value
:param max_range: maximum range value
:param scale: scale value to be used for normalization
:returns: normalized value, must be within [min_range, max_range]
:rtype: float
"""
offset = -1 * min_clip # Normalizing - set values between 0 and 1
# Clip value
v = clip_value(v, min_clip, max_clip)
# Scale value
v = (v + offset) * scale
# This value should ALWAYS be between min_range and max_range here
if v > max_range or v < min_range:
raise Exception(
"normalized value should be within [{0},{1}].\
The value was: {2}".format(
min_range, max_range, v
)
)
return v
def normalize_cube(cube: np.array, min_clip: float, max_clip: float, scale: float, min_range: float, max_range: float):
"""
Normalize cube according to statistics. Normalization implies clipping the cube before scaling.
:param cube: 3D array to be normalized
:param min_clip: minimum value used for clipping
:param max_clip: maximum value used for clipping
:param min_range: minimum range value
:param max_range: maximum range value
:param scale: scale value to be used for normalization
:returns: normalized 3D array
:rtype: numpy array
"""
# Define function for normalization
vfunc = np.vectorize(norm_value)
# Normalize cube
norm_cube = vfunc(cube, min_clip=min_clip, max_clip=max_clip, min_range=min_range, max_range=max_range, scale=scale)
return norm_cube
def clip_cube(cube: np.array, min_clip: float, max_clip: float):
"""
Clip cube values according to statistics
:param min_clip: minimum value used for clipping
:param max_clip: maximum value used for clipping
:returns: clipped 3D array
:rtype: numpy array
"""
# Define function for clipping
vfunc = np.vectorize(clip_value)
clip_cube = vfunc(cube, min_clip=min_clip, max_clip=max_clip)
return clip_cube
def apply(
cube: np.array, stddev: float, mean: float, k: float, min_range: float, max_range: float, clip=True, normalize=True
):
"""
Prepare data according to provided parameters. This method will compute statistics and can
normalize and clip, just clip, or leave the data as is.
:param cube: 3D array to be normalized
:param stddev: standard deviation value
:param mean: mean value of the data
:param k: number of standard deviations to be used in normalization
:param min_range: minimum range value
:param max_range: maximum range value
:param clip: flag to turn on/off clip
:param normalize: flag to turn on/off normalization.
:returns: processed 3D array
:rtype: numpy array
"""
if np.isnan(np.min(cube)):
raise Exception("Cube has NaN value")
if stddev == 0.0:
raise Exception("Standard deviation must not be zero")
if k == 0:
raise Exception("k must not be zero")
# Compute statistics
min_clip, max_clip, scale = compute_statistics(stddev=stddev, mean=mean, k=k, max_range=max_range)
if normalize:
# Normalize and clip cube. Note that it is not possible to normalize data without
# applying the clip operation
print("Normalizing and Clipping File")
return normalize_cube(
cube=cube, min_clip=min_clip, max_clip=max_clip, scale=scale, min_range=min_range, max_range=max_range
)
elif clip:
# Only clip values
print("Clipping File")
return clip_cube(cube=cube, min_clip=min_clip, max_clip=max_clip)
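A minimal usage sketch of the helpers above; the random cube and k value are illustrative. apply() derives clip bounds from the provided statistics and returns values in [min_range, max_range]:
import numpy as np
from deepseismic_interpretation.segyconverter.utils import dataprep

cube = np.random.default_rng(0).normal(0.0, 10.0, size=(32, 32, 32)).astype(np.float32)
stddev, mean = float(np.std(cube)), float(np.mean(cube))
normalized = dataprep.apply(cube, stddev, mean, k=12, min_range=0.0, max_range=1.0)
assert 0.0 <= normalized.min() and normalized.max() <= 1.0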


@ -0,0 +1,344 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.
"""
Methods for processing segy files that do not include well formed geometry. In these cases, segyio
cannot infer the 3D volume of data from the traces so this module needs to do that manually
"""
import os
import math
import segyio
import pandas as pd
import numpy as np
import json
# strings which indicate which slice direction has fewer datapoints, i.e. faster to iterate through
FAST = "fast"
SLOW = "slow"
DEFAULT_VALUE = 255
def get_segy_metadata(input_file, iline, xline):
"""
Loads segy file and uses the input inline and crossline byte values to load
the trace headers. It determines which inline or crossline the traces
start with. SEGY files can be non standard and use other byte location for
these values. In that case, the data from this method will be erroneous. It
is up to the user to figure out which numbers to use by reading the SEGY text
header and finding the byte offsets visually.
:param str input_file: path to segy file
:param int iline: inline byte position
:param int xline: crossline byte position
:returns: fast_distinct, slow_distinct, trace_headers, samplesize
:rtype: list, DataFrame, DataFrame, int
"""
with segyio.open(input_file, ignore_geometry=True) as segy_file:
segy_file.mmap()
# Initialize df with trace id as index and headers as columns
trace_headers = pd.DataFrame(index=range(0, segy_file.tracecount), columns=["i", "j"])
# Fill dataframe with all trace headers values
trace_headers["i"] = segy_file.attributes(iline)
trace_headers["j"] = segy_file.attributes(xline)
_identify_fast_direction(trace_headers, FAST, SLOW)
samplesize = len(segy_file.samples)
fast_distinct = _remove_duplicates(trace_headers[FAST])
slow_distinct = np.unique(trace_headers[SLOW])
return fast_distinct, slow_distinct, trace_headers, samplesize
def process_segy_data_into_single_array(input_file, output_dir, prefix, iline=189, xline=193):
"""
Open segy file and write all data to single npy array
:param str input_file: path to segyfile
:param str output_dir: path to directory where npy files will be written/rewritten
:param str prefix: prefix to use when writing npy files
:param int iline: iline header byte location
:param int xline: crossline header byte location
:returns: 3-dimensional numpy array of SEGY data
:rtype: nparray
"""
fast_distinct, slow_distinct, trace_headers, sampledepth = get_segy_metadata(input_file, iline, xline)
with segyio.open(input_file, ignore_geometry=True) as segy_file:
segy_file.mmap()
fast_line_space = abs(fast_distinct[1] - fast_distinct[0])
slow_line_space = abs(slow_distinct[0] - slow_distinct[1])
sample_size = len(segy_file.samples)
layer_fastmax = max(fast_distinct)
layer_fastmin = min(fast_distinct)
layer_slowmax = max(slow_distinct)
layer_slowmin = min(slow_distinct)
layer_trace_ids = trace_headers[
(trace_headers.fast >= layer_fastmin)
& (trace_headers.fast <= layer_fastmax)
& (trace_headers.slow >= layer_slowmin)
& (trace_headers.slow <= layer_slowmax)
]
block = np.full((len(fast_distinct), len(slow_distinct), sampledepth), DEFAULT_VALUE, dtype=np.float32)
for _, row in layer_trace_ids.iterrows():
block[
(row[FAST] - layer_fastmin) // fast_line_space,
(row[SLOW] - layer_slowmin) // slow_line_space,
0:sample_size,
] = segy_file.trace[row.name]
np.save(
os.path.join(output_dir, "{}_{}_{}_{:05d}".format(prefix, fast_distinct[0], slow_distinct[0], 0)), block
)
variance = np.var(block)
stddev = np.sqrt(variance)
mean = np.mean(block)
with open(os.path.join(output_dir, prefix + "_stats.json"), "w") as f:
f.write(json.dumps({"stddev": str(stddev), "mean": str(mean)}))
print("Npy files written: 1")
return block
def process_segy_data(input_file, output_dir, prefix, iline=189, xline=193, n_points=128, stride=128):
"""
Open segy file and write all numpy array files to disk
:param str input_file: path to segyfile
:param str output_dir: path to directory where npy files will be written/rewritten
:param str prefix: prefix to use when writing npy files
:param int iline: iline header byte location
:param int xline: crossline header byte location
:param int n_points: output cube size
:param int stride: stride when writing data
"""
fast_indexes, slow_indexes, trace_headers, _ = get_segy_metadata(input_file, iline, xline)
with segyio.open(input_file, ignore_geometry=True) as segy_file:
segy_file.mmap()
# Global variance of segy data
variance = 0
mean = 0
sample_count = 0
filecount = 0
block_size = n_points ** 3
for block, i, j, k in _generate_all_blocks(
segy_file, n_points, stride, fast_indexes, slow_indexes, trace_headers
):
# Getting global variance as sum of local variance
if variance == 0:
# init
variance = np.var(block)
mean = np.mean(block)
sample_count = block_size
else:
new_avg = np.mean(block)
new_variance = np.var(block)
variance = _parallel_variance(mean, sample_count, variance, new_avg, block_size, new_variance)
mean = ((mean * sample_count) + np.sum(block)) / (sample_count + block_size)
sample_count += block_size
np.save(os.path.join(output_dir, "{}_{}_{}_{:05d}".format(prefix, i, j, k)), block)
filecount += 1
stddev = np.sqrt(variance)
with open(os.path.join(output_dir, prefix + "_stats.json"), "w") as f:
f.write(json.dumps({"stddev": stddev, "mean": mean}))
print("Npy files written: {}".format(filecount))
def process_segy_data_column(input_file, output_dir, prefix, i, j, iline=189, xline=193, n_points=128, stride=128):
"""
Open segy file and write one column of npy files to disk
:param str input_file: segy file path
:param str output_dir: local output directory for npy files
:param str prefix: naming prefix for npy files
:param int i: fast index anchor for the column to extract
:param int j: slow index anchor for the column to extract
:param int iline: header byte location for inline
:param int xline: header byte location for crossline
:param int n_points: size of cube
:param int stride: stride for generating cubes
"""
fast_indexes, slow_indexes, trace_headers, _ = get_segy_metadata(input_file, iline, xline)
with segyio.open(input_file, ignore_geometry=True) as segy_file:
segy_file.mmap()
filecount = 0
for block, i, j, k in _generate_column_blocks(
segy_file, n_points, stride, i, j, fast_indexes, slow_indexes, trace_headers
):
np.save(os.path.join(output_dir, "{}_{}_{}_{}".format(prefix, i, j, k)), block)
filecount += 1
print("Files written: {}".format(filecount))
def _parallel_variance(avg_a, count_a, var_a, avg_b, count_b, var_b):
"""
Calculate the combined variance based on the previously calculated variance
:param float avg_a: running average
:param float count_a: running sample count
:param float var_a: running variance
:param float avg_b: new block average
:param float count_b: new block sample count
:param float var_b: new block variance
:returns: new variance
:rtype: float
"""
delta = avg_b - avg_a
m_a = var_a * (count_a - 1)
m_b = var_b * (count_b - 1)
M2 = m_a + m_b + delta ** 2 * count_a * count_b / (count_a + count_b)
return M2 / (count_a + count_b - 1)
def _identify_fast_direction(trace_headers, fastlabel, slowlabel):
"""
Relabels the dataframe columns in place as 'fast' and 'slow'
Uses the count of changes in indexes for both columns to determine which one is the fast index
:param DataFrame trace_headers: dataframe with two columns
:param str fastlabel: key label for the fast index
:param str slowlabel: key label for the slow index
"""
j_count = 0
i_count = 0
last_trace = 0
slope_run = 5
for trace in trace_headers["j"][0:slope_run]:
if not last_trace == trace:
j_count += 1
last_trace = trace
last_trace = 0
for trace in trace_headers["i"][0:slope_run]:
if not last_trace == trace:
i_count += 1
last_trace = trace
if i_count < j_count:
trace_headers.columns = [fastlabel, slowlabel]
else:
trace_headers.columns = [slowlabel, fastlabel]
def _remove_duplicates(list_of_elements):
"""
Remove duplicates from a list but maintain the order
:param list list_of_elements: list to be deduped
:returns: list containing a distinct list of elements
:rtype: list
"""
seen = set()
return [x for x in list_of_elements if not (x in seen or seen.add(x))]
def _get_trace_column(n_lines, i, j, trace_headers, fast_distinct, slow_distinct, segyfile):
"""
:param int n_lines: number of voxels to extract in each dimension
:param int i: fast index anchor for origin of column
:param int j: slow index anchor for origin of column
:param DataFrame trace_headers: DataFrame of all trace headers
:param list fast_distinct: list of distinct fast headers
:param list slow_distinct: list of distinct slow headers
:param segyio.file segyfile: segy file object previously opened using segyio
:returns: thiscolumn, layer_fastmin, layer_slowmin
:rtype: nparray, int, int
"""
layer_fastidxs = fast_distinct[i : i + n_lines]
fast_line_space = abs(fast_distinct[1] - fast_distinct[0])
layer_slowidxs = slow_distinct[j : j + n_lines]
slow_line_space = abs(slow_distinct[0] - slow_distinct[1])
sample_size = len(segyfile.samples)
sample_chunck_count = math.ceil(sample_size / n_lines)
layer_fastmax = max(layer_fastidxs)
layer_fastmin = min(layer_fastidxs)
layer_slowmax = max(layer_slowidxs)
layer_slowmin = min(layer_slowidxs)
layer_trace_ids = trace_headers[
(trace_headers.fast >= layer_fastmin)
& (trace_headers.fast <= layer_fastmax)
& (trace_headers.slow >= layer_slowmin)
& (trace_headers.slow <= layer_slowmax)
]
thiscolumn = np.zeros((n_lines, n_lines, sample_chunck_count * n_lines), dtype=np.float32)
for _, row in layer_trace_ids.iterrows():
thiscolumn[
(row[FAST] - layer_fastmin) // fast_line_space,
(row[SLOW] - layer_slowmin) // slow_line_space,
0:sample_size,
] = segyfile.trace[row.name]
return thiscolumn, layer_fastmin, layer_slowmin
def _generate_column_blocks(segy_file, n_points, stride, i, j, fast_indexes, slow_indexes, trace_headers):
"""
Generate arrays for an open segy file (via segyio)
:param segyio.file segy_file: input segy file previously opened using segyio
:param int n_points: number of voxels to extract in each dimension
:param int stride: overlap for output cubes
:param int i: fast index anchor for origin of column
:param int j: slow index anchor for origin of column
:param list fast_indexes: list of distinct fast headers
:param list slow_indexes: list of distinct slow headers
:param DataFrame trace_headers: trace headers including fast and slow indexes
:returns: thiscolumn, fast_anchor, slow_anchor, k
:rtype: nparray, int, int, int
"""
sample_size = len(segy_file.samples)
thiscolumn, fast_anchor, slow_anchor = _get_trace_column(
n_points, i, j, trace_headers, fast_indexes, slow_indexes, segy_file
)
for k in range(0, sample_size - stride, stride):
yield thiscolumn[i : (i + n_points), j : (j + n_points), k : (k + n_points)], fast_anchor, slow_anchor, k
def _generate_all_blocks(segy_file, n_points, stride, fast_indexes, slow_indexes, trace_headers):
"""
Generate arrays for an open segy file (via segyio)
:param segyio.file segy_file: input segy file previously opened using segyio
:param int n_points: number of voxels to extract in each dimension
:param int stride: overlap for output cubes
:param list fast_indexes: list of distinct fast headers
:param list slow_indexes: list of distinct slow headers
:param DataFrame trace_headers: trace headers including fast and slow indexes
:returns: thiscolumn, fast_anchor, slow_anchor, k
:rtype: nparray, int, int, int
"""
slow_size = len(slow_indexes)
fast_size = len(fast_indexes)
sample_size = len(segy_file.samples)
# Handle edge case when stride is larger than slow_size and fast_size
fast_lim = fast_size
slow_lim = slow_size
for i in range(0, fast_lim, stride):
for j in range(0, slow_lim, stride):
thiscolumn, fast_anchor, slow_anchor = _get_trace_column(
n_points, i, j, trace_headers, fast_indexes, slow_indexes, segy_file
)
for k in range(0, sample_size, stride):
yield thiscolumn[:, :, k : (k + n_points)], fast_anchor, slow_anchor, k
def timewrapper(func, *args, **kwargs):
"""
Utility function to pass arguments while using the timeit module
:param function func: function to wrap
:param args: parameters accepted by func
:param kwargs: optional parameters accepted by func
:returns: wrapped
:rtype: function
"""
def wrapped():
"""
Wrapper function that takes no arguments
:returns: result of func(*args, **kwargs)
"""
return func(*args, **kwargs)
return wrapped
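An illustrative sanity check of the running-variance combination used in process_segy_data: merging two blocks' statistics with the private helper _parallel_variance should approximately agree with the variance of the concatenated data. The arrays here are made up:
import numpy as np
from deepseismic_interpretation.segyconverter.utils import segyextract

a = np.arange(1000, dtype=np.float32)
b = np.arange(1000, 3000, dtype=np.float32)
combined = segyextract._parallel_variance(a.mean(), a.size, np.var(a), b.mean(), b.size, np.var(b))
assert np.isclose(combined, np.var(np.concatenate([a, b])), rtol=1e-2)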


@ -1,3 +1,6 @@
numpy>=1.17.0
azure-cli-core
azureml-sdk==1.0.74
azureml-sdk==1.0.83
azureml-contrib-pipeline-steps==1.0.83
azureml-contrib-services==1.0.83
python-dotenv==0.10.5


@ -0,0 +1,25 @@
{
"step1": {
"type": "MpiStep",
"name": "train step",
"script": "train.py",
"input_datareference_path": "data/",
"input_datareference_name": "ds_test",
"input_dataset_name": "deepseismic_test_dataset",
"source_directory": "experiments/interpretation/dutchf3_patch",
"arguments": [
"--cfg",
"configs/unet.yaml",
"TRAIN.END_EPOCH",
"1",
"TRAIN.SNAPSHOTS",
"1",
"DATASET.ROOT",
"data"
],
"requirements": "experiments/interpretation/dutchf3_patch/azureml_requirements.txt",
"node_count": 1,
"processes_per_node": 1,
"base_image": "pytorch/pytorch"
}
}
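A minimal sketch, mirroring the integration test below, of how a config like this might drive the AzureML training pipeline; the config path and experiment name are illustrative:
from deepseismic_interpretation.azureml_pipelines.train_pipeline import TrainPipeline

orchestrator = TrainPipeline("experiments/interpretation/dutchf3_patch/pipeline_config.json")  # assumed path
orchestrator.construct_pipeline()
run = orchestrator.run_pipeline(experiment_name="DEV-train-pipeline")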


@ -0,0 +1,97 @@
"""
Integration tests for the train pipeline
"""
import pytest
from deepseismic_interpretation.azureml_pipelines.train_pipeline import TrainPipeline
import json
import os
TEMP_CONFIG_FILE = "test_batch_config.json"
test_data = None
class TestTrainPipelineIntegration:
"""
Class used for testing the training pipeline
"""
global test_data
test_data = {
"step1": {
"type": "MpiStep",
"name": "train step",
"script": "train.py",
"input_datareference_path": "data/",
"input_datareference_name": "training_data",
"input_dataset_name": "f3_data",
"source_directory": "experiments/interpretation/dutchf3_patch",
"arguments": ["TRAIN.END_EPOCH", "1"],
"requirements": "requirements.txt",
"node_count": 1,
"processes_per_node": 1,
}
}
@pytest.fixture(scope="function", autouse=True)
def teardown(self):
yield
if hasattr(self, "run"):
self.run.cancel()
os.remove(TEMP_CONFIG_FILE)
def test_train_pipeline_expected_inputs_submits_correctly(self):
# arrange
self._setup_test_config()
orchestrator = TrainPipeline(
"interpretation/tests/example_config.json"
) # updated this to be an example of our config
# act
orchestrator.construct_pipeline()
self.run = orchestrator.run_pipeline(experiment_name="TEST-train-pipeline")
# assert
assert self.run.get_status() in ("Running", "NotStarted")
@pytest.mark.parametrize(
"step,missing_dependency",
[
("step1", "name"),
("step1", "type"),
("step1", "input_datareference_name"),
("step1", "input_datareference_path"),
("step1", "input_dataset_name"),
("step1", "source_directory"),
("step1", "script"),
("step1", "arguments"),
],
)
def test_missing_dependency_in_config_throws_error(self, step, missing_dependency):
# iterates through all config dependencies, leaving each one out in turn
# arrange
self.data = test_data
self._create_config_without(step, missing_dependency)
self._setup_test_config()
orchestrator = TrainPipeline(self.test_config)
# act / assert
with pytest.raises(KeyError):
orchestrator.construct_pipeline()
def _create_config_without(self, step, dependency_to_omit):
"""
helper function that removes dependencies from config file
:param str step: name of the step with omitted dependency
:param str dependency_to_omit: the dependency you want to omit from the config
"""
self.data[step].pop(dependency_to_omit, None)
def _setup_test_config(self):
"""
helper function that saves the test data in a temp config file
"""
self.data = test_data
self.test_config = TEMP_CONFIG_FILE
with open(self.test_config, "w") as data_file:
json.dump(self.data, data_file)


@ -1,4 +1,6 @@
#!/bin/bash
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.
# autoformats all files in the repo to black


@ -0,0 +1,151 @@
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.
"""
Run example:
python byod_penobscot.py --filename <input HDF5 file> --outdir <where to output data>
python prepare_dutchf3.py split_train_val patch --data_dir=<outdir from the previous step> --label_file=train/train_labels.npy --output_dir=splits --stride=50 --patch_size=100 --split_direction=both --section_stride=100
"""
import sklearn
""" libraries """
import h5py
import numpy as np
import os
np.set_printoptions(linewidth=200)
import logging
# toggle to WARNING when running in production, or use CLI
logging.getLogger().setLevel(logging.DEBUG)
# logging.getLogger().setLevel(logging.WARNING)
import argparse
parser = argparse.ArgumentParser()
""" useful information when running from a GIT folder."""
myname = os.path.realpath(__file__)
mypath = os.path.dirname(myname)
myname = os.path.basename(myname)
def main(args):
"""
Transforms Penobscot HDF5 dataset into DeepSeismic Tensor Format
"""
logging.info("loading data")
f = h5py.File(args.filename, "r")
data = f["features"][:, :, :, 0]
labels = f["label"][:, :, :]
assert labels.min() == 0
n_classes = labels.max() + 1
assert n_classes == N_CLASSES
# inline x depth x crossline, make it inline x crossline x depth
data = np.swapaxes(data, 1, 2)
labels = np.swapaxes(labels, 1, 2)
# Make data cube fast to access
data = np.ascontiguousarray(data, "float32")
labels = np.ascontiguousarray(labels, "uint8")
# combine classes 4 and 5 (index 3 and 4)- shift others down
labels[labels > 3] -= 1
# rescale to be within a certain range
range_min, range_max = -1.0, 1.0
data_std = (data - data.min()) / (data.max() - data.min())
data = data_std * (range_max - range_min) + range_min
"""
# cut off a buffer zone around the volume (to avoid mislabeled data):
buffer = 25
data = data[:, buffer:-buffer, buffer:-buffer]
labels = labels[:, buffer:-buffer, buffer:-buffer]
"""
# data is now inlines by crosslines by depth
n_inlines = data.shape[0]
n_crosslines = data.shape[1]
inline_cut = int(np.floor(n_inlines * INLINE_FRACTION))
crossline_cut = int(np.floor(n_crosslines * CROSSLINE_FRACTION))
data_train = data[0:inline_cut, 0:crossline_cut, :]
data_test1 = data[inline_cut:n_inlines, :, :]
data_test2 = data[:, crossline_cut:n_crosslines, :]
labels_train = labels[0:inline_cut, 0:crossline_cut, :]
labels_test1 = labels[inline_cut:n_inlines, :, :]
labels_test2 = labels[:, crossline_cut:n_crosslines, :]
def mkdir(dirname):
if os.path.isdir(dirname) and os.path.exists(dirname):
return
if not os.path.isdir(dirname) and os.path.exists(dirname):
logging.info("remote file", dirname, "and run this script again")
os.mkdir(dirname)
mkdir(args.outdir)
mkdir(os.path.join(args.outdir, "splits"))
mkdir(os.path.join(args.outdir, "train"))
mkdir(os.path.join(args.outdir, "test_once"))
np.save(os.path.join(args.outdir, "train", "train_seismic.npy"), data_train)
np.save(os.path.join(args.outdir, "train", "train_labels.npy"), labels_train)
np.save(os.path.join(args.outdir, "test_once", "test1_seismic.npy"), data_test1)
np.save(os.path.join(args.outdir, "test_once", "test1_labels.npy"), labels_test1)
np.save(os.path.join(args.outdir, "test_once", "test2_seismic.npy"), data_test2)
np.save(os.path.join(args.outdir, "test_once", "test2_labels.npy"), labels_test2)
# Compute class weights:
num_classes, class_count = np.unique(labels[:], return_counts=True)
# class_probabilities = np.histogram(labels[:], bins= , density=True)
class_weights = 1 - class_count / np.sum(class_count)
logging.info("CLASS WEIGHTS TO USE")
logging.info(class_weights)
logging.info("MEAN")
logging.info(data.mean())
logging.info("STANDARD DEVIATION")
logging.info(data.std())
""" GLOBAL VARIABLES """
INLINE_FRACTION = 0.7
CROSSLINE_FRACTION = 1.0
N_CLASSES = 8
parser.add_argument("--filename", help="Name of HDF5 data", type=str, required=True)
parser.add_argument("--outdir", help="Output data directory location", type=str, required=True)
""" main wrapper with profiler """
if __name__ == "__main__":
main(parser.parse_args())
# pretty printing of the stack
"""
try:
logging.info('before main')
main(parser.parse_args())
logging.info('after main')
except:
for frame in traceback.extract_tb(sys.exc_info()[2]):
fname,lineno,fn,text = frame
print ("Error in %s on line %d" % (fname, lineno))
"""
# optionally enable profiling information
# import cProfile
# name = <insert_name_here>
# cProfile.run('main.run()', name + '.prof')
# import pstats
# p = pstats.Stats(name + '.prof')
# p.sort_stats('cumulative').print_stats(10)
# p.sort_stats('time').print_stats()
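The rescaling step in main() above is plain min-max scaling into [range_min, range_max]; a standalone sketch for illustration:
import numpy as np

def rescale(data: np.ndarray, range_min: float = -1.0, range_max: float = 1.0) -> np.ndarray:
    # min-max scale into [range_min, range_max], as done for the Penobscot volume above
    data_std = (data - data.min()) / (data.max() - data.min())
    return data_std * (range_max - range_min) + range_min

assert np.allclose(rescale(np.array([0.0, 5.0, 10.0])), [-1.0, 0.0, 1.0])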


@ -1,4 +1,6 @@
#!/usr/bin/env python3
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.
""" Please see the def main() function for code description."""
import time
@ -10,6 +12,8 @@ import sys
import yaml
import subprocess
from datetime import datetime
np.set_printoptions(linewidth=200)
import logging
@ -32,6 +36,8 @@ def main(args):
add --setup to run it (destroys existing environment and creates a new one, along with all the data)
"""
beg = datetime.now()
logging.info("loading data")
@ -41,7 +47,7 @@ def main(args):
logging.info(f"Loaded {file}")
# run single job
job_names = [x["job"] for x in list["jobs"]] if not args.job else args.job.split(',')
job_names = [x["job"] for x in list["jobs"]] if not args.job else args.job.split(",")
if not args.setup and "setup" in job_names:
job_names.remove("setup")
@ -52,7 +58,7 @@ def main(args):
current_env = os.environ.copy()
# modify for conda to work
# TODO: not sure why on DS VM this does not get picked up from the standard environment
current_env["PATH"] = PATH_PREFIX+":"+current_env["PATH"]
current_env["PATH"] = PATH_PREFIX + ":" + current_env["PATH"]
for job in job_list:
job_name = job["job"]
@ -74,16 +80,16 @@ def main(args):
stderr=subprocess.STDOUT,
executable=current_env["SHELL"],
env=current_env,
cwd=os.getcwd()
cwd=os.getcwd(),
)
toc = time.perf_counter()
print(f"Job time took {(toc-tic)/60:0.2f} minutes")
except subprocess.CalledProcessError as err:
logging.info(f'ERROR: \n{err}')
decoded_stdout = err.stdout.decode('utf-8')
logging.info(f"ERROR: \n{err}")
decoded_stdout = err.stdout.decode("utf-8")
log_file = "dev_build.latest_error.log"
logging.info(f"Have {len(err.stdout)} output bytes in {log_file}")
with open(log_file, 'w') as log_file:
with open(log_file, "w") as log_file:
log_file.write(decoded_stdout)
sys.exit()
else:
@ -92,8 +98,13 @@ def main(args):
logging.info(f"Everything ran! You can try running the same jobs {job_names} on the build VM now")
end = datetime.now()
print("time elapsed in seconds", (end - beg).total_seconds())
""" GLOBAL VARIABLES """
PATH_PREFIX = "/data/anaconda/envs/seismic-interpretation/bin:/data/anaconda/bin"
PATH_PREFIX = "/anaconda/envs/seismic-interpretation/bin:/anaconda/bin"
parser.add_argument(
"--file", help="Which yaml file you'd like to read which specifies build info", type=str, required=True


@ -1,4 +1,6 @@
#!/bin/bash
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.
conda env remove -n seismic-interpretation
yes | conda env create -f environment/anaconda/local/environment.yml


@ -1,4 +1,7 @@
#!/usr/bin/env python3
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.
""" Please see the def main() function for code description."""
""" libraries """
@ -83,14 +86,14 @@ def make_gradient(n_inlines, n_crosslines, n_depth, box_size, dir="inline"):
:return: numpy array
"""
orthogonal_dir = dir # for depth case
if dir=='inline':
orthogonal_dir = 'crossline'
elif dir=='crossline':
orthogonal_dir = 'inline'
orthogonal_dir = dir # for depth case
if dir == "inline":
orthogonal_dir = "crossline"
elif dir == "crossline":
orthogonal_dir = "inline"
axis = GRADIENT_DIR.index(orthogonal_dir)
n_points = (n_inlines, n_crosslines, n_depth)[axis]
n_classes = int(np.ceil(float(n_points) / box_size))
logging.info(f"GRADIENT: we will output {n_classes} classes in the {dir} direction")
@ -127,35 +130,57 @@ def main(args):
logging.info("loading data")
train_seismic = np.load(os.path.join(args.dataroot, "train", "train_seismic.npy"))
train_labels = np.load(os.path.join(args.dataroot, "train", "train_labels.npy"))
test1_seismic = np.load(os.path.join(args.dataroot, "test_once", "test1_seismic.npy"))
test1_labels = np.load(os.path.join(args.dataroot, "test_once", "test1_labels.npy"))
test2_seismic = np.load(os.path.join(args.dataroot, "test_once", "test2_seismic.npy"))
test2_labels = np.load(os.path.join(args.dataroot, "test_once", "test2_labels.npy"))
# TODO: extend this to binary and gradient
if args.type != "checkerboard":
assert args.based_on == "dutch_f3"
assert train_seismic.shape == train_labels.shape
assert train_seismic.min() == WHITE
assert train_seismic.max() == BLACK
assert train_labels.min() == 0
# this is the number of classes in Alaudah's Dutch F3 dataset
assert train_labels.max() == 5
logging.info(f"synthetic data generation based on {args.based_on}")
assert test1_seismic.shape == test1_labels.shape
assert test1_seismic.min() == WHITE
assert test1_seismic.max() == BLACK
assert test1_labels.min() == 0
# this is the number of classes in Alaudah's Dutch F3 dataset
assert test1_labels.max() == 5
if args.based_on == "dutch_f3":
assert test2_seismic.shape == test2_labels.shape
assert test2_seismic.min() == WHITE
assert test2_seismic.max() == BLACK
assert test2_labels.min() == 0
# this is the number of classes in Alaudah's Dutch F3 dataset
assert test2_labels.max() == 5
train_seismic = np.load(os.path.join(args.dataroot, "train", "train_seismic.npy"))
train_labels = np.load(os.path.join(args.dataroot, "train", "train_labels.npy"))
test1_seismic = np.load(os.path.join(args.dataroot, "test_once", "test1_seismic.npy"))
test1_labels = np.load(os.path.join(args.dataroot, "test_once", "test1_labels.npy"))
test2_seismic = np.load(os.path.join(args.dataroot, "test_once", "test2_seismic.npy"))
test2_labels = np.load(os.path.join(args.dataroot, "test_once", "test2_labels.npy"))
assert train_seismic.shape == train_labels.shape
assert train_seismic.min() == WHITE
assert train_seismic.max() == BLACK
assert train_labels.min() == 0
# this is the number of classes in Alaudah's Dutch F3 dataset
assert train_labels.max() == 5
assert test1_seismic.shape == test1_labels.shape
assert test1_seismic.min() == WHITE
assert test1_seismic.max() == BLACK
assert test1_labels.min() == 0
# this is the number of classes in Alaudah's Dutch F3 dataset
assert test1_labels.max() == 5
assert test2_seismic.shape == test2_labels.shape
assert test2_seismic.min() == WHITE
assert test2_seismic.max() == BLACK
assert test2_labels.min() == 0
# this is the number of classes in Alaudah's Dutch F3 dataset
assert test2_labels.max() == 5
elif args.based_on == "fixed_box_number":
logging.info(f"box_number is {args.box_number}")
logging.info(f"box_size is {args.box_size}")
# Note: this assumes the data is 3D; to support higher dimensions, this (and other parts of this script)
# must be refactored
synthetic_shape = (int(args.box_number * args.box_size),) * 3
train_seismic = np.ones(synthetic_shape, dtype=float)
train_labels = np.ones(synthetic_shape, dtype=int)
test1_seismic = train_seismic
test1_labels = train_labels
test2_seismic = train_seismic
test2_labels = train_labels
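A quick worked example of the `fixed_box_number` branch above: the synthetic cube is sized so that exactly `box_number` boxes of `box_size` voxels fit along each of the three axes. The values below are the argparse defaults from this diff.

```python
# Hedged sketch: synthetic cube shape for --based_on fixed_box_number.
import numpy as np

box_number, box_size = 2, 100                       # defaults from the argparse section
synthetic_shape = (int(box_number * box_size),) * 3
train_seismic = np.ones(synthetic_shape, dtype=float)
train_labels = np.ones(synthetic_shape, dtype=int)
print(synthetic_shape)                              # (200, 200, 200)
```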
if args.type == "checkerboard":
logging.info("train checkerbox")
n_inlines, n_crosslines, n_depth = train_seismic.shape
checkerboard_train_seismic = make_box(n_inlines, n_crosslines, n_depth, args.box_size)
@ -163,23 +188,26 @@ def main(args):
checkerboard_train_labels = checkerboard_train_seismic.astype(train_labels.dtype)
# labels are integers and start from zero
checkerboard_train_labels[checkerboard_train_seismic < WHITE_LABEL] = WHITE_LABEL
logging.info(f"training data shape {checkerboard_train_seismic.shape}")
# create checkerboard
logging.info("test1 checkerboard")
n_inlines, n_crosslines, n_depth = test1_seismic.shape
checkerboard_test1_seismic = make_box(n_inlines, n_crosslines, n_depth, args.box_size)
checkerboard_test1_seismic = checkerboard_test1_seismic.astype(test1_seismic.dtype)
checkerboard_test1_labels = checkerboard_test1_seismic.astype(test1_labels.dtype)
# labels are integers and start from zero
checkerboard_test1_labels[checkerboard_test1_seismic < WHITE_LABEL] = WHITE_LABEL
logging.info(f"test1 data shape {checkerboard_test1_seismic.shape}")
logging.info("test2 checkerbox")
n_inlines, n_crosslines, n_depth = test2_seismic.shape
checkerboard_test2_seismic = make_box(n_inlines, n_crosslines, n_depth, args.box_size)
checkerboard_test2_seismic = checkerboard_test2_seismic.astype(test2_seismic.dtype)
checkerboard_test2_labels = checkerboard_test2_seismic.astype(test2_labels.dtype)
# labels are integers and start from zero
checkerboard_test2_labels[checkerboard_test2_seismic < WHITE_LABEL] = WHITE_LABEL
logging.info(f"test2 data shape {checkerboard_test2_seismic.shape}")
# substitute gradient dataset instead of checkerboard
elif args.type == "gradient":
@ -257,10 +285,20 @@ WHITE_LABEL = 0
BLACK_LABEL = BLACK
TYPES = ["checkerboard", "gradient", "binary"]
GRADIENT_DIR = ["inline", "crossline", "depth"]
METHODS = ["dutch_f3", "fixed_box_number"]
parser.add_argument("--dataroot", help="Root location of the input data", type=str, required=True)
parser.add_argument("--dataout", help="Root location of the output data", type=str, required=True)
parser.add_argument("--box_size", help="Size of the bounding box", type=int, required=False, default=100)
parser.add_argument(
"--based_on",
help="This determines the shape of synthetic data array",
type=str,
required=False,
choices=METHODS,
default="dutch_f3",
)
parser.add_argument("--box_number", help="Number of boxes", type=int, required=False, default=2)
parser.add_argument(
"--type", help="Type of data to generate", type=str, required=False, choices=TYPES, default="checkerboard",
)
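To make the checkerboard labelling step above concrete, here is a small hedged sketch: the seismic cube alternates between WHITE and BLACK amplitudes, and clamping negative values to `WHITE_LABEL` after an integer cast yields a two-class {0, 1} label volume. The WHITE/BLACK values and the 2-D toy array are assumptions for illustration; the real data is 3-D.

```python
# Hedged sketch of the checkerboard -> labels step (WHITE/BLACK values assumed).
import numpy as np

WHITE, BLACK, WHITE_LABEL = -1.0, 1.0, 0
checkerboard_seismic = np.tile(np.array([[WHITE, BLACK], [BLACK, WHITE]]), (2, 2))
checkerboard_labels = checkerboard_seismic.astype(int)
checkerboard_labels[checkerboard_seismic < WHITE_LABEL] = WHITE_LABEL  # labels start at zero
print(np.unique(checkerboard_labels))  # [0 1]
```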


@ -148,6 +148,10 @@ def split_patch_train_val(
iline, xline, depth = labels.shape
# Since the locations we will save reference the padded volume, we will increase
# the depth of the volume by the padding amount (2*patch_size).
depth += 2 * patch_size
split_direction = split_direction.lower()
if split_direction == "inline":
num_sections, section_length = iline, xline
@ -157,8 +161,10 @@ def split_patch_train_val(
raise ValueError(f"Unknown split_direction: {split_direction}")
train_range, val_range = _get_aline_range(num_sections, per_val, section_stride)
vert_locations = range(0, depth, patch_stride)
buffer = patch_size // 2
vert_locations = range(buffer, depth - patch_size - buffer, patch_stride)
horz_locations = range(0, section_length, patch_stride)
logger.debug(vert_locations)
logger.debug(horz_locations)
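A hedged worked example of the patch-origin ranges after this change: the depth is grown by the padding amount and vertical patch origins are kept a half-patch away from the padded edges. The volume dimensions below are illustrative, not taken from a real split.

```python
# Hedged worked example of the new vertical/horizontal patch-origin ranges.
patch_size, patch_stride = 100, 50
iline, xline, depth = 400, 701, 255

depth += 2 * patch_size                  # padded volume depth (2 * patch_size added)
buffer = patch_size // 2
vert_locations = range(buffer, depth - patch_size - buffer, patch_stride)
horz_locations = range(0, xline, patch_stride)
print(list(vert_locations))              # [50, 100, 150, 200, 250, 300]
print(len(list(horz_locations)))         # 15
```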


@ -41,6 +41,8 @@ def _copy_files(files_iter, new_dir):
def _split_train_val_test(partition, val_ratio, test_ratio):
logger = logging.getLogger("__name__")
logger.warning(f"prepare_penobscot.py does not support padding. Results might be incorrect. ")
total_samples = len(partition)
val_samples = math.floor(val_ratio * total_samples)
test_samples = math.floor(test_ratio * total_samples)
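A tiny worked example of the split-size arithmetic above, with an illustrative partition size (not from the dataset):

```python
# Hedged sketch of the val/test sample counts used by _split_train_val_test.
import math

total_samples = 1000
val_ratio, test_ratio = 0.1, 0.2
val_samples = math.floor(val_ratio * total_samples)          # 100
test_samples = math.floor(test_ratio * total_samples)        # 200
train_samples = total_samples - val_samples - test_samples   # 700
print(train_samples, val_samples, test_samples)
```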


@ -0,0 +1,2 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.


@ -1,54 +0,0 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.
# Pull request against these branches will trigger this build
pr:
- master
- staging
- contrib
# Any commit to this branch will trigger the build.
trigger:
- master
- staging
- contrib
jobs:
# partially disable setup for now - done manually on build VM
- job: setup
timeoutInMinutes: 10
displayName: Setup
pool:
name: deepseismicagentpool
steps:
- bash: |
# terminate as soon as any internal script fails
set -e
echo "Running setup..."
pwd
ls
git branch
uname -ra
# ENABLE ALL FOLLOWING CODE WHEN YOU'RE READY TO ADD AML BUILD - disabled right now
# ./scripts/env_reinstall.sh
# use hardcoded root for now because not sure how env changes under ADO policy
# DATA_ROOT="/home/alfred/data_dynamic"
# ./tests/cicd/src/scripts/get_data_for_builds.sh ${DATA_ROOT}
# copy your model files like so - using dummy file to illustrate
# azcopy --quiet --source:https://$(storagename).blob.core.windows.net/models/model --source-key $(storagekey) --destination /home/alfred/models/your_model_name
- job: AML_job_placeholder
dependsOn: setup
timeoutInMinutes: 5
displayName: AML job placeholder
pool:
name: deepseismicagentpool
steps:
- bash: |
# UNCOMMENT THIS WHEN YOU HAVE UNCOMMENTED THE SETUP JOB
# source activate seismic-interpretation
echo "TADA!!"


@ -52,10 +52,17 @@ jobs:
./tests/cicd/src/scripts/get_data_for_builds.sh ${DATA_ROOT}
# taken from https://zenodo.org/record/3924682
# paper https://arxiv.org/abs/1905.04307
# TODO: enable when Penobscot is ready to be provided in the repo - rough sequence of steps below
# cd scripts
# wget -o /dev/null -O dataset.h5 https://zenodo.org/record/3924682/files/dataset.h5?download=1
# python byod_penobscot.py --filename dataset.h5 --outdir <where to output data>
# python prepare_dutchf3.py split_train_val patch --data_dir=<outdir from the previous step> --label_file=train/train_labels.npy --output_dir=splits --stride=50 --patch_size=100 --split_direction=both --section_stride=100
# copy your model files like so - using dummy file to illustrate
azcopy --quiet --source:https://$(storagename).blob.core.windows.net/models/model --source-key $(storagekey) --destination /home/alfred/models/your_model_name
###################################################################################################
# Stage 2: fast unit tests
###################################################################################################
@ -63,7 +70,7 @@ jobs:
- job: scripts_unit_tests_job
dependsOn: setup
timeoutInMinutes: 5
displayName: Unit Tests
displayName: Generic Unit Tests
pool:
name: deepseismicagentpool
steps:
@ -72,10 +79,35 @@ jobs:
echo "Starting scripts unit tests"
source activate seismic-interpretation
pytest --durations=0 tests/
echo "Script unit test job passed"
- job: data_loaders_unit_tests_job
dependsOn: scripts_unit_tests_job
timeoutInMinutes: 5
displayName: Data Loaders Unit Tests
pool:
name: deepseismicagentpool
steps:
- bash: |
set -e
echo "Starting scripts unit tests"
source activate seismic-interpretation
pytest --durations=0 interpretation/deepseismic_interpretation/dutchf3/tests/
- job: segy_utils_unit_test_job
dependsOn: data_loaders_unit_tests_job
timeoutInMinutes: 5
displayName: SEGY Converter Unit Tests
pool:
name: deepseismicagentpool
steps:
- bash: |
set -e
echo "Starting scripts unit tests"
source activate seismic-interpretation
pytest --durations=0 interpretation/deepseismic_interpretation/segyconverter/test
- job: cv_lib_unit_tests_job
dependsOn: scripts_unit_tests_job
dependsOn: segy_utils_unit_test_job
timeoutInMinutes: 5
displayName: cv_lib Unit Tests
pool:
@ -89,15 +121,15 @@ jobs:
echo "cv_lib unit test job passed"
###################################################################################################
# Stage 3: Dutch F3 patch models on checkerboard test set:
# Stage 3: Patch models on checkerboard test set:
# deconvnet, unet, HRNet patch depth, HRNet section depth
# CAUTION: these builds were reverted to single-GPU; the new multi-GPU code is left in so the revert can be undone later
###################################################################################################
- job: checkerboard_dutchf3_patch
- job: checkerboard_patch
dependsOn: cv_lib_unit_tests_job
timeoutInMinutes: 30
displayName: Checkerboard Dutch F3 patch local
timeoutInMinutes: 15
displayName: Checkerboard patch local
pool:
name: deepseismicagentpool
steps:
@ -108,7 +140,7 @@ jobs:
# disable auto error handling as we flag it manually
set +e
cd experiments/interpretation/dutchf3_patch/local
cd experiments/interpretation/dutchf3_patch
# Create a temporary directory to store the statuses
dir=$(mktemp -d)
@ -119,36 +151,44 @@ jobs:
pids=
# export CUDA_VISIBLE_DEVICES=0
{ python train.py 'DATASET.ROOT' '/home/alfred/data_dynamic/checkerboard/data' \
'NUM_DEBUG_BATCHES' 50 'TRAIN.END_EPOCH' 1 'TRAIN.SNAPSHOTS' 1 \
'NUM_DEBUG_BATCHES' 64 \
'TRAIN.END_EPOCH' 13 'TRAIN.SNAPSHOTS' 1 \
'DATASET.NUM_CLASSES' 2 'DATASET.CLASS_WEIGHTS' '[1.0, 1.0]' \
'TRAIN.DEPTH' 'none' \
'TRAIN.BATCH_SIZE_PER_GPU' 16 'VALIDATION.BATCH_SIZE_PER_GPU' 32 \
'OUTPUT_DIR' 'output' 'TRAIN.MODEL_DIR' 'no_depth' \
'WORKERS' 1 \
--cfg=configs/patch_deconvnet.yaml --debug ; echo "$?" > "$dir/$BASHPID"; }
pids+=" $!"
# export CUDA_VISIBLE_DEVICES=1
{ python train.py 'DATASET.ROOT' '/home/alfred/data_dynamic/checkerboard/data' \
'NUM_DEBUG_BATCHES' 10 'TRAIN.END_EPOCH' 1 'TRAIN.SNAPSHOTS' 1 \
'NUM_DEBUG_BATCHES' 64 \
'TRAIN.END_EPOCH' 1 'TRAIN.SNAPSHOTS' 1 \
'DATASET.NUM_CLASSES' 2 'DATASET.CLASS_WEIGHTS' '[1.0, 1.0]' \
'TRAIN.DEPTH' 'section' \
'TRAIN.BATCH_SIZE_PER_GPU' 16 'VALIDATION.BATCH_SIZE_PER_GPU' 32 \
'OUTPUT_DIR' 'output' 'TRAIN.MODEL_DIR' 'section_depth' \
'WORKERS' 1 \
--cfg=configs/unet.yaml --debug ; echo "$?" > "$dir/$BASHPID"; }
pids+=" $!"
# export CUDA_VISIBLE_DEVICES=2
{ python train.py 'DATASET.ROOT' '/home/alfred/data_dynamic/checkerboard/data' \
'NUM_DEBUG_BATCHES' 50 'TRAIN.END_EPOCH' 1 'TRAIN.SNAPSHOTS' 1 \
'NUM_DEBUG_BATCHES' 64 \
'TRAIN.END_EPOCH' 1 'TRAIN.SNAPSHOTS' 1 \
'DATASET.NUM_CLASSES' 2 'DATASET.CLASS_WEIGHTS' '[1.0, 1.0]' \
'TRAIN.DEPTH' 'section' \
'TRAIN.BATCH_SIZE_PER_GPU' 16 'VALIDATION.BATCH_SIZE_PER_GPU' 32 \
'OUTPUT_DIR' 'output' 'TRAIN.MODEL_DIR' 'section_depth' \
'WORKERS' 1 \
--cfg=configs/seresnet_unet.yaml --debug ; echo "$?" > "$dir/$BASHPID"; }
pids+=" $!"
# export CUDA_VISIBLE_DEVICES=3
{ python train.py 'DATASET.ROOT' '/home/alfred/data_dynamic/checkerboard/data' \
'NUM_DEBUG_BATCHES' 5 'TRAIN.END_EPOCH' 1 'TRAIN.SNAPSHOTS' 1 \
'NUM_DEBUG_BATCHES' 64 \
'TRAIN.END_EPOCH' 2 'TRAIN.SNAPSHOTS' 1 \
'DATASET.NUM_CLASSES' 2 'DATASET.CLASS_WEIGHTS' '[1.0, 1.0]' \
'TRAIN.DEPTH' 'section' \
'TRAIN.BATCH_SIZE_PER_GPU' 16 'VALIDATION.BATCH_SIZE_PER_GPU' 32 \
'MODEL.PRETRAINED' '/home/alfred/models/hrnetv2_w48_imagenet_pretrained.pth' \
'OUTPUT_DIR' 'output' 'TRAIN.MODEL_DIR' 'section_depth' \
'WORKERS' 1 \
@ -166,14 +206,21 @@ jobs:
# Remove the temporary directory
rm -r "$dir"
set -e
python ../../../tests/cicd/src/check_data_flow.py --infile data_flow_train_patch_deconvnet_no_depth.json --step train --train_depth none
python ../../../tests/cicd/src/check_data_flow.py --infile data_flow_train_unet_section_depth.json --step train --train_depth section
python ../../../tests/cicd/src/check_data_flow.py --infile data_flow_train_seresnet_unet_section_depth.json --step train --train_depth section
python ../../../tests/cicd/src/check_data_flow.py --infile data_flow_train_hrnet_section_depth.json --step train --train_depth section
set +e
# check validation set performance
set -e
python ../../../../tests/cicd/src/check_performance.py --infile metrics_patch_deconvnet_no_depth.json
python ../../../../tests/cicd/src/check_performance.py --infile metrics_unet_section_depth.json
python ../../../../tests/cicd/src/check_performance.py --infile metrics_seresnet_unet_section_depth.json
python ../../../tests/cicd/src/check_performance.py --infile metrics_patch_deconvnet_no_depth.json
python ../../../tests/cicd/src/check_performance.py --infile metrics_unet_section_depth.json
python ../../../tests/cicd/src/check_performance.py --infile metrics_seresnet_unet_section_depth.json
# TODO: enable HRNet test set metrics when we debug HRNet
# python ../../../../tests/cicd/src/check_performance.py --infile metrics_hrnet_section_depth.json
# python ../../../tests/cicd/src/check_performance.py --infile metrics_hrnet_section_depth.json
set +e
echo "All models finished training - start scoring"
@ -256,173 +303,32 @@ jobs:
# Remove the temporary directory
rm -r "$dir"
# check data flow for test
set -e
python ../../../tests/cicd/src/check_data_flow.py --infile data_flow_test_patch_deconvnet_no_depth.json --step test --train_depth none
python ../../../tests/cicd/src/check_data_flow.py --infile data_flow_test_unet_section_depth.json --step test --train_depth section
python ../../../tests/cicd/src/check_data_flow.py --infile data_flow_test_seresnet_unet_section_depth.json --step test --train_depth section
python ../../../tests/cicd/src/check_data_flow.py --infile data_flow_test_hrnet_section_depth.json --step test --train_depth section
set +e
# check test set performance
set -e
python ../../../../tests/cicd/src/check_performance.py --infile metrics_test_patch_deconvnet_no_depth.json --test
python ../../../../tests/cicd/src/check_performance.py --infile metrics_test_unet_section_depth.json --test
python ../../../../tests/cicd/src/check_performance.py --infile metrics_test_seresnet_unet_section_depth.json --test
python ../../../tests/cicd/src/check_performance.py --infile metrics_test_patch_deconvnet_no_depth.json --test
python ../../../tests/cicd/src/check_performance.py --infile metrics_test_unet_section_depth.json --test
python ../../../tests/cicd/src/check_performance.py --infile metrics_test_seresnet_unet_section_depth.json --test
# TODO: enable HRNet test set metrics when we debug HRNet
# python ../../../../tests/cicd/src/check_performance.py --infile metrics_test_hrnet_section_depth.json --test
# python ../../../tests/cicd/src/check_performance.py --infile metrics_test_hrnet_section_depth.json --test
echo "PASSED"
###################################################################################################
# Stage 3: Dutch F3 patch models: deconvnet, unet, HRNet patch depth, HRNet section depth
# CAUTION: these builds were reverted to single-GPU; the new multi-GPU code is left in so the revert can be undone later
###################################################################################################
- job: dutchf3_patch
dependsOn: checkerboard_dutchf3_patch
timeoutInMinutes: 60
displayName: Dutch F3 patch local
pool:
name: deepseismicagentpool
steps:
- bash: |
source activate seismic-interpretation
# disable auto error handling as we flag it manually
set +e
cd experiments/interpretation/dutchf3_patch/local
# Create a temporary directory to store the statuses
dir=$(mktemp -d)
pids=
# export CUDA_VISIBLE_DEVICES=0
{ python train.py 'DATASET.ROOT' '/home/alfred/data_dynamic/dutch_f3/data' 'TRAIN.END_EPOCH' 1 'TRAIN.SNAPSHOTS' 1 \
'TRAIN.DEPTH' 'none' \
'TRAIN.BATCH_SIZE_PER_GPU' 2 'VALIDATION.BATCH_SIZE_PER_GPU' 2 \
'OUTPUT_DIR' 'output' 'TRAIN.MODEL_DIR' 'no_depth' \
'WORKERS' 1 \
--cfg=configs/patch_deconvnet.yaml --debug ; echo "$?" > "$dir/$BASHPID"; }
pids+=" $!"
# export CUDA_VISIBLE_DEVICES=1
{ python train.py 'DATASET.ROOT' '/home/alfred/data_dynamic/dutch_f3/data' 'TRAIN.END_EPOCH' 1 'TRAIN.SNAPSHOTS' 1 \
'TRAIN.DEPTH' 'section' \
'TRAIN.BATCH_SIZE_PER_GPU' 2 'VALIDATION.BATCH_SIZE_PER_GPU' 2 \
'OUTPUT_DIR' 'output' 'TRAIN.MODEL_DIR' 'section_depth' \
'WORKERS' 1 \
--cfg=configs/unet.yaml --debug ; echo "$?" > "$dir/$BASHPID"; }
pids+=" $!"
# export CUDA_VISIBLE_DEVICES=2
{ python train.py 'DATASET.ROOT' '/home/alfred/data_dynamic/dutch_f3/data' 'TRAIN.END_EPOCH' 1 'TRAIN.SNAPSHOTS' 1 \
'TRAIN.DEPTH' 'section' \
'TRAIN.BATCH_SIZE_PER_GPU' 2 'VALIDATION.BATCH_SIZE_PER_GPU' 2 \
'OUTPUT_DIR' 'output' 'TRAIN.MODEL_DIR' 'section_depth' \
'WORKERS' 1 \
--cfg=configs/seresnet_unet.yaml --debug ; echo "$?" > "$dir/$BASHPID"; }
pids+=" $!"
# export CUDA_VISIBLE_DEVICES=3
{ python train.py 'DATASET.ROOT' '/home/alfred/data_dynamic/dutch_f3/data' 'TRAIN.END_EPOCH' 1 'TRAIN.SNAPSHOTS' 1 \
'TRAIN.DEPTH' 'section' \
'TRAIN.BATCH_SIZE_PER_GPU' 2 'VALIDATION.BATCH_SIZE_PER_GPU' 2 \
'MODEL.PRETRAINED' '/home/alfred/models/hrnetv2_w48_imagenet_pretrained.pth' \
'OUTPUT_DIR' 'output' 'TRAIN.MODEL_DIR' 'section_depth' \
'WORKERS' 1 \
--cfg=configs/hrnet.yaml --debug ; echo "$?" > "$dir/$BASHPID"; }
pids+=" $!"
wait $pids || exit 1
# check if any of the models had an error during execution
# Get return information for each pid
for file in "$dir"/*; do
printf 'PID %d returned %d\n' "${file##*/}" "$(<"$file")"
[[ "$(<"$file")" -ne "0" ]] && exit 1 || echo "pass"
done
# Remove the temporary directory
rm -r "$dir"
echo "All models finished training - start scoring"
# Create a temporary directory to store the statuses
dir=$(mktemp -d)
pids=
# export CUDA_VISIBLE_DEVICES=0
# find the latest model which we just trained
# if we're running on a build VM
model_dir=$(ls -td output/patch_deconvnet/no_depth/* | head -1)
# if we're running in a checked out git repo
[[ -z ${model_dir} ]] && model_dir=$(ls -td output/$(git rev-parse --abbrev-ref HEAD)/*/patch_deconvnet/no_depth/* | head -1)
model=$(ls -t ${model_dir}/*.pth | head -1)
# try running the test script
{ python test.py 'DATASET.ROOT' '/home/alfred/data_dynamic/dutch_f3/data' \
'TEST.SPLIT' 'Both' 'TRAIN.MODEL_DIR' 'no_depth' \
'TEST.MODEL_PATH' ${model} \
'WORKERS' 1 \
--cfg=configs/patch_deconvnet.yaml --debug ; echo "$?" > "$dir/$BASHPID"; }
pids+=" $!"
# export CUDA_VISIBLE_DEVICES=1
# find the latest model which we just trained
# if we're running on a build VM
model_dir=$(ls -td output/unet/section_depth/* | head -1)
# if we're running in a checked out git repo
[[ -z ${model_dir} ]] && model_dir=$(ls -td output/$(git rev-parse --abbrev-ref HEAD)/*/unet/section_depth* | head -1)
model=$(ls -t ${model_dir}/*.pth | head -1)
# try running the test script
{ python test.py 'DATASET.ROOT' '/home/alfred/data_dynamic/dutch_f3/data' \
'TEST.SPLIT' 'Both' 'TRAIN.MODEL_DIR' 'section_depth' \
'TEST.MODEL_PATH' ${model} \
'WORKERS' 1 \
--cfg=configs/unet.yaml --debug ; echo "$?" > "$dir/$BASHPID"; }
pids+=" $!"
# export CUDA_VISIBLE_DEVICES=2
# find the latest model which we just trained
# if we're running on a build VM
model_dir=$(ls -td output/seresnet_unet/section_depth/* | head -1)
# if we're running in a checked out git repo
[[ -z ${model_dir} ]] && model_dir=$(ls -td output/$(git rev-parse --abbrev-ref HEAD)/*/seresnet_unet/section_depth/* | head -1)
model=$(ls -t ${model_dir}/*.pth | head -1)
# try running the test script
{ python test.py 'DATASET.ROOT' '/home/alfred/data_dynamic/dutch_f3/data' \
'TEST.SPLIT' 'Both' 'TRAIN.MODEL_DIR' 'section_depth' \
'TEST.MODEL_PATH' ${model} \
'WORKERS' 1 \
--cfg=configs/seresnet_unet.yaml --debug ; echo "$?" > "$dir/$BASHPID"; }
pids+=" $!"
# export CUDA_VISIBLE_DEVICES=3
# find the latest model which we just trained
# if we're running on a build VM
model_dir=$(ls -td output/hrnet/section_depth/* | head -1)
# if we're running in a checked out git repo
[[ -z ${model_dir} ]] && model_dir=$(ls -td output/$(git rev-parse --abbrev-ref HEAD)/*/hrnet/section_depth/* | head -1)
model=$(ls -t ${model_dir}/*.pth | head -1)
# try running the test script
{ python test.py 'DATASET.ROOT' '/home/alfred/data_dynamic/dutch_f3/data' \
'TEST.SPLIT' 'Both' 'TRAIN.MODEL_DIR' 'section_depth' \
'MODEL.PRETRAINED' '/home/alfred/models/hrnetv2_w48_imagenet_pretrained.pth' \
'TEST.MODEL_PATH' ${model} \
'WORKERS' 1 \
--cfg=configs/hrnet.yaml --debug ; echo "$?" > "$dir/$BASHPID"; }
pids+=" $!"
# wait for completion
wait $pids || exit 1
# check if any of the models had an error during execution
# Get return information for each pid
for file in "$dir"/*; do
printf 'PID %d returned %d\n' "${file##*/}" "$(<"$file")"
[[ "$(<"$file")" -ne "0" ]] && exit 1 || echo "pass"
done
# Remove the temporary directory
rm -r "$dir"
echo "PASSED"
###################################################################################################
# Stage 5: Notebook tests
# Stage 4: Notebook tests
###################################################################################################
- job: F3_block_training_and_evaluation_local_notebook
dependsOn: dutchf3_patch
dependsOn: checkerboard_patch
timeoutInMinutes: 5
displayName: F3 block training and evaluation local notebook
pool:
@ -434,3 +340,40 @@ jobs:
--nbname examples/interpretation/notebooks/Dutch_F3_patch_model_training_and_evaluation.ipynb \
--dataset_root /home/alfred/data_dynamic/dutch_f3/data \
--model_pretrained download
- job: segyconverter_notebooks
dependsOn: F3_block_training_and_evaluation_local_notebook
timeoutInMinutes: 5
displayName: SEGY converter notebooks
pool:
name: deepseismicagentpool
steps:
- bash: |
source activate seismic-interpretation
pytest -s tests/cicd/src/notebook_integration_tests.py \
--nbname examples/interpretation/segyconverter/01_segy_sample_files.ipynb \
--cwd examples/interpretation/segyconverter
pytest -s tests/cicd/src/notebook_integration_tests.py \
--nbname examples/interpretation/segyconverter/02_segy_convert_sample.ipynb \
--cwd examples/interpretation/segyconverter
###################################################################################################
# Stage 5: Docker tests
###################################################################################################
- job: docker_build_test
dependsOn: segyconverter_notebooks
timeoutInMinutes: 30
displayName: Docker build test
pool:
name: deepseismicagentpool
steps:
- bash: |
set -e
echo "build docker"
cd docker
pwd
docker images | grep "seismic-deeplearning" | awk '{print $1 ":" $2}' | xargs docker rmi || echo "pass if no seismic-deeplearning image is found"
docker images | grep "<none>" | awk '{print $1 ":" $2}' | xargs docker rmi || echo "pass if no non-tagged images is found"
docker build -t seismic-deeplearning .
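The checkerboard training job earlier in this pipeline runs several `train.py` invocations in parallel, writes each run's exit code into a temp directory, and fails the build if any of them is non-zero. Below is a hedged Python sketch of the same pattern (the build itself does this in bash); the commands are placeholders, not the build's real invocations.

```python
# Hedged Python sketch of the parallel-job / exit-status pattern used in the build.
import subprocess
from concurrent.futures import ThreadPoolExecutor

commands = [
    "python -c \"print('deconvnet ok')\"",
    "python -c \"print('unet ok')\"",
]

def run(cmd: str) -> int:
    return subprocess.run(cmd, shell=True).returncode

with ThreadPoolExecutor() as pool:
    return_codes = list(pool.map(run, commands))

for cmd, code in zip(commands, return_codes):
    print(f"{cmd!r} returned {code}")

if any(code != 0 for code in return_codes):
    raise SystemExit(1)  # mirror the build's `exit 1` when any training run fails
```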


@ -0,0 +1,183 @@
#!/usr/bin/env python3
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.
""" Please see the def main() function for code description."""
import json
""" libraries """
import numpy as np
import os
np.set_printoptions(linewidth=200)
import logging
# toggle to WARNING when running in production, or use CLI
logging.getLogger().setLevel(logging.DEBUG)
# logging.getLogger().setLevel(logging.WARNING)
import argparse
parser = argparse.ArgumentParser()
""" useful information when running from a GIT folder."""
myname = os.path.realpath(__file__)
mypath = os.path.dirname(myname)
myname = os.path.basename(myname)
def main(args):
"""
Tests to ensure proper data flow throughout the experiments.
"""
logging.info("loading data")
with open(args.infile, "r") as fp:
data = json.load(fp)
# Note: these are specific to the setup in
# main_build.yml for train.py
# and get_data_for_builds.sh and prepare_dutchf3.py
if args.step == "test":
for test_key in data.keys():
if args.train_depth == "none":
expected_test_input_shape = (200, 200, 200)
expected_img = (1, 1, 200, 200)
elif args.train_depth == "section":
expected_test_input_shape = (200, 3, 200, 200)
expected_img = (1, 3, 200, 200)
elif args.train_depth == "patch":
expected_test_input_shape = "TBD"
expected_img = "TBD"
raise Exception("Must be added")
msg = f"Expected {expected_test_input_shape} for shape, received {tuple(data[test_key]['test_input_shape'])} instead, in {args.infile.split('.')[0]}"
assert tuple(data[test_key]["test_input_shape"]) == expected_test_input_shape, msg
expected_test_label_shape = (200, 200, 200)
msg = f"Expected {expected_test_label_shape} for shape, received {tuple(data[test_key]['test_label_shape'])} instead, in {args.infile.split('.')[0]}"
assert tuple(data[test_key]["test_label_shape"]) == expected_test_label_shape, msg
for img in data[test_key]["img_shape"]:
msg = (
f"Expected {expected_img} for shape, received {tuple(img)} instead, in {args.infile.split('.')[0]}"
)
assert tuple(img) == expected_img, msg
# -----------------------------------------------
exp_n_section = data[test_key]["take_n_sections"]
pred_shape_len = len(data[test_key]["pred_shape"])
msg = f"Expected {exp_n_section} number of items, received {pred_shape_len} instead, in {args.infile.split('.')[0]}"
assert pred_shape_len == exp_n_section, msg
gt_shape_len = len(data[test_key]["gt_shape"])
msg = f"Expected {exp_n_section} number of items, received {gt_shape_len} instead, in {args.infile.split('.')[0]}"
assert gt_shape_len == exp_n_section, msg
img_shape_len = len(data[test_key]["img_shape"])
msg = f"Expected {exp_n_section} number of items, received {img_shape_len} instead, in {args.infile.split('.')[0]}"
assert img_shape_len == exp_n_section, msg
expected_len = 400
lhs_assertion = data[test_key]["test_section_loader_length"]
msg = f"Expected {expected_len} for test section loader length, received {lhs_assertion} instead, in {args.infile.split('.')[0]}"
assert lhs_assertion == expected_len, msg
lhs_assertion = data[test_key]["test_loader_length"]
msg = f"Expected {expected_len} for test loader length, received {lhs_assertion} instead, in {args.infile.split('.')[0]}"
assert lhs_assertion == expected_len, msg
expected_n_classes = 2
lhs_assertion = data[test_key]["n_classes"]
msg = f"Expected {expected_n_classes} for test loader length, received {lhs_assertion} instead, in {args.infile.split('.')[0]}"
assert lhs_assertion == expected_n_classes, msg
expected_pred = (1, 200, 200)
expected_gt = (1, 1, 200, 200)
for pred, gt in zip(data[test_key]["pred_shape"], data[test_key]["gt_shape"]):
# dimension
msg = f"Expected {expected_pred} for prediction shape, received {tuple(pred[0])} instead, in {args.infile.split('.')[0]}"
assert tuple(pred[0]) == expected_pred, msg
# unique classes
msg = f"Expected up to {expected_n_classes} unique prediction classes, received {pred[1]} instead, in {args.infile.split('.')[0]}"
assert pred[1] <= expected_n_classes, msg
# dimension
msg = f"Expected {expected_gt} for ground truth mask shape, received {tuple(gt[0])} instead, in {args.infile.split('.')[0]}"
assert tuple(gt[0]) == expected_gt, msg
# unique classes
msg = f"Expected up to {expected_n_classes} unique ground truth classes, received {gt[1]} instead, in {args.infile.split('.')[0]}"
assert gt[1] <= expected_n_classes, msg
elif args.step == "train":
if args.train_depth == "none":
expected_shape_in = (200, 200, 400)
elif args.train_depth == "section":
expected_shape_in = (200, 3, 200, 400)
elif args.train_depth == "patch":
expected_shape_in = "TBD"
raise Exception("Must be added")
msg = f"Expected {expected_shape_in} for shape, received {tuple(data['train_input_shape'])} instead, in {args.infile.split('.')[0]}"
assert tuple(data["train_input_shape"]) == expected_shape_in, msg
expected_shape_label = (200, 200, 400)
msg = f"Expected {expected_shape_label} for shape, received {tuple(data['train_label_shape'])} instead, in {args.infile.split('.')[0]}"
assert tuple(data["train_label_shape"]) == expected_shape_label, msg
expected_len = 64
msg = f"Expected {expected_len} for train patch loader length, received {data['train_patch_loader_length']} instead, in {args.infile.split('.')[0]}"
assert data["train_patch_loader_length"] == expected_len, msg
expected_len = 1280
msg = f"Expected {expected_len} for validation patch loader length, received {data['validation_patch_loader_length']} instead, in {args.infile.split('.')[0]}"
assert data["validation_patch_loader_length"] == expected_len, msg
expected_len = 64
msg = f"Expected {expected_len} for train subset length, received {data['train_length_subset']} instead, in {args.infile.split('.')[0]}"
assert data["train_length_subset"] == expected_len, msg
expected_len = 32
msg = f"Expected {expected_len} for validation subset length, received {data['validation_length_subset']} instead, in {args.infile.split('.')[0]}"
assert data["validation_length_subset"] == expected_len, msg
expected_len = 4
msg = f"Expected {expected_len} for train loader length, received {data['train_loader_length']} instead, in {args.infile.split('.')[0]}"
assert data["train_loader_length"] == expected_len, msg
expected_len = 1
msg = f"Expected {expected_len} for train loader length, received {data['train_loader_length']} instead, in {args.infile.split('.')[0]}"
assert data["validation_loader_length"] == expected_len, msg
expected_n_classes = 2
msg = f"Expected {expected_n_classes} for number of classes, received {data['n_classes']} instead, in {args.infile.split('.')[0]}"
assert data["n_classes"] == expected_n_classes, msg
logging.info("all done")
""" cmd-line arguments """
STEPS = ["test", "train"]
TRAIN_DEPTH = ["none", "patch", "section"]
parser.add_argument("--infile", help="Location of the file which has the metrics", type=str, required=True)
parser.add_argument(
"--step", choices=STEPS, type=str, required=True, help="Data flow checks for test or training pipeline"
)
parser.add_argument(
"--train_depth", choices=TRAIN_DEPTH, type=str, required=True, help="Train depth flag, to check the dimensions"
)
""" main wrapper with profiler """
if __name__ == "__main__":
main(parser.parse_args())
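To see the train-step checks above end to end, here is a hedged sketch that writes a minimal JSON file containing exactly the keys and values the script asserts on (taken from the expectations in the diff); the filename is hypothetical. It could then be checked with something like `python check_data_flow.py --infile data_flow_train_example.json --step train --train_depth section`.

```python
# Hedged sketch: minimal train-step data-flow JSON matching the assertions above.
import json

example = {
    "train_input_shape": [200, 3, 200, 400],       # section depth
    "train_label_shape": [200, 200, 400],
    "train_patch_loader_length": 64,
    "validation_patch_loader_length": 1280,
    "train_length_subset": 64,
    "validation_length_subset": 32,
    "train_loader_length": 4,
    "validation_loader_length": 1,
    "n_classes": 2,
}

with open("data_flow_train_example.json", "w") as fp:
    json.dump(example, fp, indent=2)
```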


@ -1,4 +1,7 @@
#!/usr/bin/env python3
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.
""" Please see the def main() function for code description."""
import json
import math
@ -43,27 +46,20 @@ def main(args):
if args.test:
metrics_dict["Pixel Accuracy"] = "Pixel Acc: "
metrics_dict["Mean IoU"] = "Mean IoU: "
else:
else: # validation
metrics_dict["Pixel Accuracy"] = "pixacc"
metrics_dict["Mean IoU"] = "mIoU"
# process training set results
assert data[metrics_dict["Pixel Accuracy"]] > 0.0
assert data[metrics_dict["Pixel Accuracy"]] <= 1.0
assert data[metrics_dict["Mean IoU"]] > 0.0
assert data[metrics_dict["Mean IoU"]] <= 1.0
# check for actual values
math.isclose(data[metrics_dict["Pixel Accuracy"]], 1.0, abs_tol=ABS_TOL)
math.isclose(data[metrics_dict["Mean IoU"]], 1.0, abs_tol=ABS_TOL)
assert data[metrics_dict["Pixel Accuracy"]] > 0.97
assert data[metrics_dict["Mean IoU"]] > 0.97
assert data[metrics_dict["Pixel Accuracy"]] <= 1.0
assert data[metrics_dict["Mean IoU"]] <= 1.0
logging.info("all done")
""" GLOBAL VARIABLES """
# tolerance within which values are compared
ABS_TOL = 1e-3
""" cmd-line arguments """
parser.add_argument("--infile", help="Location of the file which has the metrics", type=str, required=True)
parser.add_argument(
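In plain terms, the change above replaces soft `math.isclose` comparisons with hard thresholds: on the synthetic checkerboard data the models are expected to nearly solve the task, so metrics must exceed 0.97. A tiny hedged sketch of that check, with illustrative metric values:

```python
# Hedged sketch of the tightened metric check (values are illustrative).
metrics = {"pixacc": 0.998, "mIoU": 0.991}

for name, value in metrics.items():
    assert 0.97 < value <= 1.0, f"{name}={value} is below the 0.97 threshold"
```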


@ -20,14 +20,17 @@ def nbname(request):
def dataset_root(request):
return request.config.getoption("--dataset_root")
@pytest.fixture
def model_pretrained(request):
return request.config.getoption("--model_pretrained")
@pytest.fixture
def cwd(request):
return request.config.getoption("--cwd")
"""
def pytest_generate_tests(metafunc):
# This is called for every test. Only get/set command line arguments
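The new `cwd` fixture above reads `--cwd` from the pytest config; for that to work, the option also has to be registered in a `pytest_addoption` hook. The hook itself is outside this hunk, so the sketch below is an assumption about its shape, using only standard pytest API.

```python
# Hedged sketch (assumed): registering the --cwd option the fixture above relies on.
def pytest_addoption(parser):
    parser.addoption(
        "--cwd",
        action="store",
        type=str,
        default=None,
        help="Working directory to run the notebook from",
    )
```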


@ -38,7 +38,7 @@ DATA_F3="${DATA_F3}/data"
cd scripts
python gen_checkerboard.py --dataroot ${DATA_F3} --dataout ${DATA_CHECKERBOARD}
python gen_synthetic_data.py --dataroot ${DATA_F3} --dataout ${DATA_CHECKERBOARD} --type checkerboard --based_on fixed_box_number
# finished data download and generation
@ -50,4 +50,6 @@ python prepare_dutchf3.py split_train_val patch --data_dir=${DATA_F3} --label_
DATA_CHECKERBOARD="${DATA_CHECKERBOARD}/data"
# repeat for checkerboard dataset
python prepare_dutchf3.py split_train_val section --data_dir=${DATA_CHECKERBOARD} --label_file=train/train_labels.npy --output_dir=splits --split_direction=both
python prepare_dutchf3.py split_train_val patch --data_dir=${DATA_CHECKERBOARD} --label_file=train/train_labels.npy --output_dir=splits --stride=50 --patch_size=100 --split_direction=both
python prepare_dutchf3.py split_train_val patch --data_dir=${DATA_CHECKERBOARD} --label_file=train/train_labels.npy --output_dir=splits --stride=50 --patch_size=100 --split_direction=both --section_stride=100