* cleaning up files which are no longer needed

* fixes after removing forking workflow (#322)

* PR to resolve merge issues

* updated main build as well

* added ability to read in git branch name directly

* manually updated the other files

* fixed number of classes for main build tests (#327)

* fixed number of classes for main build tests

* corrected DATASET.ROOT in builds

* added dev build script

* Fixes for development inside the docker container (#335)

* Fix the mount command for the HRNet pretrained model in the docker readme

* Properly catch InvalidGitRepository exception

* make repo paths consistent with non-docker runs -- this way configs paths do not need to be changed

* Properly catch InvalidGitRepository exception in train.py
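
A minimal sketch of the kind of guard described above, using GitPython's `Repo` and `InvalidGitRepositoryError`; the helper name and fallback value are illustrative, not the repo's actual code:

```python
# Sketch: read the git branch name but fall back gracefully when the code is
# run outside a git checkout (e.g. inside a docker container).
from git import Repo, InvalidGitRepositoryError

def git_branch_or_default(default="unknown"):
    try:
        return Repo(search_parent_directories=True).active_branch.name
    except (InvalidGitRepositoryError, TypeError):
        # GitPython raises TypeError for a detached HEAD
        return default
```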

* Readme update (#337)

* README updates

* Removing user specific path from config

Authored-by: Fatemeh Zamanian <Fatemeh.Zamanian@microsoft.com>

* Fixing #324 and #325 (#338)

* update colormap to a non-discrete one -- fixes #324

* fix mask_to_disk to normalize by n_classes
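
For illustration, a minimal sketch of writing a mask to disk normalized by the number of classes (it mirrors the `mask_to_disk` change shown further down in the diff; the function name here is only a stand-in):

```python
import numpy as np
from PIL import Image
from matplotlib import pyplot as plt

def save_mask(mask: np.ndarray, fname: str, n_classes: int, cmap_name: str = "rainbow"):
    # dividing by n_classes keeps colors consistent across images,
    # regardless of which class labels happen to appear in a given mask
    cmap = plt.get_cmap(cmap_name)
    Image.fromarray(cmap(mask / n_classes, bytes=True)).save(fname)
```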

* changes to test.py

* Updating data.py

* bug fix

* increased timeout time for main_build

* retrigger build

* retrigger the build

* increase timeout

* fixes 318 (#339)

* finished 318

* increased checkerboard test timeout

* fix 333 (#340)

* added label correction to train gradient

* changing the gradient data generator to take inline/crossline argument, consistent with the patch loader
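
A toy sketch of a synthetic gradient volume that can vary along either the inline or crossline axis, as the bullet above describes; shape and value range are illustrative assumptions:

```python
import numpy as np

def gradient_volume(shape=(100, 100, 100), direction="inline") -> np.ndarray:
    # linear ramp along the chosen axis, broadcast across the other two
    n_inlines, n_crosslines, _ = shape
    if direction == "inline":
        ramp = np.linspace(0.0, 1.0, n_inlines)[:, None, None]
    else:  # "crossline"
        ramp = np.linspace(0.0, 1.0, n_crosslines)[None, :, None]
    return np.broadcast_to(ramp, shape).copy()
```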

* changing variable name to be more descriptive


Co-authored-by: maxkazmsft <maxkaz@microsoft.com>

* bug fix to model predictions (#345)

* replace hrnet with seresnet in experiments - provides stable default model (#343)

* PR to fix #342 (#347)

* intermediate work for normalization

* 1) normalize function runs based on global MIN and MAX, 2) has error handling for division by zero via np.finfo, 3) decode_segmap normalizes the label/mask based on n_classes
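
A sketch of the global-range normalization described in this bullet, mirroring the `normalize` change shown further down in the diff; `MIN` and `MAX` are assumed to come from the dataset config rather than the individual image:

```python
import numpy as np

def normalize(array: np.ndarray, MIN: float, MAX: float) -> np.ndarray:
    # normalize by the global data range; np.finfo(float).eps guards against
    # division by zero when MIN == MAX
    den = MAX - MIN
    if den == 0:
        den += np.finfo(float).eps
    return (array - MIN) / den
```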

* global normalization added to test.py

* increasing the threshold on timeout

* trigger

* revert

* idk what happened

* increase timeout

* picking up global min and max

* passing config to TrainPatchLoader to facilitate access to global min and max and other attr in low level functions, WIP

* removed print statement

* changed section loaders

* updated test for min and max from config too

* added MIN and MAX to config

* notebook modified for loaders

* another dataloader in notebook

* readme update

* changed the default values for min max, updated the docstring for loaders, removed suppressed lines

* debug

* merging work from CSE team into main staging branch (#357)

* Adding content to interpretation README (#171)

* added sharat, weehyong to authors

* adding a download script for Dutch F3 dataset

* Adding script instructions for dutch f3

* Update README.md

prepare scripts expect root level directory for dutch f3 dataset. (it is downloaded into $dir/data by the script)

* Adding readme text for the notebooks and checking if config is correctly setup

* fixing prepare script example

* Adding more content to interpretation README

* Update README.md

* Update HRNet_Penobscot_demo_notebook.ipynb

Co-authored-by: maxkazmsft <maxkaz@microsoft.com>

* Updates to prepare dutchf3 (#185)

* updating patch to patch_size when we are using it as an integer

* modifying the range function in the prepare_dutchf3 script to get all of our data

* updating path to logging.config so the script can locate it

* manually reverting back log path to troubleshoot build tests

* updating patch to patch_size for testing on preprocessing scripts

* updating patch to patch_size where applicable in ablation.sh

* reverting back changes on ablation.sh to validate build pass

* update patch to patch_size in ablation.sh (#191)

Co-authored-by: Sharat Chikkerur <sharat.chikkerur@gmail.com>

* TestLoader's support for custom paths (#196)

* Add testloader support for custom paths.

* Add test

* added file name workaround for Train*Loader classes

* adding comments and clean up

* Remove legacy code.

* Remove parameters that don't exist in init() from documentation.

* Add unit tests for data loaders in dutchf3

* moved unit tests

Co-authored-by: maxkazmsft <maxkaz@microsoft.com>

* select contiguous data splits for val and train (#200)

* select contiguous data splits for test and train

* changed data-dir to data_dir as arg to prepare_dutchf3.py

* update script with new required parameter label_file

* ignoring split_alaudah_et_al_19 as it is not updated

* changed TEST to VALIDATION for clarity in the code

* included job to run scripts unit test

* Fix val/train split and add tests

* adjust to consider the whole horz_lines
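
A toy sketch of a contiguous train/validation split, assuming the last fraction of lines is held out as a single contiguous block; the real prepare script's logic is more involved:

```python
import numpy as np

def contiguous_split(n_lines: int, val_fraction: float = 0.2):
    # keep the last val_fraction of the lines as one contiguous validation
    # block instead of sampling them at random from inside the training area
    cut = int(n_lines * (1 - val_fraction))
    return np.arange(0, cut), np.arange(cut, n_lines)
```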

* update environment - gitpython version

* Segy Converter Utility (#199)

* Add convert_segy utility script and related notebooks

* add segy files to .gitignore

* readability update

* Create methods for normalizing and clipping separately.
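
A hedged sketch of a clipping step kept separate from normalization, assuming a k-sigma rule; the converter's actual clipping criterion may differ:

```python
import numpy as np

def clip_to_k_sigma(cube: np.ndarray, k: float = 3.0) -> np.ndarray:
    # clamp extreme amplitudes to within k standard deviations of the mean
    mu, sigma = cube.mean(), cube.std()
    return np.clip(cube, mu - k * sigma, mu + k * sigma)
```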

* Add comment

* update file paths

* cleanup tests and terminology for the normalization/clipping code

* update notes to provide more context for using the script

* Add tests for clipping.

* Update comments

* added Microsoft copyright

* Update root README

* Add a flag to turn on clipping in dataprep script.

* Remove hard coded values and fix _filder_data method.

* Fix some minor issues pointed out on comments.

* Remove unused lib.

* Rename notebooks to impose order; set env; move all function definitions into utils; improve comments in notebooks; and include code example to run prepare_dutchf3.py

* Label missing data with 255.
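
A small sketch of the idea, assuming 255 is used as the ignore label when padding a label volume out to a full rectangle (255 is also the ignore value used by the logging code further down in the diff):

```python
import numpy as np

def pad_labels(labels: np.ndarray, target_shape: tuple, fill_value: int = 255) -> np.ndarray:
    # everything outside the original data extent is marked with the ignore value
    out = np.full(target_shape, fill_value, dtype=labels.dtype)
    out[tuple(slice(0, s) for s in labels.shape)] = labels
    return out
```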

* Remove cell with --help command.

* Add notebooks to test pipeline.

* grammar edits

* update notebook output and utils naming

* fix output dir error and cleanup notebook

* fix yaml indent error in notebooks_build.yml

* fix merge issues and job name errors

* debugging the build pipeline

* combine notebook tests for segy converter since they are dependent on each other

Co-authored-by: Geisa Faustino <32823639+GeisaFaustino@users.noreply.github.com>

* Azureml train pipeline (#195)

* initial add of azure ml pipeline

* update references and dependencies

* fix integration tests

* remove incomplete tests

* add azureml requirements.txt for dutchf3 local patch and update pipeline config

* add empty __init__.py to cv_lib dutchf3

* Get train.py to run in pipeline

* allow output dir in train.py

* Clean up README and __init__

* only pass output if available and use input dir for output in train.py

* update comment in train.py

* updating azureml_requirements to only pull from /master

* removing windows guidance in azureml_pipelines/README.md

* adding .env.example

* adding azureml config example

* updating documentation in azureml_pipelines README.md

* updating main README.md to refer to AML guidance documentation

* updating AML README.md to include additional guidance to cancel runs

* adding documentation on AzureML pipelines in the AML README.me

* adding files needed section for AML training run

* including hyperlink pointing to additional detail on Azure Machine Learning pipelines in AML README.md

* removing the mention of VSCode in the AML README.md

* fixing typo

* modifying config to pipeline configuration in README.md

* fixing typo in README.md

* adding documentation on how to create a blob container and copy data onto it

* adding documentation on blob storage guidance

* adding guidance on how to get the subscription id

* adding guidance to activate environment and then run the kick off train pipeline from ROOT

* adding ability to pass in experiment name and different pipeline configuration to kickoff_train_pipeline.py

* adding Microsoft Corporation Copyright to kickoff_train_pipeline.py

* fixing format in README.md

* adding troubleshooting section in README.md for connection to subscription

* updating troubleshooting title

* adding guidance on how to download the config.json from the Azure Portal in the README.md

* adding additional guidance and information on AzureML compute targets and naming conventions

* changing the configuration file example to only include the train step that is currently supported

* updating config to pipeline configuration when applicable

* adding link to Microsoft docs for additional information on pipeline steps

* updated AML test build definitions

* updated AML test build definitions

* adding job to aml_build.yml

* updating example config for testing

* modifying the test_train_pipeline.py to have appropriate number of pipeline steps and other required modifications

* updating AML_pipeline_tests in aml_build.yml to consume environment variables

* updating scriptType, scriptLocation, and inlineScript in aml_build.yml

* trivial commit to re-trigger broken build pipelines

* fix to aml yml build to use env vars for secrets and everything else

* another yml fix

* another yml fix

* reverting structure format of jobs for aml_build pipeline tests

* updating path to test_train_pipeline.py

* aml_pipeline_tests timed out, extending timeoutInMinutes from 10 to 40

* adding additional pytest

* adding az login

* updating variables in aml pipeline tests

Co-authored-by: Anna Zietlow <annamzietlow@gmail.com>
Co-authored-by: maxkazmsft <maxkaz@microsoft.com>

* moved contrib contributions around from CSE

* fixed dataloader tests - updated them to work with new code from staging branch

* segyconverter notebooks and tests run and pass; updated documentation

* added test job for segy converter notebooks

* removed AML training pipeline from this release

* fixed training model tolerance precision in the tests - wasn't working

* fixed train.py build issues after the merge

* addressed PR comments

* fixed bug in check_performance

Co-authored-by: Sharat Chikkerur <sharat.chikkerur@microsoft.com>
Co-authored-by: kirasoderstrom <kirasoderstrom@gmail.com>
Co-authored-by: Sharat Chikkerur <sharat.chikkerur@gmail.com>
Co-authored-by: Geisa Faustino <32823639+GeisaFaustino@users.noreply.github.com>
Co-authored-by: Ricardo Squassina Lee <8495707+squassina@users.noreply.github.com>
Co-authored-by: Michael Zawacki <mikezawacki@hotmail.com>
Co-authored-by: Anna Zietlow <annamzietlow@gmail.com>

* make tests simpler (#368)

* removed Dutch F3 job from main_build

* fixed a bug in data subset in debug mode

* modified epoch numbers to pass the performance checks, checked out check_performance from Max's branch

* modified get_data_for_builds.sh to set up checkerboard data for smaller size, minor improvements on gen_checkerboard
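
A toy sketch of the kind of synthetic checkerboard volume the build tests train on; block size and volume shape are arbitrary assumptions, and the actual generator script takes more options:

```python
import numpy as np

def checkerboard_volume(shape=(100, 100, 100), block: int = 10) -> np.ndarray:
    # alternate 0/1 blocks of size `block` along every axis
    idx = np.indices(shape) // block
    return (idx.sum(axis=0) % 2).astype(np.uint8)
```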

* send all the batches, disabled the performance checks for patch_deconvnet

* added comment to enable tests for patch_deconvnet after debugging, renamed gen_checkerboard, added options to new arg per Max's suggestion

* Replace HRNet with SEResNet model in the notebook (#362)

* replaced HRNet with SEResNet model in the notebook

* removed debugging cell info

* fixed bug where resnet_unet model wasn't loading the pre-trained version in the notebook

* fixed build VM problems

* Multi-GPU training support (#359)

* Data flow tests (#375)

* renamed checkerboard job name

* restructured default outputs from test.py to be dumped under output dir and not debug dir

* test.py output re-org

* removed outdated variable from check_performance.py

* intermediate work

* intermediate work

* bunch of intermediate work

* changing args for different trainings

* final to run dev_build

* remove print statements

* removed print statement

* removed suppressed lines

* added assertion error msg

* added assertion error msg, one intentional bug to test

* testing a stupid bug

* debug

* omg

* final

* trigger build

* fixed multi-GPU termination in train.py (#379)

* PR to fix #371 and #372  (#380)

* added learning rate to logs

* changed epoch for patch_deconvnet, and enabled the tests

* removed TODOs

* changed tensorflow pinned version (#387)

* changed tensorflow pinned version

* trigger build

* closes 385 (#389)

* Fixing #259 by adding symmetric padding along depth direction  (#386)
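
A minimal sketch of symmetric padding along the depth direction, assuming depth is the last axis of the volume:

```python
import numpy as np

def pad_depth(volume: np.ndarray, pad: int) -> np.ndarray:
    # mirror values at the top and bottom of the depth (last) axis only
    return np.pad(volume, ((0, 0), (0, 0), (pad, pad)), mode="symmetric")
```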

* BYOD Penobscot (#390)

* minor updates to files

* added penobscot conversion code

* docker build test (#388)

* added a new job to test building the docker, for now it is daisy-chained to the end

* this is just a TEST

* test

* test

* remove old image

* debug

* debug

* test

* debug

* enabled all the jobs

* quick fix

* removing non-tagged images

Co-authored-by: maxkazmsft <maxkaz@microsoft.com>

* added missing license headers and fixed formatting (#391)

* added missing license headers and fixed formatting

* some more license headers

* updated documentation to close 354 and 381 (#392)

* fix test.py and notebook issues (#394)

* resolved conflicts for 0.2 release (#396)

* V00.01.00003 release (#356)

* cleaning up files which are no longer needed

* fixes after removing forking workflow (#322)

* PR to resolve merge issues

* updated main build as well

* added ability to read in git branch name directly

* manually updated the other files

* fixed number of classes for main build tests (#327)

* fixed number of classes for main build tests

* corrected DATASET.ROOT in builds

* added dev build script

* Fixes for development inside the docker container (#335)

* Fix the mount command for the HRNet pretrained model in the docker readme

* Properly catch InvalidGitRepository exception

* make repo paths consistent with non-docker runs -- this way configs paths do not need to be changed

* Properly catch InvalidGitRepository exception in train.py

* Readme update (#337)

* README updates

* Removing user specific path from config

Authored-by: Fatemeh Zamanian <Fatemeh.Zamanian@microsoft.com>

* Fixing #324 and #325 (#338)

* update colormap to a non-discrete one -- fixes #324

* fix mask_to_disk to normalize by n_classes

* changes to test.py

* Updating data.py

* bug fix

* increased timeout time for main_build

* retrigger build

* retrigger the build

* increase timeout

* fixes 318 (#339)

* finished 318

* increased checkerboard test timeout

* fix 333 (#340)

* added label correction to train gradient

* changing the gradient data generator to take inline/crossline argument, consistent with the patch loader

* changing variable name to be more descriptive


Co-authored-by: maxkazmsft <maxkaz@microsoft.com>

* bug fix to model predictions (#345)

* replace hrnet with seresnet in experiments - provides stable default model (#343)

Co-authored-by: yalaudah <yazeed.alaudah@microsoft.com>
Co-authored-by: Fatemeh <fazamani@microsoft.com>

* typos

Co-authored-by: yalaudah <yazeed.alaudah@microsoft.com>
Co-authored-by: Fatemeh <fazamani@microsoft.com>

Co-authored-by: yalaudah <yazeed.alaudah@microsoft.com>
Co-authored-by: Fatemeh <fazamani@microsoft.com>
Co-authored-by: Sharat Chikkerur <sharat.chikkerur@microsoft.com>
Co-authored-by: kirasoderstrom <kirasoderstrom@gmail.com>
Co-authored-by: Sharat Chikkerur <sharat.chikkerur@gmail.com>
Co-authored-by: Geisa Faustino <32823639+GeisaFaustino@users.noreply.github.com>
Co-authored-by: Ricardo Squassina Lee <8495707+squassina@users.noreply.github.com>
Co-authored-by: Michael Zawacki <mikezawacki@hotmail.com>
Co-authored-by: Anna Zietlow <annamzietlow@gmail.com>
maxkazmsft 2020-07-07 15:49:48 -04:00, committed by GitHub
Parent 15d45fb8c9
Commit 080cf46fe9
No key found matching this signature
GPG key ID: 4AEE18F83AFDEB23
91 changed files: 4088 additions and 881 deletions


@ -0,0 +1,5 @@
{
"subscription_id": "input_sub_id",
"resource_group": "input_resource_group",
"workspace_name": "input_workspace_name"
}

.env.example (new file, 8 lines)

@ -0,0 +1,8 @@
BLOB_ACCOUNT_NAME=
BLOB_CONTAINER_NAME=
BLOB_ACCOUNT_KEY=
BLOB_SUB_ID=
AML_COMPUTE_CLUSTER_NAME=
AML_COMPUTE_CLUSTER_MIN_NODES=
AML_COMPUTE_CLUSTER_MAX_NODES=
AML_COMPUTE_CLUSTER_SKU=

.gitignore (6 changed lines)

@ -115,4 +115,8 @@ interpretation/environment/anaconda/local/src/cv-lib
# Rope project settings
.ropeproject
*.pth
*.pth
# Seismic data files
*.sgy
*.segy

README.md (122 changed lines)

@ -19,7 +19,7 @@ For developers, we offer a more hands-on Quick Start below.
#### Dev Quick Start
There are two ways to get started with the DeepSeismic codebase, which currently focuses on Interpretation:
- if you'd like to get an idea of how our interpretation (segmentation) models are used, simply review the [HRNet demo notebook](https://github.com/microsoft/seismic-deeplearning/blob/master/examples/interpretation/notebooks/Dutch_F3_patch_model_training_and_evaluation.ipynb)
- if you'd like to get an idea of how our interpretation (segmentation) models are used, simply review the [demo notebook](https://github.com/microsoft/seismic-deeplearning/blob/master/examples/interpretation/notebooks/Dutch_F3_patch_model_training_and_evaluation.ipynb)
- to run the code, you'll need to set up a compute environment (which includes setting up a GPU-enabled Linux VM and downloading the appropriate Anaconda Python packages) and download the datasets which you'd like to work with - detailed steps for doing this are provided in the next `Interpretation` section below.
If you run into any problems, chances are your problem has already been solved in the [Troubleshooting](#troubleshooting) section.
@ -27,10 +27,14 @@ If you run into any problems, chances are your problem has already been solved i
The notebook is designed to be run in demo mode by default using a pre-trained model in under 5 minutes on any reasonable Deep Learning GPU such as nVidia K80/P40/P100/V100/TitanV.
### Azure Machine Learning
[Azure Machine Learning](https://docs.microsoft.com/en-us/azure/machine-learning/) enables you to train and deploy your machine learning models and pipelines at scale, and leverage open-source Python frameworks, such as PyTorch, TensorFlow, and scikit-learn. If you are looking at getting started with using the code in this repository with Azure Machine Learning, refer to [Azure Machine Learning How-to](https://github.com/Azure/MachineLearningNotebooks/tree/master/how-to-use-azureml) to get started.
[Azure Machine Learning](https://docs.microsoft.com/en-us/azure/machine-learning/) enables you to train and deploy your machine learning models and pipelines at scale, and leverage open-source Python frameworks, such as PyTorch, TensorFlow, and scikit-learn.
If you are looking at getting started with using the code in this repository with Azure Machine Learning, refer to [Azure Machine Learning How-to](https://github.com/Azure/MachineLearningNotebooks/tree/master/how-to-use-azureml) to get started.
## Interpretation
For seismic interpretation, the repository consists of extensible machine learning pipelines, that shows how you can leverage state-of-the-art segmentation algorithms (UNet, SEResNET, HRNet) for seismic interpretation.
We currently support rectangular data, i.e. 2D and 3D seismic images which form a rectangle in 2D.
We also provide [utilities](./examples/interpretation/segyconverter/README.md) for converting SEGY data with rectangular boundaries into numpy arrays
where everything outside the boundary has been padded to produce a rectangular 3D numpy volume.
To run examples available on the repo, please follow instructions below to:
1) [Set up the environment](#setting-up-environment)
@ -85,23 +89,19 @@ This repository provides examples on how to run seismic interpretation on Dutch
Please make sure you have enough disk space to download either dataset.
We have experiments and notebooks which use either one dataset or the other. Depending on which experiment/notebook you want to run you'll need to download the corresponding dataset. We suggest you start by looking at [HRNet demo notebook](https://github.com/microsoft/seismic-deeplearning/blob/master/examples/interpretation/notebooks/Dutch_F3_patch_model_training_and_evaluation.ipynb) which requires the Dutch F3 dataset.
We have experiments and notebooks which use either one dataset or the other. Depending on which experiment/notebook you want to run you'll need to download the corresponding dataset. We suggest you start by looking at [demo notebook](https://github.com/microsoft/seismic-deeplearning/blob/master/examples/interpretation/notebooks/Dutch_F3_patch_model_training_and_evaluation.ipynb) which requires the Dutch F3 dataset.
#### Dutch F3 Netherlands dataset prep
To download the F3 Netherlands dataset for 2D experiments, please follow the data download instructions at
#### Dutch F3 dataset prep
To download the Dutch F3 dataset for 2D experiments, please follow the data download instructions at
[this github repository](https://github.com/yalaudah/facies_classification_benchmark) (section Dataset). Alternatively, you can use the [download script](scripts/download_dutch_f3.sh)
```
```bash
data_dir="$HOME/data/dutch"
mkdir -p "${data_dir}"
./scripts/download_dutch_f3.sh "${data_dir}"
```
Download scripts also automatically create any subfolders in `${data_dir}` which are needed for the data preprocessing scripts.
At this point, your `${data_dir}` directory should contain a `data` folder, which should look like this:
```
Download scripts also automatically create any subfolders in `${data_dir}` which are needed for the data preprocessing scripts. At this point, your `${data_dir}` directory should contain a `data` folder, which should look like this:
```bash
data
├── splits
├── test_once
@ -113,10 +113,8 @@ data
├── train_labels.npy
└── train_seismic.npy
```
To prepare the data for the experiments (e.g. split into train/val/test), please run the following script:
```
```bash
# change working directory to scripts folder
cd scripts
@ -125,40 +123,66 @@ python prepare_dutchf3.py split_train_val patch --data_dir=${data_dir}/data --la
--stride=50 --patch_size=100 --split_direction=both
# For section-based experiments
python prepare_dutchf3.py split_train_val section --data-dir=${data_dir}/data --label_file=train/train_labels.npy --output_dir=splits \ --split_direction=both
python prepare_dutchf3.py split_train_val section --data-dir=${data_dir}/data --label_file=train/train_labels.npy --output_dir=splits --split_direction=both
# go back to repo root
cd ..
```
Refer to the script itself for more argument options.
#### Bring Your Own Data [BYOD]
##### Bring your own SEG-Y data
If you want to train these models using your own seismic and label data, the files will need to be prepped and
converted to npy files. Typically, the [segyio](https://pypi.org/project/segyio/) library can be used to open SEG-Y files that follow the standard, but more often than not, there are non-standard settings or missing traces that will cause segyio to fail. If this happens with your data, read these notebooks and scripts to help prepare your data files:
* [SEG-Y Data Prep README](contrib/segyconverter/README.md)
* [convert_segy.py utility](contrib/segyconverter/convert_segy.py) - Utility script that can read SEG-Y files with unusual byte header locations and missing traces
* [segy_convert_sample notebook](contrib/segyconverter/segy_convert_sample.ipynb) - Details on SEG-Y data conversion
* [segy_sample_files notebook](contrib/segyconverter/segy_sample_files.ipynb) - Create test SEG-Y files that describe the scenarios that may cause issues when converting the data to numpy arrays
##### Penobscot example
We also offer starter code to convert [Penobscot](https://arxiv.org/abs/1905.04307) dataset (available [here](https://zenodo.org/record/3924682))
into Tensor format used by the Dutch F3 dataset - once converted, you can run Penobscot through the same
mechanisms as the Dutch F3 dataset. The rough sequence of steps is:
```bash
conda activate seismic-interpretation
cd scripts
wget -o /dev/null -O dataset.h5 https://zenodo.org/record/3924682/files/dataset.h5?download=1
# convert penobscot
python byod_penobscot.py --filename dataset.h5 --outdir <where to output data>
# preprocess for experiments
python prepare_dutchf3.py split_train_val patch --data_dir=<outdir from the previous step> --label_file=train/train_labels.npy --output_dir=splits --stride=50 --patch_size=100 --split_direction=both --section_stride=100
```
### Run Examples
#### Notebooks
We provide example notebooks under `examples/interpretation/notebooks/` to demonstrate how to train seismic interpretation models and evaluate them on Penobscot and F3 datasets.
Make sure to run the notebooks in the conda environment we previously set up (`seismic-interpretation`). To register the conda environment in Jupyter, please run:
```
python -m ipykernel install --user --name seismic-interpretation
```
__Optional__: if you plan to develop a notebook, you can install black formatter with the following commands:
```bash
conda activate seismic-interpretation
jupyter nbextension install https://github.com/drillan/jupyter-black/archive/master.zip --user
jupyter nbextension enable jupyter-black-master/jupyter-black
```
This will enable your notebook with a Black formatter button, which then clicked will automatically format a notebook cell which you're in.
#### Experiments
We also provide scripts for a number of experiments we conducted using different segmentation approaches. These experiments are available under `experiments/interpretation`, and can be used as examples. Within each experiment start from the `train.sh` and `test.sh` scripts under the `local/` directory, which invoke the corresponding python scripts, `train.py` and `test.py`. Take a look at the experiment configurations (see Experiment Configuration Files section below) for experiment options and modify if necessary.
We also provide scripts for a number of experiments we conducted using different segmentation approaches. These experiments are available under `experiments/interpretation`, and can be used as examples. Within each experiment start from the `train.sh` and `test.sh` scripts which invoke the corresponding python scripts, `train.py` and `test.py`. Take a look at the experiment configurations (see Experiment Configuration Files section below) for experiment options and modify if necessary.
This release currently supports Dutch F3 local execution
- [F3 Netherlands Patch](experiments/interpretation/dutchf3_patch/README.md)
This release currently supports Dutch F3 local and distributed training
- [Dutch F3 Patch](experiments/interpretation/dutchf3_patch/README.md)
Please note that we use [NVIDIA's NCCL](https://docs.nvidia.com/deeplearning/nccl/install-guide/index.html) library to enable distributed training. Please follow the installation instructions [here](https://docs.nvidia.com/deeplearning/nccl/install-guide/index.html#down) to install NCCL on your system.
#### Configuration Files
We use [YACS](https://github.com/rbgirshick/yacs) configuration library to manage configuration options for the experiments. There are three ways to pass arguments to the experiment scripts (e.g. train.py or test.py):
@ -166,17 +190,22 @@ We use [YACS](https://github.com/rbgirshick/yacs) configuration library to manag
- __default.py__ - A project config file `default.py` is a one-stop reference point for all configurable options, and provides sensible defaults for all arguments. If no arguments are passed to `train.py` or `test.py` script (e.g. `python train.py`), the arguments are by default loaded from `default.py`. Please take a look at `default.py` to familiarize yourself with the experiment arguments the script you run uses.
- __yml config files__ - YAML configuration files under `configs/` are typically created one for each experiment. These are meant to be used for repeatable experiment runs and reproducible settings. Each configuration file only overrides the options that are changing in that experiment (e.g. options loaded from `defaults.py` during an experiment run will be overridden by arguments loaded from the yaml file). As an example, to use yml configuration file with the training script, run:
```
python train.py --cfg "configs/seresnet_unet.yaml"
```
- __command line__ - Finally, options can be passed in through `options` argument, and those will override arguments loaded from the configuration file. We created CLIs for all our scripts (using Python Fire library), so you can pass these options via command-line arguments, like so:
```
python train.py DATASET.ROOT "/home/username/data/dutch/data" TRAIN.END_EPOCH 10
```
#### Training
We run an aggressive cosine annealing schedule which starts with a higher Learning Rate (LR) and gradually lowers it over approximately 60 epochs to zero,
at which point we raise LR back up to its original value and lower it again for about 60 epochs; this process continues 5 times, forming 60*5=300 training epochs in total
in 5 cycles; model with the best frequency-weighted IoU is snapshotted to disc during each cycle. We suggest consulting TensorBoard logs to see which training cycle
produced the best model and use that model during scoring.
For multi-GPU training, we run a linear burn-in LR schedule before starting the 5 cosine cycles, then the training continues the same way as for single-GPU.
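
As a rough illustration of the restart pattern described above, a minimal sketch using PyTorch's built-in `CosineAnnealingWarmRestarts`; the repository's actual scheduler, burn-in, and snapshot logic may differ:

```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingWarmRestarts

model = torch.nn.Linear(10, 2)  # stand-in model for illustration
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
# LR decays from 0.1 towards 0 over 60 epochs, then restarts; 5 cycles = 300 epochs
scheduler = CosineAnnealingWarmRestarts(optimizer, T_0=60, eta_min=0.0)

for epoch in range(300):
    # ... one epoch of training and validation would run here ...
    scheduler.step()  # advances the cosine schedule; restarts every 60 epochs
```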
### Pretrained Models
@ -184,14 +213,6 @@ There are two types of pre-trained models used by this repo:
1. pre-trained models trained on non-seismic Computer Vision datasets which we fine-tune for the seismic domain through re-training on seismic data
2. models which we already trained on seismic data - these are downloaded automatically by our code if needed (again, please see the notebook for a demo above regarding how this is done).
#### HRNet ImageNet weights model
To enable training from scratch on seismic data and to achieve the same results as the benchmarks quoted below you will need to download the HRNet model [pretrained](https://github.com/HRNet/HRNet-Image-Classification) on ImageNet. We are specifically using the [HRNet-W48-C](https://1drv.ms/u/s!Aus8VCZ_C_33dKvqI6pBZlifgJk) pre-trained model; other HRNet variants are also available [here](https://github.com/HRNet/HRNet-Image-Classification) - you can navigate to those from the [main HRNet landing page](https://github.com/HRNet/HRNet-Object-Detection) for object detection.
Unfortunately, the OneDrive location which is used to host the model is using a temporary authentication token, so there is no way for us to script up model download. There are two ways to upload and use the pre-trained HRNet model on DS VM:
- download the model to your local drive using a web browser of your choice and then upload the model to the DS VM using something like `scp`; navigate to Portal and copy DS VM's public IP from the Overview panel of your DS VM (you can search your DS VM by name in the search bar of the Portal) then use `scp local_model_location username@DS_VM_public_IP:./model/save/path` to upload
- alternatively, you can use the same public IP to open remote desktop over SSH to your Linux VM using [X2Go](https://wiki.x2go.org/doku.php/download:start): you can basically open the web browser on your VM this way and download the model to VM's disk
### Viewers (optional)
@ -222,20 +243,21 @@ This section contains benchmarks of different algorithms for seismic interpretat
#### Dutch F3
| Source | Experiment | PA | FW IoU | MCA | V100 (16GB) training time |
| -------------- | --------------------------- | ----- | ------ | ---- | ------------------------- |
| Alaudah et al. | Section-based | 0.905 | 0.817 | .832 | N/A |
| | Patch-based | 0.852 | 0.743 | .689 | N/A |
| DeepSeismic | Patch-based+fixed | .875 | .784 | .740 | 08h 54min |
| | SEResNet UNet+section depth | .910 | .841 | .809 | 55h 02min |
| | HRNet(patch)+patch_depth | .884 | .795 | .739 | 67h 41min |
| | HRNet(patch)+section_depth | .900 | .820 | .767 | 55h 08min |
| Source | Experiment | PA | FW IoU | MCA | V100 (16GB) training time |
| -------------- | ----------------------------------------- | ----- | ------ | ---- | ------------------------- |
| Alaudah et al. | Section-based | 0.905 | 0.817 | .832 | N/A |
| | Patch-based | 0.852 | 0.743 | .689 | N/A |
| DeepSeismic | Patch-based+fixed | .875 | .784 | .740 | 08h 54min |
| | SEResNet UNet+section depth | .910 | .841 | .809 | 55h 02min |
| | HRNet(patch)+patch_depth (experimental) | .884 | .795 | .739 | 67h 41min |
| | HRNet(patch)+section_depth (experimental) | .900 | .820 | .767 | 55h 08min |
Note: these are single-run performance numbers and we expect the results to fluctuate in-between different runs, i.e. some variability is to be expected,
but we expect the performance numbers to be close to these with this codebase.
#### Reproduce benchmarks
In order to reproduce the benchmarks, you will need to navigate to the [experiments](experiments) folder. In there, each of the experiments are split into different folders. To run the Netherlands F3 experiment navigate to the [dutchf3_patch/local](experiments/interpretation/dutchf3_patch/local) folder. In there is a training script [([train.sh](experiments/interpretation/dutchf3_patch/local/train.sh))
which will run the training for any configuration you pass in. Once you have run the training you will need to run the [test.sh](experiments/interpretation/dutchf3_patch/local/test.sh) script. Make sure you specify
the path to the best performing model from your training run, either by passing it in as an argument or altering the YACS config file.
In order to reproduce the benchmarks, you will need to navigate to the [experiments](experiments) folder. In there, each of the experiments are split into different folders. To run the Dutch F3 experiment navigate to the [dutchf3_patch](experiments/interpretation/dutchf3_patch/) folder. In there is a training script [train.sh](experiments/interpretation/dutchf3_patch/train.sh)
which will run the training for any configuration you pass in. If your machine has multiple GPUs, you can run distributed training using the distributed training script [train_distributed.sh](experiments/interpretation/dutchf3_patch/train_distributed.sh). Once you have run the training you will need to run the [test.sh](experiments/interpretation/dutchf3_patch/test.sh) script. Make sure you specify the path to the best performing model from your training run, either by passing it in as an argument or altering the YACS config file.
## Contributing
@ -288,11 +310,11 @@ which will indicate that anaconda folder is `__/anaconda__`. We'll refer to this
<summary><b>Data Science Virtual Machine conda package installation warnings</b></summary>
It could happen that while creating the conda environment defined by `environment/anaconda/local/environment.yml` on an Ubuntu DSVM, one can get multiple warnings like so:
```
```bash
WARNING conda.gateways.disk.delete:unlink_or_rename_to_trash(140): Could not remove or rename /anaconda/pkgs/ipywidgets-7.5.1-py_0/site-packages/ipywidgets-7.5.1.dist-info/LICENSE. Please remove this file manually (you may need to reboot to free file handles)
```
If this happens, similar to instructions above, stop the conda environment creation (type ```Ctrl+C```) and then change recursively the ownership /anaconda directory from root to current user, by running this command:
If this happens, similar to instructions above, stop the conda environment creation (type ```Ctrl+C```) and then change recursively the ownership `/anaconda` directory from root to current user, by running this command:
```bash
sudo chown -R $USER /anaconda
@ -322,17 +344,14 @@ which will indicate that anaconda folder is `__/anaconda__`. We'll refer to this
torch.cuda.is_available()
```
The output should say "True".
If the output is still "False", you may want to try setting your environment variable to specify the device manually - to test this, start a new `ipython` session and type:
The output should say `True`. If the output is still `False`, you may want to try setting your environment variable to specify the device manually - to test this, start a new `ipython` session and type:
```python
import os
os.environ['CUDA_VISIBLE_DEVICES']='0'
import torch
torch.cuda.is_available()
```
The output should say "True" this time. If it does, you can make the change permanent by adding
The output should say `True` this time. If it does, you can make the change permanent by adding:
```bash
export CUDA_VISIBLE_DEVICES=0
```
@ -367,4 +386,3 @@ which will indicate that anaconda folder is `__/anaconda__`. We'll refer to this
5. Navigate back to the Virtual Machine view in Step 2 and click the Start button to start the virtual machine.
</details>


@ -0,0 +1,2 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.


@ -6,3 +6,15 @@ We encourage submissions to the contrib folder, and once they are well-tested, d
Thank you.
#### Azure Machine Learning
If you would like to leverage Azure Machine Learning to create a Training Pipeline with this dataset we have guidance on how do so [here](interpretation/deepseismic_interpretation/azureml_pipelines/README.md)
### HRNet model guidance (experimental for now)
#### HRNet ImageNet weights model
To enable training from scratch on seismic data and to achieve the same results as the benchmarks quoted below you will need to download the HRNet model [pretrained](https://github.com/HRNet/HRNet-Image-Classification) on ImageNet. We are specifically using the [HRNet-W48-C](https://1drv.ms/u/s!Aus8VCZ_C_33dKvqI6pBZlifgJk) pre-trained model; other HRNet variants are also available [here](https://github.com/HRNet/HRNet-Image-Classification) - you can navigate to those from the [main HRNet landing page](https://github.com/HRNet/HRNet-Object-Detection) for object detection.
Unfortunately, the OneDrive location which is used to host the model is using a temporary authentication token, so there is no way for us to script up model download. There are two ways to upload and use the pre-trained HRNet model on DS VM:
- download the model to your local drive using a web browser of your choice and then upload the model to the DS VM using something like `scp`; navigate to Portal and copy DS VM's public IP from the Overview panel of your DS VM (you can search your DS VM by name in the search bar of the Portal) then use `scp local_model_location username@DS_VM_public_IP:./model/save/path` to upload
- alternatively, you can use the same public IP to open remote desktop over SSH to your Linux VM using [X2Go](https://wiki.x2go.org/doku.php/download:start): you can basically open the web browser on your VM this way and download the model to VM's disk


@ -19,7 +19,7 @@ Now you're all set to run training and testing experiments on the F3 Netherlands
### Monitoring progress with TensorBoard
- from the this directory, run `tensorboard --logdir='output'` (all runtime logging information is
written to the `output` folder
- open a web-browser and go to either vmpublicip:6006 if running remotely or localhost:6006 if running locally
- open a web-browser and go to either `<vm_public_ip>:6006` if running remotely or localhost:6006 if running locally
> **NOTE**:If running remotely remember that the port must be open and accessible
More information on Tensorboard can be found [here](https://www.tensorflow.org/get_started/summaries_and_tensorboard#launching_tensorboard).


@ -20,7 +20,7 @@ Also follow instructions for [downloading and preparing](../../../README.md#peno
### Monitoring progress with TensorBoard
- from the this directory, run `tensorboard --logdir='output'` (all runtime logging information is
written to the `output` folder
- open a web-browser and go to either vmpublicip:6006 if running remotely or localhost:6006 if running locally
- open a web-browser and go to either `<vm_public_ip>:6006` if running remotely or `localhost:6006` if running locally
> **NOTE**:If running remotely remember that the port must be open and accessible
More information on Tensorboard can be found [here](https://www.tensorflow.org/get_started/summaries_and_tensorboard#launching_tensorboard).


@ -39,7 +39,7 @@ nohup time python train.py \
# wait for python to pick up the runtime env before switching it
sleep 1
cd ../../dutchf3_patch/local
cd ../../dutchf3_patch
# patch based without skip connections
export CUDA_VISIBLE_DEVICES=2


@ -1,7 +1,11 @@
#!/bin/bash
# number of GPUs to train on
NGPU=8
NGPUS=$(nvidia-smi -L | wc -l)
if [ "$NGPUS" -lt "2" ]; then
echo "ERROR: cannot run distributed training without 2 or more GPUs."
exit 1
fi
# specify pretrained HRNet backbone
PRETRAINED_HRNET='/home/alfred/models/hrnetv2_w48_imagenet_pretrained.pth'
# DATA_F3='/home/alfred/data/dutch/data'
@ -15,9 +19,8 @@ unset CUDA_VISIBLE_DEVICES
# bug to fix conda not launching from a bash shell
source /data/anaconda/etc/profile.d/conda.sh
conda activate seismic-interpretation
export PYTHONPATH=/storage/repos/forks/seismic-deeplearning-1/interpretation:$PYTHONPATH
cd experiments/interpretation/dutchf3_patch/distributed/
cd experiments/interpretation/dutchf3_patch/
# patch based without skip connections
nohup time python -m torch.distributed.launch --nproc_per_node=${NGPU} train.py \


@ -59,7 +59,7 @@ nohup time python test.py \
--cfg "configs/${CONFIG_NAME}.yaml" > ${CONFIG_NAME}_test.log 2>&1 &
sleep 1
cd ../../dutchf3_patch/local
cd ../../dutchf3_patch
# patch based without skip connections
export CUDA_VISIBLE_DEVICES=2
@ -140,7 +140,7 @@ wait
# scoring scripts are in the local folder
# models are in the distributed folder
cd ../../dutchf3_patch/local
cd ../../dutchf3_patch
# patch based without skip connections
export CUDA_VISIBLE_DEVICES=2


@ -0,0 +1,110 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.
# Pull request against these branches will trigger this build
pr:
- master
- staging
- contrib
# Any commit to this branch will trigger the build.
trigger:
- master
- staging
- contrib
jobs:
# partially disable setup for now - done manually on build VM
- job: setup
timeoutInMinutes: 10
displayName: Setup
pool:
name: deepseismicagentpool
steps:
- bash: |
# terminate as soon as any internal script fails
set -e
echo "Running setup..."
pwd
ls
git branch
uname -ra
# TODO: uncomment in the next release to bring back AML
# # setup run environment
# ./scripts/env_reinstall.sh
#
# # use hardcoded root for now because not sure how env changes under ADO policy
# DATA_ROOT="/home/alfred/data_dynamic"
# ./tests/cicd/src/scripts/get_data_for_builds.sh ${DATA_ROOT}
#
# # upload pre-processed data to AML build WASB storage - overwrites by default and auto-creates container name
# azcopy --quiet --recursive \
# --source ${DATA_ROOT}/dutch_f3/data --destination https://${BLOB_ACCOUNT_NAME}.blob.core.windows.net/${BLOB_CONTAINER_NAME}/data \
# --dest-key ${BLOB_ACCOUNT_KEY}
# env:
# BLOB_ACCOUNT_NAME: $(amlbuildstore)
# BLOB_CONTAINER_NAME: "amlbuild"
# BLOB_ACCOUNT_KEY: $(amlbuildstorekey)
#
#
#- job: AML_pipeline_tests
# dependsOn: setup
# timeoutInMinutes: 20
# displayName: AML pipeline tests
# pool:
# name: deepseismicagentpool
# steps:
# - bash: |
# source activate seismic-interpretation
# # TODO: add code which launches your pytest files ("pytest sometest" OR "python test.py")
# # data is in $(amlbuildstore).blob.core.windows.net/amlbuild/data (container amlbuild, virtual folder data)
# # storage key is $(amlbuildstorekey)
# az --version
# az account show
# az login --service-principal -u $SPIDENTITY -p $SPECRET --tenant $SPTENANT
# az account set --subscription $SUB_ID
# mkdir .azureml
# cat <<EOF > .azureml/config.json
# {
# "subscription_id": "$SUB_ID",
# "resource_group": "$RESOURCE_GROUP",
# "workspace_name": "$WORKSPACE_NAME"
# }
# EOF
# pytest interpretation/tests/test_train_pipeline.py || EXITCODE=123
# exit $EXITCODE
# pytest
# env:
# SUB_ID: $(subscription_id)
# RESOURCE_GROUP: $(resource_group)
# WORKSPACE_NAME: $(workspace_name)
# BLOB_ACCOUNT_NAME: $(amlbuildstore)
# BLOB_CONTAINER_NAME: "amlbuild"
# BLOB_ACCOUNT_KEY: $(amlbuildstorekey)
# BLOB_SUB_ID: $(subscription_id)
# AML_COMPUTE_CLUSTER_NAME: "testcluster"
# AML_COMPUTE_CLUSTER_MIN_NODES: "1"
# AML_COMPUTE_CLUSTER_MAX_NODES: "8"
# AML_COMPUTE_CLUSTER_SKU: "STANDARD_NC6"
# SPIDENTITY: $(spidentity)
# SPECRET: $(spsecret)
# SPTENANT: $(sptenant)
# displayName: 'integration tests'
# - job: AML_short_pipeline_test
# dependsOn: setup
# timeoutInMinutes: 5
# displayName: AML short pipeline test
# pool:
# name: deepseismicagentpool
# steps:
# - bash: |
# source activate seismic-interpretation
# # TODO: OPTIONAL! Add a job which launches entire training pipeline for 1 epoch of training (train model for single epoch)
# # if you don't want this then delete the entire job from this file
# python interpretation/deepseismic_interpretation/azureml_pipelines/dev/kickoff_train_pipeline.py --experiment=DEV-train-pipeline-name --orchestrator_config=orchestrator_config="interpretation/deepseismic_interpretation/azureml_pipelines/pipeline_config.json"


@ -0,0 +1,2 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.


@ -31,7 +31,7 @@ class SnapshotHandler:
def __call__(self, engine, to_save):
self._checkpoint_handler(engine, to_save)
if self._snapshot_function():
files = glob.glob(os.path.join(self._model_save_location, self._running_model_prefix + "*"))
files = glob.glob(os.path.join(self._model_save_location, self._running_model_prefix + "*"))
name_postfix = os.path.basename(files[0]).lstrip(self._running_model_prefix)
copyfile(
files[0],


@ -0,0 +1,2 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.


@ -10,6 +10,7 @@ from toolz import curry
from cv_lib.segmentation.dutchf3.utils import np_to_tb
from cv_lib.utils import decode_segmap
def create_summary_writer(log_dir):
writer = SummaryWriter(logdir=log_dir)
return writer
@ -20,9 +21,9 @@ def _transform_image(output_tensor):
return torchvision.utils.make_grid(output_tensor, normalize=True, scale_each=True)
def _transform_pred(output_tensor):
def _transform_pred(output_tensor, n_classes):
output_tensor = output_tensor.squeeze().cpu().numpy()
decoded = decode_segmap(output_tensor)
decoded = decode_segmap(output_tensor, n_classes)
return torchvision.utils.make_grid(np_to_tb(decoded), normalize=False, scale_each=False)
@ -111,5 +112,5 @@ def log_results(engine, evaluator, summary_writer, n_classes, stage):
y_pred[mask == 255] = 255
summary_writer.add_image(f"{stage}/Image", _transform_image(image), epoch)
summary_writer.add_image(f"{stage}/Mask", _transform_pred(mask), epoch)
summary_writer.add_image(f"{stage}/Pred", _transform_pred(y_pred), epoch)
summary_writer.add_image(f"{stage}/Mask", _transform_pred(mask, n_classes), epoch)
summary_writer.add_image(f"{stage}/Pred", _transform_pred(y_pred, n_classes), epoch)


@ -0,0 +1,2 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.


@ -37,4 +37,3 @@ def git_branch():
def git_hash():
repo = Repo(search_parent_directories=True)
return repo.active_branch.commit.hexsha


@ -304,4 +304,5 @@ def get_seg_model(cfg, **kwargs):
cfg.MODEL.IN_CHANNELS == 1
), f"Patch deconvnet is not implemented to accept {cfg.MODEL.IN_CHANNELS} channels. Please only pass 1 for cfg.MODEL.IN_CHANNELS"
model = patch_deconvnet_skip(n_classes=cfg.DATASET.NUM_CLASSES)
return model


@ -1,11 +1,16 @@
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.
import logging
import os
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision
logger = logging.getLogger(__name__)
class FPAv2(nn.Module):
def __init__(self, input_dim, output_dim):


@ -304,4 +304,5 @@ def get_seg_model(cfg, **kwargs):
cfg.MODEL.IN_CHANNELS == 1
), f"Section deconvnet is not implemented to accept {cfg.MODEL.IN_CHANNELS} channels. Please only pass 1 for cfg.MODEL.IN_CHANNELS"
model = section_deconvnet(n_classes=cfg.DATASET.NUM_CLASSES)
return model


@ -304,4 +304,5 @@ def get_seg_model(cfg, **kwargs):
cfg.MODEL.IN_CHANNELS == 1
), f"Section deconvnet is not implemented to accept {cfg.MODEL.IN_CHANNELS} channels. Please only pass 1 for cfg.MODEL.IN_CHANNELS"
model = section_deconvnet_skip(n_classes=cfg.DATASET.NUM_CLASSES)
return model


@ -430,21 +430,20 @@ class HighResolutionNet(nn.Module):
if pretrained and not os.path.isfile(pretrained):
raise FileNotFoundError(f"The file {pretrained} was not found. Please supply correct path or leave empty")
if os.path.isfile(pretrained):
pretrained_dict = torch.load(pretrained)
logger.info("=> loading pretrained model {}".format(pretrained))
model_dict = self.state_dict()
pretrained_dict = {k: v for k, v in pretrained_dict.items() if k in model_dict.keys()}
for k, _ in pretrained_dict.items():
logger.info(
'=> loading {} pretrained model {}'.format(k, pretrained))
logger.info("=> loading {} pretrained model {}".format(k, pretrained))
model_dict.update(pretrained_dict)
self.load_state_dict(model_dict)
def get_seg_model(cfg, **kwargs):
model = HighResolutionNet(cfg, **kwargs)
model.init_weights(cfg.MODEL.PRETRAINED)
if "PRETRAINED" in cfg.MODEL.keys():
model.init_weights(cfg.MODEL.PRETRAINED)
return model


@ -113,4 +113,5 @@ class UNet(nn.Module):
def get_seg_model(cfg, **kwargs):
model = UNet(cfg.MODEL.IN_CHANNELS, cfg.DATASET.NUM_CLASSES)
return model


@ -3,7 +3,6 @@
import numpy as np
def _chw_to_hwc(image_array_numpy):
return np.moveaxis(image_array_numpy, 0, -1)


@ -8,13 +8,17 @@ import numpy as np
from matplotlib import pyplot as plt
def normalize(array):
def normalize(array, MIN, MAX):
"""
Normalizes a segmentation mask array to be in [0,1] range
for use with PIL.Image
Normalizes a segmentation image array by the global range of the data,
MIN and MAX, for use with PIL.Image
"""
min = array.min()
return (array - min) / (array.max() - min)
den = MAX - MIN
if den == 0:
den += np.finfo(float).eps
return (array - MIN) / den
def mask_to_disk(mask, fname, n_classes, cmap_name="rainbow"):
@ -30,15 +34,15 @@ def mask_to_disk(mask, fname, n_classes, cmap_name="rainbow"):
Image.fromarray(cmap(mask / n_classes, bytes=True)).save(fname)
def image_to_disk(mask, fname, cmap_name="seismic"):
def image_to_disk(image, fname, MIN, MAX, cmap_name="seismic"):
"""
write segmentation image to disk using a particular colormap
"""
cmap = plt.get_cmap(cmap_name)
Image.fromarray(cmap(normalize(mask), bytes=True)).save(fname)
Image.fromarray(cmap(normalize(image, MIN, MAX), bytes=True)).save(fname)
def decode_segmap(label_mask, colormap_name="rainbow"):
def decode_segmap(label_mask, n_classes, colormap_name="rainbow"):
"""
Decode segmentation class labels into a colour image
Args:
@ -51,7 +55,7 @@ def decode_segmap(label_mask, colormap_name="rainbow"):
cmap = plt.get_cmap(colormap_name)
# loop over the batch
for i in range(label_mask.shape[0]):
im = Image.fromarray(cmap(normalize(label_mask[i, :, :]), bytes=True)).convert("RGB")
im = Image.fromarray(cmap((label_mask[i, :, :] / n_classes), bytes=True)).convert("RGB")
out[i, :, :, :] = np.array(im).swapaxes(0, 2).swapaxes(1, 2)
return out


@ -1,3 +1,6 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.
import torch
import numpy as np
from pytest import approx


@ -2,7 +2,7 @@ This Docker image allows the user to run the notebooks in this repository on any
# Download the HRNet model:
To run the [`Dutch_F3_patch_model_training_and_evaluation.ipynb`](https://github.com/microsoft/seismic-deeplearning/blob/master/examples/interpretation/notebooks/Dutch_F3_patch_model_training_and_evaluation.ipynb), you will need to manually download the [HRNet-W48-C](https://1drv.ms/u/s!Aus8VCZ_C_33dKvqI6pBZlifgJk) pretrained model. You can follow the instructions [here.](../README.md#pretrained-models).
To run the [`Dutch_F3_patch_model_training_and_evaluation.ipynb`](https://github.com/microsoft/seismic-deeplearning/blob/master/examples/interpretation/notebooks/Dutch_F3_patch_model_training_and_evaluation.ipynb), you will need to manually download the [HRNet-W48-C](https://1drv.ms/u/s!Aus8VCZ_C_33dKvqI6pBZlifgJk) pretrained model. You can follow the instructions [here](../README.md#pretrained-models).
If you are using an Azure Virtual Machine to run this code, you can download the model to your local machine, and then copy it to your Azure VM through the command below. Please make sure you update the `<azureuser>` and `<azurehost>` fields.
```bash


@ -12,7 +12,7 @@ dependencies:
- torchvision>=0.5.0
- pandas==0.25.3
- scikit-learn==0.21.3
- tensorflow==2.0
- tensorflow==2.1.0
- opt-einsum>=2.3.2
- tqdm==4.39.0
- itkwidgets==0.23.1
@ -39,4 +39,3 @@ dependencies:
- jupytext==1.3.0
- validators
- pyyaml


@ -1,5 +1,5 @@
The folder contains notebook examples illustrating the use of segmentation algorithms on openly available datasets. Make sure you have followed the [set up instructions](../../README.md) before running these examples. We provide the following notebook examples
* [Dutch F3 dataset](notebooks/Dutch_F3_patch_model_training_and_evaluation.ipynb): This notebook illustrates section and patch based segmentation approaches on the [Dutch F3](https://terranubis.com/datainfo/Netherlands-Offshore-F3-Block-Complete) open dataset. This notebook uses denconvolution based segmentation algorithm on 2D patches. The notebook will guide you through visualization of the input volume, setting up model training and evaluation.
* [Dutch F3 dataset](notebooks/Dutch_F3_patch_model_training_and_evaluation.ipynb): This notebook illustrates section and patch based segmentation approaches on the [Dutch F3](https://terranubis.com/datainfo/Netherlands-Offshore-F3-Block-Complete) open dataset. This notebook uses deconvolution based segmentation algorithm on 2D patches. The notebook will guide you through visualization of the input volume, setting up model training and evaluation.
To understand the configuration files and the dafault parameters refer to this [section in the top level README](../../README.md#configuration-files)
To understand the configuration files and the default parameters refer to this [section in the top level README](../../README.md#configuration-files)


@ -59,7 +59,7 @@
"source": [
"# load an existing experiment configuration file\n",
"CONFIG_FILE = (\n",
" \"../../../experiments/interpretation/dutchf3_patch/local/configs/hrnet.yaml\"\n",
" \"../../../experiments/interpretation/dutchf3_patch/configs/seresnet_unet.yaml\"\n",
")\n",
"# number of images to score\n",
"N_EVALUATE = 20\n",
@ -239,7 +239,7 @@
"max_snapshots = config.TRAIN.SNAPSHOTS\n",
"papermill = False\n",
"dataset_root = config.DATASET.ROOT\n",
"model_pretrained = config.MODEL.PRETRAINED"
"model_pretrained = config.MODEL.PRETRAINED if \"PRETRAINED\" in config.MODEL.keys() else None"
]
},
{
@ -511,23 +511,17 @@
"TrainPatchLoader = get_patch_loader(config)\n",
"\n",
"train_set = TrainPatchLoader(\n",
" config.DATASET.ROOT,\n",
" config.DATASET.NUM_CLASSES,\n",
" config,\n",
" split=\"train\",\n",
" is_transform=True,\n",
" stride=config.TRAIN.STRIDE,\n",
" patch_size=config.TRAIN.PATCH_SIZE,\n",
" augmentations=train_aug,\n",
")\n",
"n_classes = train_set.n_classes\n",
"logger.info(train_set)\n",
"val_set = TrainPatchLoader(\n",
" config.DATASET.ROOT,\n",
" config.DATASET.NUM_CLASSES,\n",
" config,\n",
" split=\"val\",\n",
" is_transform=True,\n",
" stride=config.TRAIN.STRIDE,\n",
" patch_size=config.TRAIN.PATCH_SIZE,\n",
" augmentations=val_aug,\n",
")\n",
"\n",
@ -865,9 +859,17 @@
"outputs": [],
"source": [
"# use the model which we just fine-tuned\n",
"opts = [\"TEST.MODEL_PATH\", path.join(output_dir, f\"model_f3_nb_seg_hrnet_{train_len}.pth\")]\n",
"if \"hrnet\" in config.MODEL.NAME:\n",
" model_snapshot_name = f\"model_f3_nb_seg_hrnet_{train_len}.pth\"\n",
"elif \"resnet\" in config.MODEL.NAME: \n",
" model_snapshot_name = f\"model_f3_nb_resnet_unet_{train_len}.pth\"\n",
"else:\n",
" raise NotImplementedError(\"We don't support testing this model in this notebook yet\")\n",
" \n",
"opts = [\"TEST.MODEL_PATH\", path.join(output_dir, model_snapshot_name)]\n",
"# uncomment the line below to use the pre-trained model instead\n",
"# opts = [\"TEST.MODEL_PATH\", config.MODEL.PRETRAINED]\n",
"\n",
"config.merge_from_list(opts)"
]
},
@ -877,7 +879,9 @@
"metadata": {},
"outputs": [],
"source": [
"model.load_state_dict(torch.load(config.TEST.MODEL_PATH))\n",
"trained_model = torch.load(config.TEST.MODEL_PATH)\n",
"trained_model = {k.replace(\"module.\", \"\"): v for (k, v) in trained_model.items()}\n",
"model.load_state_dict(trained_model, strict=True)\n",
"model = model.to(device)"
]
},
@ -932,7 +936,7 @@
"# Load test data\n",
"TestSectionLoader = get_test_loader(config)\n",
"test_set = TestSectionLoader(\n",
" config.DATASET.ROOT, config.DATASET.NUM_CLASSES, split=split, is_transform=True, augmentations=section_aug\n",
" config, split=split, is_transform=True, augmentations=section_aug\n",
")\n",
"# needed to fix this bug in pytorch https://github.com/pytorch/pytorch/issues/973\n",
"# one of the workers will quit prematurely\n",


@ -23,9 +23,9 @@ class runningScore(object):
def _fast_hist(self, label_true, label_pred, n_class):
mask = (label_true >= 0) & (label_true < n_class)
hist = np.bincount(
n_class * label_true[mask].astype(int) + label_pred[mask], minlength=n_class ** 2,
).reshape(n_class, n_class)
hist = np.bincount(n_class * label_true[mask].astype(int) + label_pred[mask], minlength=n_class ** 2,).reshape(
n_class, n_class
)
return hist
def update(self, label_trues, label_preds):
@ -152,9 +152,7 @@ def compose_processing_pipeline(depth, aug=None):
def _generate_batches(h, w, ps, patch_size, stride, batch_size=64):
hdc_wdx_generator = itertools.product(
range(0, h - patch_size + ps, stride), range(0, w - patch_size + ps, stride)
)
hdc_wdx_generator = itertools.product(range(0, h - patch_size + ps, stride), range(0, w - patch_size + ps, stride))
for batch_indexes in itertoolz.partition_all(batch_size, hdc_wdx_generator):
yield batch_indexes
@ -166,9 +164,7 @@ def output_processing_pipeline(config, output):
_, _, h, w = output.shape
if config.TEST.POST_PROCESSING.SIZE != h or config.TEST.POST_PROCESSING.SIZE != w:
output = F.interpolate(
output,
size=(config.TEST.POST_PROCESSING.SIZE, config.TEST.POST_PROCESSING.SIZE),
mode="bilinear",
output, size=(config.TEST.POST_PROCESSING.SIZE, config.TEST.POST_PROCESSING.SIZE), mode="bilinear",
)
if config.TEST.POST_PROCESSING.CROP_PIXELS > 0:
@ -183,15 +179,7 @@ def output_processing_pipeline(config, output):
def patch_label_2d(
model,
img,
pre_processing,
output_processing,
patch_size,
stride,
batch_size,
device,
num_classes,
model, img, pre_processing, output_processing, patch_size, stride, batch_size, device, num_classes,
):
"""Processes a whole section"""
img = torch.squeeze(img)
@ -205,19 +193,14 @@ def patch_label_2d(
# generate output:
for batch_indexes in _generate_batches(h, w, ps, patch_size, stride, batch_size=batch_size):
batch = torch.stack(
[
pipe(img_p, _extract_patch(hdx, wdx, ps, patch_size), pre_processing)
for hdx, wdx in batch_indexes
],
[pipe(img_p, _extract_patch(hdx, wdx, ps, patch_size), pre_processing) for hdx, wdx in batch_indexes],
dim=0,
)
model_output = model(batch.to(device))
for (hdx, wdx), output in zip(batch_indexes, model_output.detach().cpu()):
output = output_processing(output)
output_p[
:, :, hdx + ps : hdx + ps + patch_size, wdx + ps : wdx + ps + patch_size
] += output
output_p[:, :, hdx + ps : hdx + ps + patch_size, wdx + ps : wdx + ps + patch_size] += output
# crop the output_p in the middle
output = output_p[:, :, ps:-ps, ps:-ps]
@ -325,26 +308,22 @@ def download_pretrained_model(config):
elif "penobscot" in config.DATASET.ROOT:
dataset = "penobscot"
else:
raise NameError(
"Unknown dataset name. Only dutch f3 and penobscot are currently supported."
)
raise NameError("Unknown dataset name. Only dutch f3 and penobscot are currently supported.")
if "hrnet" in config.MODEL.NAME:
model = "hrnet"
elif "deconvnet" in config.MODEL.NAME:
model = "deconvnet"
elif "unet" in config.MODEL.NAME:
model = "unet"
elif "resnet" in config.MODEL.NAME:
model = "seresnetunet"
else:
raise NameError(
"Unknown model name. Only hrnet, deconvnet, and unet are currently supported."
)
raise NameError("Unknown model name. Only hrnet, deconvnet, and seresnet_unet are currently supported.")
# check if the user already supplied a URL, otherwise figure out the URL
if validators.url(config.MODEL.PRETRAINED):
if "PRETRAINED" in config.MODEL.keys() and validators.url(config.MODEL.PRETRAINED):
url = config.MODEL.PRETRAINED
print(f"Will use user-supplied URL of '{url}'")
elif os.path.isfile(config.MODEL.PRETRAINED):
elif "PRETRAINED" in config.MODEL.keys() and os.path.isfile(config.MODEL.PRETRAINED):
url = None
print(f"Will use user-supplied file on local disk of '{config.MODEL.PRETRAINED}'")
else:
@ -365,38 +344,26 @@ def download_pretrained_model(config):
url = "https://deepseismicsharedstore.blob.core.windows.net/master-public-models/dutchf3_hrnet_patch_section_depth.pth"
elif model == "hrnet" and config.TRAIN.DEPTH == "patch":
url = "https://deepseismicsharedstore.blob.core.windows.net/master-public-models/dutchf3_hrnet_patch_patch_depth.pth"
elif (
model == "deconvnet"
and "skip" in config.MODEL.NAME
and config.TRAIN.DEPTH == "none"
):
elif model == "deconvnet" and "skip" in config.MODEL.NAME and config.TRAIN.DEPTH == "none":
url = "http://deepseismicsharedstore.blob.core.windows.net/master-public-models/dutchf3_deconvnetskip_patch_no_depth.pth"
elif (
model == "deconvnet"
and "skip" not in config.MODEL.NAME
and config.TRAIN.DEPTH == "none"
):
elif model == "deconvnet" and "skip" not in config.MODEL.NAME and config.TRAIN.DEPTH == "none":
url = "http://deepseismicsharedstore.blob.core.windows.net/master-public-models/dutchf3_deconvnet_patch_no_depth.pth"
elif model == "unet" and config.TRAIN.DEPTH == "section":
url = "http://deepseismicsharedstore.blob.core.windows.net/master-public-models/dutchf3_seresnetunet_patch_section_depth.pth"
elif model == "seresnetunet" and config.TRAIN.DEPTH == "section":
url = "https://deepseismicsharedstore.blob.core.windows.net/master-public-models/dutchf3_seresnetunet_patch_section_depth.pth"
else:
raise NotImplementedError(
"We don't store a pretrained model for Dutch F3 for this model combination yet."
)
else:
raise NotImplementedError(
"We don't store a pretrained model for this dataset/model combination yet."
)
raise NotImplementedError("We don't store a pretrained model for this dataset/model combination yet.")
print(f"Could not find a user-supplied URL, downloading from '{url}'")
# make sure the model_dir directory is writeable
model_dir = config.TRAIN.MODEL_DIR
if not os.path.isdir(os.path.dirname(model_dir)) or not os.access(
os.path.dirname(model_dir), os.W_OK
):
if not os.path.isdir(os.path.dirname(model_dir)) or not os.access(os.path.dirname(model_dir), os.W_OK):
print(f"Cannot write to TRAIN.MODEL_DIR={config.TRAIN.MODEL_DIR}")
home = str(pathlib.Path.home())
model_dir = os.path.join(home, "models")
@ -407,14 +374,10 @@ def download_pretrained_model(config):
if url:
# Download the pretrained model:
pretrained_model_path = os.path.join(
model_dir, "pretrained_" + dataset + "_" + model + ".pth"
)
pretrained_model_path = os.path.join(model_dir, "pretrained_" + dataset + "_" + model + ".pth")
# always redownload the model
print(
f"Downloading the pretrained model to '{pretrained_model_path}'. This will take a few mintues.. \n"
)
print(f"Downloading the pretrained model to '{pretrained_model_path}'. This will take a few mintues.. \n")
urllib.request.urlretrieve(url, pretrained_model_path)
print("Model successfully downloaded.. \n")
else:
@ -424,6 +387,11 @@ def download_pretrained_model(config):
# Update config MODEL.PRETRAINED
# TODO: Only HRNet uses a pretrained model currently.
# issue https://github.com/microsoft/seismic-deeplearning/issues/267
# now that we have a pre-trained model, we can set it
if "PRETRAINED" not in config.MODEL.keys():
config.MODEL["PRETRAINED"] = "dummy"
opts = [
"MODEL.PRETRAINED",
pretrained_model_path,
@ -432,6 +400,7 @@ def download_pretrained_model(config):
"TEST.MODEL_PATH",
pretrained_model_path,
]
config.merge_from_list(opts)
return config


@ -0,0 +1,155 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Copyright (c) Microsoft Corporation.\n",
"\n",
"Licensed under the MIT License."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Generate Synthetic SEG-Y files for testing\n",
"\n",
"This notebook builds the test data used by the convert_segy unit tests. It covers just a few of the SEG-Y files that could be encountered if you bring your own SEG-Y files for training. This is not a comprehensive set of files, so there may still be situations where segyio or the convert_segy.py utility would fail to load the SEG-Y data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import deepseismic_interpretation.segyconverter.utils.create_segy as utils\n",
"import segyio"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create sample SEG-Y files for testing\n",
"\n",
"1. Control file that represents perfect data, with no missing traces.\n",
"2. Missing traces on the top-left and bottom right of the geographic field w/ inline sorting\n",
"3. Missing traces on the top-left and bottom right of the geographic field w/ crossline sorting\n",
"4. Missing trace in the center of the geographic field w/ inline sorting"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Control File\n",
"\n",
"Create a file that has a cuboid shape with traces at all inline/crosslines"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"controlfile = './normalsegy.segy'\n",
"utils.create_segy_file(lambda il, xl: True, controlfile)\n",
"utils.show_segy_details(controlfile)\n",
"utils.load_segy_with_geometry(controlfile)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Inline Error File\n",
"\n",
"inlineerror.segy will throw an error that inlines are not unique because it assumes the same number of inlines per crossline"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"inlinefile = './inlineerror.segy'\n",
"utils.create_segy_file(lambda il, xl: not ((il < 20 and xl < 125) or (il > 40 and xl > 250)),\n",
" inlinefile, segyio.TraceSortingFormat.INLINE_SORTING)\n",
"utils.show_segy_details(inlinefile)\n",
"# Cannot load this file with inferred geometry; segyio will fail\n",
"# utils.load_segy_with_geometry(inlinefile)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Crossline Error File\n",
"\n",
"xlineerror.segy will throw an error that crosslines are not unique because it assumes the same number of crosslines per inline"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"xlineerrorfile = './xlineerror.segy'\n",
"utils.create_segy_file(lambda il, xl: not ((il < 20 and xl < 125) or (il > 40 and xl > 250)),\n",
" xlineerrorfile, segyio.TraceSortingFormat.CROSSLINE_SORTING)\n",
"utils.show_segy_details(xlineerrorfile)\n",
"# Cannot load this file with inferred geometry; segyio will fail\n",
"# utils.load_segy_with_geometry(xlineerrorfile)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Cube hole SEG-Y file\n",
"\n",
"When collecting seismic data, unless in an area of open ocean, it is rare to be able to collect all trace data from a rectangular field, so a complete, uniform set of traces is uncommon.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"cubehole_segyfile = './cubehole.segy'\n",
"utils.create_segy_file(lambda il, xl: not ((20 < il < 30) and (150 < xl < 250)),\n",
" cubehole_segyfile, segyio.TraceSortingFormat.INLINE_SORTING)\n",
"utils.show_segy_details(cubehole_segyfile)\n",
"# Cannot load this file with inferred geometry; segyio will fail\n",
"# utils.load_segy_with_geometry(cubehole_segyfile)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "seismic-interpretation",
"language": "python",
"name": "seismic-interpretation"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.7"
}
},
"nbformat": 4,
"nbformat_minor": 2
}


@ -0,0 +1,158 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Copyright (c) Microsoft Corporation.\n",
"\n",
"Licensed under the MIT License."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Converting SEG-Y files for training or validation\n",
"\n",
"This notebook describes how to prepare your own SEG-Y files for training.\n",
"\n",
"If you dont have your owns SEG-Y file, you can run *01_segy_sample_files.jpynb* notebook for generating synthetics files.\n",
"\n",
"To use your own SEG-Y volumes to train models in the DeepSeismic repo, you need to bring at least one pair of ground truth and label data SEG-Y files where the files have an identical shape. The seismic data file contains typical SEG-Y post stack data traces and the label data file should contain an integer class label at every sample in each trace.\n",
"\n",
"For each SEG-Y file, run the convert_segy.py script to create a npy file. Optionally, you can normalize and/or clip the data in the SEG-Y file as it is converted to npy.\n",
"\n",
"Once you have a pair of ground truth and related label npy files, you can edit one of the training scripts in the repo to use these files. One example is the [dutchf3 train.py](../../experiments/interpretation/dutchf3_patch/train.py) script.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from itkwidgets import view\n",
"import numpy as np\n",
"import os\n",
"\n",
"SEGYFILE= './normalsegy.segy'\n",
"PREFIX='normalsegy'\n",
"OUTPUTDIR='data'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## convert_segy.py usage"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!python ./convert_segy.py --help"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Example run\n",
"\n",
"Convert the SEG-Y file to a single output npy file in the local directory, clipping the data but not normalizing it"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!python ./convert_segy.py --prefix {PREFIX} --input_file {SEGYFILE} --output_dir {OUTPUTDIR} --clip"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Post processing instructions\n",
"\n",
"There should now be one npy file in the local directory named normalsegy_10_100_00000.npy. The numbers relate to the anchor point\n",
"of the array. In this case, inline 10, crossline 100, and depth 0 is the origin [0,0,0] of the array.\n",
"\n",
"Rerun the convert_segy script for the related label file"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"npydata = np.load(f\"./{OUTPUTDIR}/{PREFIX}_10_100_00000.npy\")\n",
"view(npydata, slicing_planes=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Prepare train/test splits file\n",
"\n",
"Once the data and label SEG-Y files are converted to npy, use the `prepare_dutchf3.py` script on the resulting npy file to generate the list of patches as input to the train script.\n",
"\n",
"In the next cell is an example of how to run this script. Note that we are using the same npy file (normalsegy_10_100_00000.npy) as both seismic and labels because it is only for illustration purposes.\n",
"\n",
"Also, once you've prepared the data set, you'll find your files in the following directory tree: \n",
"\n",
"data_dir \n",
"├── output_dir \n",
"├── split \n",
"│&emsp; ├── section_train.txt \n",
"│&emsp; ├── section_train_val.txt \n",
"│&emsp; ├── section_val.txt "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!python ../../../scripts/prepare_dutchf3.py split_train_val section --data_dir={OUTPUTDIR} --label_file={PREFIX}_10_100_00000.npy --output_dir=splits --section_stride=2 --log_config=None --split_direction=both"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "seismic-interpretation",
"language": "python",
"name": "seismic-interpretation"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.7"
}
},
"nbformat": 4,
"nbformat_minor": 2
}


@ -0,0 +1,67 @@
# SEG-Y Data Utilities
SEG-Y files can have a lot of variability, which makes it difficult to infer the geometry when converting to npy. The segyio module attempts to do so but fails if there are missing traces in the file (which happens regularly). This utility reads traces using segyio with geometry inference turned off to avoid data loading errors and uses its own logic to place traces into a numpy array. If traces are missing, the values of the npy array at those locations are set to zero.
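To make the trace-placement idea concrete, here is a minimal sketch (not the utility's actual implementation) that reads traces with segyio's geometry inference turned off and drops them into a zero-initialized cube. The standard inline/crossline header words and the helper name are assumptions for illustration only:
```python
import numpy as np
import segyio

def load_cube_ignore_geometry(path):
    """Read every trace and place it into a zero-filled cube keyed by inline/crossline."""
    with segyio.open(path, ignore_geometry=True) as f:
        # assume the standard SEG-Y header words for inline/crossline numbers
        ilines = f.attributes(segyio.TraceField.INLINE_3D)[:]
        xlines = f.attributes(segyio.TraceField.CROSSLINE_3D)[:]
        il_index = {il: i for i, il in enumerate(sorted(set(ilines)))}
        xl_index = {xl: i for i, xl in enumerate(sorted(set(xlines)))}
        # missing traces simply stay at zero
        cube = np.zeros((len(il_index), len(xl_index), f.samples.size), dtype=np.float32)
        for t in range(f.tracecount):
            cube[il_index[ilines[t]], xl_index[xlines[t]], :] = f.trace[t]
    return cube
```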
## convert_segy.py script
The `convert_segy.py` script can work with SEG-Y files and output data on local disk. This script will process SEG-Y files regardless of their structure and output npy files for use in training/scoring. In addition to the npy files, it will write a json file that includes the standard deviation and mean of the original data. The script can additionally use those statistics to normalize and clip the data if indicated by the command line parameters.
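As a rough sketch of what clip-and-normalize with the stored mean and standard deviation could look like (the exact clip range and scaling used by the script are assumptions, not its verified behavior):
```python
import numpy as np

def clip_and_normalize(cube, mean, stddev, k=3.0):
    """Clip to mean +/- k standard deviations, then scale to roughly [-1, 1]."""
    clipped = np.clip(cube, mean - k * stddev, mean + k * stddev)
    # tiny epsilon guards against division by zero for constant data
    return (clipped - mean) / (k * stddev + np.finfo(np.float32).eps)
```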
The resulting npy files will use the following naming convention:
```<prefix>_<inline id>_<xline id>_<depth>.npy```
These inline and xline ids are the upper-left location of the data contained in the file and can later be used to identify where the npy file is located in the SEG-Y data.
This script uses [segyio](https://github.com/equinor/segyio) for interaction with SEG-Y.
To use this script, first activate the `seismic-interpretation` environment defined in this repository's setup instructions in the main [README](../../../README.md) file:
`conda activate seismic-interpretation`
Then follow these examples:
1) Convert a SEG-Y file to a single npy file of the same dimensions:
```
python ./convert_segy.py --input_file {SEGYFILE} --prefix {PREFIX} --output_dir .
```
2) Convert a SEG-Y file to a single npy file of the same dimensions, clip and normalize the results:
```
python ./convert_segy.py --input_file {SEGYFILE} --prefix {PREFIX} --output_dir . --normalize
```
3) Convert a SEG-Y file to a single npy file of the same dimensions, clip but do not normalize the results:
```
python ./convert_segy.py --input_file {SEGYFILE} --prefix {PREFIX} --output_dir . --clip
```
4) Split a single SEG-Y file into a set of npy files, each npy array with dimension (100,100,100)
```
python ./convert_segy.py --input_file {SEGYFILE} --prefix {PREFIX} --output_dir . --cube_size 100
```
There are several additional command line arguments that may be needed to load specific SEG-Y files (e.g. the byte locations of the data headers may differ). Run `--help` to review the additional arguments if needed.
Documentation about the SEG-Y format can be found [here](https://seg.org/Portals/0/SEG/News%20and%20Resources/Technical%20Standards/seg_y_rev2_0-mar2017.pdf).
Regarding data headers, we've found that, in practice, the inline and crossline header location standards aren't always followed.
As a result, you will need to print out the text header of the SEG-Y file and read the comments to determine what location was used.
As far as we know, there is no way to programmatically extract this info from the file.
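One quick way to print the textual header with segyio is shown below (a small illustrative sketch; the file name is a placeholder):
```python
import segyio

# "mydata.segy" is a placeholder for your own file
with segyio.open("mydata.segy", ignore_geometry=True) as f:
    # print the (first) textual header wrapped into 80-character lines
    print(segyio.tools.wrap(f.text[0]))
```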
NOTE: Missing traces will be filled in with zero values. A future enhancement to this script should allow for specific values to be used that can be ignored during training.
## Testing
Run [pytest](https://docs.pytest.org/en/latest/getting-started.html) from the segyconverter directory to run the local unit tests.
To run all scripts available in the [test folder](../../../interpretation/deepseismic_interpretation/segyconverter/test):
```
pytest test
```
To run a specific script:
```
pytest test/<script_name.py>
```


@ -0,0 +1 @@
../../../interpretation/deepseismic_interpretation/segyconverter/convert_segy.py


@ -1,29 +1,30 @@
## F3 Netherlands Patch Experiments
## Dutch F3 Patch Experiments
In this folder are training and testing scripts that work on the F3 Netherlands dataset.
You can run five different models on this dataset:
* [HRNet](local/configs/hrnet.yaml)
* [SEResNet](local/configs/seresnet_unet.yaml)
* [UNet](local/configs/unet.yaml)
* [PatchDeconvNet](local/configs/patch_deconvnet.yaml)
* [PatchDeconvNet-Skip](local/configs/patch_deconvnet_skip.yaml)
* [HRNet](configs/hrnet.yaml)
* [SEResNet](configs/seresnet_unet.yaml)
* [UNet](configs/unet.yaml)
* [PatchDeconvNet](configs/patch_deconvnet.yaml)
* [PatchDeconvNet-Skip](configs/patch_deconvnet_skip.yaml)
All these models take 2D patches of the dataset as input and provide predictions for those patches. The patches need to be stitched together to form a whole inline or crossline.
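For intuition, below is a minimal numpy sketch of how overlapping per-patch class scores can be stitched back into a full section. The repo's own stitching lives in its utilities (e.g. `patch_label_2d`), so the function here is illustrative only and its shapes and name are assumptions:
```python
import numpy as np

def stitch_patches(patch_scores, coords, section_shape, n_classes, patch_size):
    """Accumulate (n_classes, patch_size, patch_size) scores into a full section and take argmax."""
    scores = np.zeros((n_classes,) + section_shape, dtype=np.float32)
    counts = np.zeros(section_shape, dtype=np.float32)
    for p, (h, w) in zip(patch_scores, coords):
        scores[:, h : h + patch_size, w : w + patch_size] += p
        counts[h : h + patch_size, w : w + patch_size] += 1.0
    # average overlapping contributions, then pick the most likely class per pixel
    return (scores / np.maximum(counts, 1.0)).argmax(axis=0)
```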
To understand the configuration files and the dafault parameters refer to this [section in the top level README](../../../README.md#configuration-files)
To understand the configuration files and the default parameters refer to this [section in the top level README](../../../README.md#configuration-files)
### Setup
Please set up a conda environment following the instructions in the top-level [README.md](../../../README.md#setting-up-environment) file.
Also follow instructions for [downloading and preparing](../../../README.md#f3-Netherlands) the data.
Please set up a conda environment following the instructions in the top-level [README.md](../../../README.md#setting-up-environment) file. Also follow instructions for [downloading and preparing](../../../README.md#f3-Netherlands) the data.
### Running experiments
Now you're all set to run training and testing experiments on the F3 Netherlands dataset. Please start from the `train.sh` and `test.sh` scripts under the `local/` directory, which invoke the corresponding python scripts. Take a look at the project configurations in (e.g in `default.py`) for experiment options and modify if necessary.
Now you're all set to run training and testing experiments on the Dutch F3 dataset. Please start from the `train.sh` and `test.sh` scripts, which invoke the corresponding Python scripts. If you have a multi-GPU machine, you can also train the model in a distributed fashion by running `train_distributed.sh`. Take a look at the project configuration (e.g. in `default.py`) for experiment options and modify if necessary.
Please note that we use [NVIDIA's NCCL](https://docs.nvidia.com/deeplearning/nccl/install-guide/index.html) library to enable distributed training. Please follow the installation instructions [here](https://docs.nvidia.com/deeplearning/nccl/install-guide/index.html#down) to install NCCL on your system.
### Monitoring progress with TensorBoard
- from the this directory, run `tensorboard --logdir='output'` (all runtime logging information is
written to the `output` folder
- open a web-browser and go to either vmpublicip:6006 if running remotely or localhost:6006 if running locally
- from this directory, run `tensorboard --logdir='output'` (all runtime logging information is written to the `output` folder)
- open a web-browser and go to either `<vm_public_ip>:6006` if running remotely or `localhost:6006` if running locally
> **NOTE**: If running remotely, remember that the port must be open and accessible.
More information on TensorBoard can be found [here](https://www.tensorflow.org/get_started/summaries_and_tensorboard#launching_tensorboard).


@ -0,0 +1,13 @@
git+https://github.com/microsoft/seismic-deeplearning.git@contrib#egg=cv_lib&subdirectory=cv_lib
git+https://github.com/microsoft/seismic-deeplearning.git#egg=deepseismic-interpretation&subdirectory=interpretation
opencv-python==4.1.2.30
numpy>=1.17.0
torch==1.4.0
pytorch-ignite==0.3.0.dev20191105 # pre-release until stable available
fire==0.2.1
albumentations==0.4.3
toolz==0.10.0
segyio==1.8.8
scipy==1.1.0
gitpython==3.0.5
yacs==0.1.6


@ -16,6 +16,8 @@ DATASET:
NUM_CLASSES: 6
ROOT: "/home/username/data/dutch/data"
CLASS_WEIGHTS: [0.7151, 0.8811, 0.5156, 0.9346, 0.9683, 0.9852]
MIN: -1
MAX: 1
MODEL:


@ -14,6 +14,8 @@ DATASET:
NUM_CLASSES: 6
ROOT: /home/username/data/dutch/data
CLASS_WEIGHTS: [0.7151, 0.8811, 0.5156, 0.9346, 0.9683, 0.9852]
MIN: -1
MAX: 1
MODEL:
NAME: patch_deconvnet


@ -14,6 +14,8 @@ DATASET:
NUM_CLASSES: 6
ROOT: /home/username/data/dutch/data
CLASS_WEIGHTS: [0.7151, 0.8811, 0.5156, 0.9346, 0.9683, 0.9852]
MIN: -1
MAX: 1
MODEL:
NAME: patch_deconvnet_skip


@ -9,12 +9,15 @@ WORKERS: 4
PRINT_FREQ: 10
LOG_CONFIG: logging.conf
SEED: 2019
OPENCV_BORDER_CONSTANT: 0
DATASET:
NUM_CLASSES: 6
ROOT: /home/username/data/dutch/data
ROOT: "/home/username/data/dutch/data"
CLASS_WEIGHTS: [0.7151, 0.8811, 0.5156, 0.9346, 0.9683, 0.9852]
MIN: -1
MAX: 1
MODEL:
NAME: resnet_unet


@ -17,6 +17,8 @@ DATASET:
NUM_CLASSES: 6
ROOT: '/home/username/data/dutch/data'
CLASS_WEIGHTS: [0.7151, 0.8811, 0.5156, 0.9346, 0.9683, 0.9852]
MIN: -1
MAX: 1
MODEL:
NAME: resnet_unet


@ -37,6 +37,8 @@ _C.DATASET = CN()
_C.DATASET.ROOT = ""
_C.DATASET.NUM_CLASSES = 6
_C.DATASET.CLASS_WEIGHTS = [0.7151, 0.8811, 0.5156, 0.9346, 0.9683, 0.9852]
_C.DATASET.MIN = -1
_C.DATASET.MAX = 1
# common params for NETWORK
_C.MODEL = CN()


@ -1,2 +0,0 @@
#!/bin/bash
python test.py --cfg "configs/seresnet_unet.yaml"


@ -1,2 +0,0 @@
#!/bin/bash
python train.py --cfg "configs/seresnet_unet.yaml"


@ -201,10 +201,23 @@ def _output_processing_pipeline(config, output):
def _patch_label_2d(
model, img, pre_processing, output_processing, patch_size, stride, batch_size, device, num_classes, split, debug
model,
img,
pre_processing,
output_processing,
patch_size,
stride,
batch_size,
device,
num_classes,
split,
debug,
MIN,
MAX,
):
"""Processes a whole section
"""
img = torch.squeeze(img)
h, w = img.shape[-2], img.shape[-1] # height and width
@ -228,19 +241,19 @@ def _patch_label_2d(
# dump the data right before it's being put into the model and after scoring
if debug:
outdir = f"debug/batch_{split}"
outdir = f"debug/test/batch_{split}"
generate_path(outdir)
for i in range(batch.shape[0]):
path_prefix = f"{outdir}/{batch_indexes[i][0]}_{batch_indexes[i][1]}"
model_output = model_output.detach().cpu()
# save image:
image_to_disk(np.array(batch[i, 0, :, :]), path_prefix + "_img.png")
image_to_disk(np.array(batch[i, 0, :, :]), path_prefix + "_img.png", MIN, MAX)
# dump model prediction:
mask_to_disk(model_output[i, :, :, :].argmax(dim=0).numpy(), path_prefix + "_pred.png", num_classes)
# dump model confidence values
for nclass in range(num_classes):
image_to_disk(
model_output[i, nclass, :, :].numpy(), path_prefix + f"_class_{nclass}_conf.png",
model_output[i, nclass, :, :].numpy(), path_prefix + f"_class_{nclass}_conf.png", MIN, MAX
)
# crop the output_p in the middle
@ -249,46 +262,56 @@ def _patch_label_2d(
def _evaluate_split(
split, section_aug, model, pre_processing, output_processing, device, running_metrics_overall, config, debug=False,
split,
section_aug,
model,
pre_processing,
output_processing,
device,
running_metrics_overall,
config,
data_flow,
debug=False,
):
logger = logging.getLogger(__name__)
TestSectionLoader = get_test_loader(config)
test_set = TestSectionLoader(
config.DATASET.ROOT,
config.DATASET.NUM_CLASSES,
split=split,
is_transform=True,
augmentations=section_aug,
debug=debug,
)
test_set = TestSectionLoader(config, split=split, is_transform=True, augmentations=section_aug, debug=debug,)
n_classes = test_set.n_classes
if debug:
data_flow[split] = dict()
data_flow[split]["test_section_loader_length"] = len(test_set)
data_flow[split]["test_input_shape"] = test_set.seismic.shape
data_flow[split]["test_label_shape"] = test_set.labels.shape
data_flow[split]["n_classes"] = n_classes
test_loader = data.DataLoader(test_set, batch_size=1, num_workers=config.WORKERS, shuffle=False)
if debug:
data_flow[split]["test_loader_length"] = len(test_loader)
logger.info("Running in Debug/Test mode")
test_loader = take(2, test_loader)
take_n = 2
test_loader = take(take_n, test_loader)
data_flow[split]["take_n_sections"] = take_n
pred_list, gt_list, img_list = [], [], []
try:
output_dir = generate_path(
f"debug/{config.OUTPUT_DIR}_test_{split}", git_branch(), git_hash(), config.MODEL.NAME, current_datetime(),
f"{config.OUTPUT_DIR}/test/{split}", git_branch(), git_hash(), config.MODEL.NAME, current_datetime(),
)
except:
output_dir = generate_path(f"debug/{config.OUTPUT_DIR}_test_{split}", config.MODEL.NAME, current_datetime(),)
output_dir = generate_path(f"{config.OUTPUT_DIR}/test/{split}", config.MODEL.NAME, current_datetime(),)
running_metrics_split = runningScore(n_classes)
# evaluation mode:
with torch.no_grad(): # operations inside don't track history
model.eval()
total_iteration = 0
for i, (images, labels) in enumerate(test_loader):
logger.info(f"split: {split}, section: {i}")
total_iteration = total_iteration + 1
outputs = _patch_label_2d(
model,
images,
@ -301,10 +324,17 @@ def _evaluate_split(
n_classes,
split,
debug,
config.DATASET.MIN,
config.DATASET.MAX,
)
pred = outputs.detach().max(1)[1].numpy()
gt = labels.numpy()
if debug:
pred_list.append((pred.shape, len(np.unique(pred))))
gt_list.append((gt.shape, len(np.unique(gt))))
img_list.append(images.numpy().shape)
running_metrics_split.update(gt, pred)
running_metrics_overall.update(gt, pred)
@ -312,6 +342,11 @@ def _evaluate_split(
mask_to_disk(pred.squeeze(), os.path.join(output_dir, f"{i}_pred.png"), n_classes)
mask_to_disk(gt.squeeze(), os.path.join(output_dir, f"{i}_gt.png"), n_classes)
if debug:
data_flow[split]["pred_shape"] = pred_list
data_flow[split]["gt_shape"] = gt_list
data_flow[split]["img_shape"] = img_list
# get scores
score, class_iou = running_metrics_split.get_scores()
@ -362,12 +397,14 @@ def test(*options, cfg=None, debug=False):
load_log_configuration(config.LOG_CONFIG)
logger = logging.getLogger(__name__)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
log_dir, model_name = os.path.split(config.TEST.MODEL_PATH)
log_dir, _ = os.path.split(config.TEST.MODEL_PATH)
# load model:
model = getattr(models, config.MODEL.NAME).get_seg_model(config)
model.load_state_dict(torch.load(config.TEST.MODEL_PATH), strict=False)
model = model.to(device) # Send to GPU if available
trained_model = torch.load(config.TEST.MODEL_PATH)
trained_model = {k.replace("module.", ""): v for (k, v) in trained_model.items()}
model.load_state_dict(trained_model, strict=True)
model = model.to(device)
running_metrics_overall = runningScore(n_classes)
@ -395,6 +432,7 @@ def test(*options, cfg=None, debug=False):
output_processing = _output_processing_pipeline(config)
splits = ["test1", "test2"] if "Both" in config.TEST.SPLIT else [config.TEST.SPLIT]
data_flow = dict()
for sdx, split in enumerate(splits):
labels = np.load(path.join(config.DATASET.ROOT, "test_once", split + "_labels.npy"))
section_file = path.join(config.DATASET.ROOT, "splits", "section_" + split + ".txt")
@ -408,9 +446,17 @@ def test(*options, cfg=None, debug=False):
device,
running_metrics_overall,
config,
data_flow,
debug=debug,
)
if debug:
config_file_name = "default_config" if not cfg else cfg.split("/")[-1].split(".")[0]
fname = f"data_flow_test_{config_file_name}_{config.TRAIN.MODEL_DIR}.json"
with open(fname, "w") as f:
json.dump(data_flow, f, indent=1)
# FINAL TEST RESULTS:
score, class_iou = running_metrics_overall.get_scores()
@ -433,7 +479,6 @@ def test(*options, cfg=None, debug=False):
np.savetxt(path.join(log_dir, "confusion.csv"), confusion, delimiter=" ")
if debug:
config_file_name = "default_config" if not cfg else cfg.split("/")[-1].split(".")[0]
fname = f"metrics_test_{config_file_name}_{config.TRAIN.MODEL_DIR}.json"
with open(fname, "w") as fid:
json.dump(


@ -0,0 +1,4 @@
#!/bin/bash
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.
python test.py --cfg "configs/seresnet_unet.yaml"


@ -15,17 +15,20 @@ Time to run on single V100 for 300 epochs: 4.5 days
import json
import logging
import logging.config
import os
from os import path
import fire
import numpy as np
import torch
from torch.utils import data
from albumentations import Compose, HorizontalFlip, Normalize, PadIfNeeded, Resize
from ignite.contrib.handlers import CosineAnnealingScheduler
from ignite.contrib.handlers import ConcatScheduler, CosineAnnealingScheduler, LinearCyclicalScheduler
from ignite.engine import Events
from ignite.metrics import Loss
from ignite.utils import convert_tensor
from toolz import curry
from torch.utils import data
from cv_lib.event_handlers import SnapshotHandler, logging_handlers, tensorboard_handlers
from cv_lib.event_handlers.tensorboard_handlers import create_summary_writer
@ -33,7 +36,7 @@ from cv_lib.segmentation import extract_metric_from, models
from cv_lib.segmentation.dutchf3.engine import create_supervised_evaluator, create_supervised_trainer
from cv_lib.segmentation.dutchf3.utils import current_datetime, git_branch, git_hash
from cv_lib.segmentation.metrics import class_accuracy, class_iou, mean_class_accuracy, mean_iou, pixelwise_accuracy
from cv_lib.utils import load_log_configuration, generate_path
from cv_lib.utils import generate_path, load_log_configuration
from deepseismic_interpretation.dutchf3.data import get_patch_loader
from default import _C as config
from default import update_config
@ -47,7 +50,12 @@ def prepare_batch(batch, device=None, non_blocking=False):
)
def run(*options, cfg=None, debug=False):
@curry
def update_sampler_epoch(data_loader, engine):
data_loader.sampler.epoch = engine.state.epoch
def run(*options, cfg=None, local_rank=0, debug=False, input=None, distributed=False):
"""Run training and validation of model
Notes:
@ -62,30 +70,43 @@ def run(*options, cfg=None, debug=False):
default.py
cfg (str, optional): Location of config file to load. Defaults to None.
debug (bool): Places scripts in debug/test mode and only executes a few iterations
input (str, optional): Location of data if Azure ML run,
for local runs input is config.DATASET.ROOT
distributed (bool): This flag tells the training script to run in distributed mode
if more than one GPU exists.
"""
# Configuration:
update_config(config, options=options, config_file=cfg)
# The model will be saved under: outputs/<config_file_name>/<model_dir>
config_file_name = "default_config" if not cfg else cfg.split("/")[-1].split(".")[0]
try:
output_dir = generate_path(
config.OUTPUT_DIR, git_branch(), git_hash(), config_file_name, config.TRAIN.MODEL_DIR, current_datetime(),
)
except:
output_dir = generate_path(config.OUTPUT_DIR, config_file_name, config.TRAIN.MODEL_DIR, current_datetime(),)
# if AML training pipeline supplies us with input
if input is not None:
data_dir = input
output_dir = data_dir + config.OUTPUT_DIR
# Logging:
# Start logging
load_log_configuration(config.LOG_CONFIG)
logger = logging.getLogger(__name__)
logger.debug(config.WORKERS)
# Configuration:
update_config(config, options=options, config_file=cfg)
silence_other_ranks = True
world_size = int(os.environ.get("WORLD_SIZE", 1))
distributed = world_size > 1
if distributed:
# FOR DISTRIBUTED: Set the device according to local_rank.
torch.cuda.set_device(local_rank)
# FOR DISTRIBUTED: Initialize the backend. torch.distributed.launch will
# provide environment variables, and requires that you use init_method=`env://`.
torch.distributed.init_process_group(backend="nccl", init_method="env://")
logging.info(f"Started train.py using distributed mode.")
else:
logging.info(f"Started train.py using local mode.")
# Set CUDNN benchmark mode:
torch.backends.cudnn.benchmark = config.CUDNN.BENCHMARK
# We will write the model under outputs / config_file_name / model_dir
config_file_name = "default_config" if not cfg else cfg.split("/")[-1].split(".")[0]
# Fix random seeds:
torch.manual_seed(config.SEED)
if torch.cuda.is_available():
@ -125,41 +146,51 @@ def run(*options, cfg=None, debug=False):
# Training and Validation Loaders:
TrainPatchLoader = get_patch_loader(config)
logging.info(f"Using {TrainPatchLoader}")
train_set = TrainPatchLoader(
config.DATASET.ROOT,
config.DATASET.NUM_CLASSES,
split="train",
is_transform=True,
stride=config.TRAIN.STRIDE,
patch_size=config.TRAIN.PATCH_SIZE,
augmentations=train_aug,
debug=debug,
)
train_set = TrainPatchLoader(config, split="train", is_transform=True, augmentations=train_aug, debug=debug,)
logger.info(train_set)
n_classes = train_set.n_classes
val_set = TrainPatchLoader(
config.DATASET.ROOT,
config.DATASET.NUM_CLASSES,
split="val",
is_transform=True,
stride=config.TRAIN.STRIDE,
patch_size=config.TRAIN.PATCH_SIZE,
augmentations=val_aug,
debug=debug,
)
val_set = TrainPatchLoader(config, split="val", is_transform=True, augmentations=val_aug, debug=debug,)
logger.info(val_set)
if debug:
data_flow_dict = dict()
data_flow_dict["train_patch_loader_length"] = len(train_set)
data_flow_dict["validation_patch_loader_length"] = len(val_set)
data_flow_dict["train_input_shape"] = train_set.seismic.shape
data_flow_dict["train_label_shape"] = train_set.labels.shape
data_flow_dict["n_classes"] = n_classes
logger.info("Running in debug mode..")
train_set = data.Subset(train_set, range(config.TRAIN.BATCH_SIZE_PER_GPU * config.NUM_DEBUG_BATCHES))
val_set = data.Subset(val_set, range(config.VALIDATION.BATCH_SIZE_PER_GPU))
train_range = min(config.TRAIN.BATCH_SIZE_PER_GPU * config.NUM_DEBUG_BATCHES, len(train_set))
logging.info(f"train range in debug mode {train_range}")
train_set = data.Subset(train_set, range(train_range))
valid_range = min(config.VALIDATION.BATCH_SIZE_PER_GPU, len(val_set))
val_set = data.Subset(val_set, range(valid_range))
data_flow_dict["train_length_subset"] = len(train_set)
data_flow_dict["validation_length_subset"] = len(val_set)
train_sampler = torch.utils.data.distributed.DistributedSampler(train_set, num_replicas=world_size, rank=local_rank)
val_sampler = torch.utils.data.distributed.DistributedSampler(val_set, num_replicas=world_size, rank=local_rank)
train_loader = data.DataLoader(
train_set, batch_size=config.TRAIN.BATCH_SIZE_PER_GPU, num_workers=config.WORKERS, shuffle=True
train_set, batch_size=config.TRAIN.BATCH_SIZE_PER_GPU, num_workers=config.WORKERS, sampler=train_sampler,
)
val_loader = data.DataLoader(
val_set, batch_size=config.VALIDATION.BATCH_SIZE_PER_GPU, num_workers=1
) # config.WORKERS)
val_set, batch_size=config.VALIDATION.BATCH_SIZE_PER_GPU, num_workers=config.WORKERS, sampler=val_sampler
)
if debug:
data_flow_dict["train_loader_length"] = len(train_loader)
data_flow_dict["validation_loader_length"] = len(val_loader)
config_file_name = "default_config" if not cfg else cfg.split("/")[-1].split(".")[0]
fname = f"data_flow_train_{config_file_name}_{config.TRAIN.MODEL_DIR}.json"
with open(fname, "w") as f:
json.dump(data_flow_dict, f, indent=2)
# Model:
model = getattr(models, config.MODEL.NAME).get_seg_model(config)
@ -176,12 +207,26 @@ def run(*options, cfg=None, debug=False):
epochs_per_cycle = config.TRAIN.END_EPOCH // config.TRAIN.SNAPSHOTS
snapshot_duration = epochs_per_cycle * len(train_loader) if not debug else 2 * len(train_loader)
scheduler = CosineAnnealingScheduler(
optimizer, "lr", config.TRAIN.MAX_LR, config.TRAIN.MIN_LR, cycle_size=snapshot_duration
cosine_scheduler = CosineAnnealingScheduler(
optimizer,
"lr",
config.TRAIN.MAX_LR * world_size,
config.TRAIN.MIN_LR * world_size,
cycle_size=snapshot_duration,
)
# Tensorboard writer:
summary_writer = create_summary_writer(log_dir=path.join(output_dir, "logs"))
if distributed:
warmup_duration = 5 * len(train_loader)
warmup_scheduler = LinearCyclicalScheduler(
optimizer,
"lr",
start_value=config.TRAIN.MAX_LR,
end_value=config.TRAIN.MAX_LR * world_size,
cycle_size=10 * len(train_loader),
)
scheduler = ConcatScheduler(schedulers=[warmup_scheduler, cosine_scheduler], durations=[warmup_duration])
else:
scheduler = cosine_scheduler
# class weights are inversely proportional to the frequency of the classes in the training set
class_weights = torch.tensor(config.DATASET.CLASS_WEIGHTS, device=device, requires_grad=False)
@ -189,70 +234,97 @@ def run(*options, cfg=None, debug=False):
# Loss:
criterion = torch.nn.CrossEntropyLoss(weight=class_weights, ignore_index=255, reduction="mean")
# Model:
if distributed:
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[device], find_unused_parameters=True)
if silence_other_ranks & local_rank != 0:
logging.getLogger("ignite.engine.engine.Engine").setLevel(logging.WARNING)
# Ignite trainer and evaluator:
trainer = create_supervised_trainer(model, optimizer, criterion, prepare_batch, device=device)
trainer.add_event_handler(Events.ITERATION_STARTED, scheduler)
# Set to update the epoch parameter of our distributed data sampler so that we get
# different shuffles
trainer.add_event_handler(Events.EPOCH_STARTED, update_sampler_epoch(train_loader))
transform_fn = lambda output_dict: (output_dict["y_pred"].squeeze(), output_dict["mask"].squeeze())
evaluator = create_supervised_evaluator(
model,
prepare_batch,
metrics={
"nll": Loss(criterion, output_transform=transform_fn),
"nll": Loss(criterion, output_transform=transform_fn, device=device),
"pixacc": pixelwise_accuracy(n_classes, output_transform=transform_fn, device=device),
"cacc": class_accuracy(n_classes, output_transform=transform_fn),
"mca": mean_class_accuracy(n_classes, output_transform=transform_fn),
"ciou": class_iou(n_classes, output_transform=transform_fn),
"mIoU": mean_iou(n_classes, output_transform=transform_fn),
"cacc": class_accuracy(n_classes, output_transform=transform_fn, device=device),
"mca": mean_class_accuracy(n_classes, output_transform=transform_fn, device=device),
"ciou": class_iou(n_classes, output_transform=transform_fn, device=device),
"mIoU": mean_iou(n_classes, output_transform=transform_fn, device=device),
},
device=device,
)
trainer.add_event_handler(Events.ITERATION_STARTED, scheduler)
# Logging:
trainer.add_event_handler(
Events.ITERATION_COMPLETED, logging_handlers.log_training_output(log_interval=config.PRINT_FREQ),
)
trainer.add_event_handler(Events.EPOCH_COMPLETED, logging_handlers.log_lr(optimizer))
# The model will be saved under: outputs/<config_file_name>/<model_dir>
config_file_name = "default_config" if not cfg else cfg.split("/")[-1].split(".")[0]
try:
output_dir = generate_path(
config.OUTPUT_DIR, git_branch(), git_hash(), config_file_name, config.TRAIN.MODEL_DIR, current_datetime(),
)
except:
output_dir = generate_path(config.OUTPUT_DIR, config_file_name, config.TRAIN.MODEL_DIR, current_datetime(),)
# Tensorboard and Logging:
trainer.add_event_handler(Events.ITERATION_COMPLETED, tensorboard_handlers.log_training_output(summary_writer))
trainer.add_event_handler(Events.ITERATION_COMPLETED, tensorboard_handlers.log_validation_output(summary_writer))
if local_rank == 0: # Run only on master process
# Logging:
trainer.add_event_handler(
Events.ITERATION_COMPLETED, logging_handlers.log_training_output(log_interval=config.PRINT_FREQ),
)
trainer.add_event_handler(Events.EPOCH_STARTED, logging_handlers.log_lr(optimizer))
# Checkpointing: snapshotting trained models to disk
checkpoint_handler = SnapshotHandler(
output_dir,
config.MODEL.NAME,
extract_metric_from("mIoU"),
lambda: (trainer.state.iteration % snapshot_duration) == 0,
)
evaluator.add_event_handler(Events.EPOCH_COMPLETED, checkpoint_handler, {"model": model})
# Tensorboard and Logging:
summary_writer = create_summary_writer(log_dir=path.join(output_dir, "logs"))
trainer.add_event_handler(Events.EPOCH_STARTED, tensorboard_handlers.log_lr(summary_writer, optimizer, "epoch"))
trainer.add_event_handler(Events.ITERATION_COMPLETED, tensorboard_handlers.log_training_output(summary_writer))
trainer.add_event_handler(
Events.ITERATION_COMPLETED, tensorboard_handlers.log_validation_output(summary_writer)
)
# add specific logger which also triggers printed metrics on training set
@trainer.on(Events.EPOCH_COMPLETED)
def log_training_results(engine):
evaluator.run(train_loader)
tensorboard_handlers.log_results(engine, evaluator, summary_writer, n_classes, stage="Training")
logging_handlers.log_metrics(engine, evaluator, stage="Training")
if local_rank == 0: # Run only on master process
tensorboard_handlers.log_results(engine, evaluator, summary_writer, n_classes, stage="Training")
logging_handlers.log_metrics(engine, evaluator, stage="Training")
logger.info("Logging training results..")
# add specific logger which also triggers printed metrics on validation set
@trainer.on(Events.EPOCH_COMPLETED)
def log_validation_results(engine):
evaluator.run(val_loader)
tensorboard_handlers.log_results(engine, evaluator, summary_writer, n_classes, stage="Validation")
logging_handlers.log_metrics(engine, evaluator, stage="Validation")
# dump validation set metrics at the very end for debugging purposes
if engine.state.epoch == config.TRAIN.END_EPOCH and debug:
fname = f"metrics_{config_file_name}_{config.TRAIN.MODEL_DIR}.json"
metrics = evaluator.state.metrics
out_dict = {x: metrics[x] for x in ["nll", "pixacc", "mca", "mIoU"]}
with open(fname, "w") as fid:
json.dump(out_dict, fid)
log_msg = " ".join(f"{k}: {out_dict[k]}" for k in out_dict.keys())
logging.info(log_msg)
# Checkpointing: snapshotting trained models to disk
checkpoint_handler = SnapshotHandler(
output_dir,
config.MODEL.NAME,
extract_metric_from("mIoU"),
lambda: (trainer.state.iteration % snapshot_duration) == 0,
)
evaluator.add_event_handler(Events.EPOCH_COMPLETED, checkpoint_handler, {"model": model})
if local_rank == 0: # Run only on master process
tensorboard_handlers.log_results(engine, evaluator, summary_writer, n_classes, stage="Validation")
logging_handlers.log_metrics(engine, evaluator, stage="Validation")
logger.info("Logging validation results..")
# dump validation set metrics at the very end for debugging purposes
if engine.state.epoch == config.TRAIN.END_EPOCH and debug:
fname = f"metrics_{config_file_name}_{config.TRAIN.MODEL_DIR}.json"
metrics = evaluator.state.metrics
out_dict = {x: metrics[x] for x in ["nll", "pixacc", "mca", "mIoU"]}
with open(fname, "w") as fid:
json.dump(out_dict, fid)
log_msg = " ".join(f"{k}: {out_dict[k]}" for k in out_dict.keys())
logging.info(log_msg)
logger.info("Starting training")
trainer.run(train_loader, max_epochs=config.TRAIN.END_EPOCH, epoch_length=len(train_loader), seed=config.SEED)
summary_writer.close()
if local_rank == 0:
summary_writer.close()
if __name__ == "__main__":


@ -0,0 +1,4 @@
#!/bin/bash
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.
nohup python train.py --cfg "configs/seresnet_unet.yaml" > train.log 2>&1


@ -0,0 +1,10 @@
#!/bin/bash
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.
NGPUS=$(nvidia-smi -L | wc -l)
if [ "$NGPUS" -lt "2" ]; then
echo "ERROR: cannot run distributed training without 2 or more GPUs."
exit 1
fi
nohup python -m torch.distributed.launch --nproc_per_node=$NGPUS train.py \
--distributed --cfg "configs/seresnet_unet.yaml" > train_distributed.log 2>&1 &


@ -0,0 +1,192 @@
# Integrating with AzureML
## AzureML Pipeline Background
Azure Machine Learning is a cloud-based environment you can use to train, deploy, automate, manage, and track ML models.
An Azure Machine Learning pipeline is an independently executable workflow of a complete machine learning task. Subtasks are encapsulated as a series of steps within the pipeline. An Azure Machine Learning pipeline can be as simple as one that calls a Python script, so may do just about anything. Pipelines should focus on machine learning tasks such as:
- Data preparation including importing, validating and cleaning, munging and transformation, normalization, and staging
- Training configuration including parameterizing arguments, filepaths, and logging / reporting configurations
- Training and validating efficiently and repeatedly. Efficiency might come from specifying specific data subsets, different hardware compute resources, distributed processing, and progress monitoring
- Deployment, including versioning, scaling, provisioning, and access control
An Azure ML pipeline performs a complete logical workflow with an ordered sequence of steps. Each step is a discrete processing action. Pipelines run in the context of an Azure Machine Learning Experiment.
In the early stages of an ML project, it's fine to have a single Jupyter notebook or Python script that does all the work of Azure workspace and resource configuration, data preparation, run configuration, training, and validation. But just as functions and classes quickly become preferable to a single imperative block of code, ML workflows quickly become preferable to a monolithic notebook or script.
By modularizing ML tasks, pipelines support the Computer Science imperative that a component should "do (only) one thing well." Modularity is clearly vital to project success when programming in teams, but even when working alone, even a small ML project involves separate tasks, each with a good amount of complexity. Tasks include: workspace configuration and data access, data preparation, model definition and configuration, and deployment. While the outputs of one or more tasks form the inputs to another, the exact implementation details of any one task are, at best, irrelevant distractions in the next. At worst, the computational state of one task can cause a bug in another.
There are many ways to leverage AzureML. Currently, DeepSeismic integrates with AzureML through a training pipeline, which creates an experiment titled "DEV-train-pipeline" containing all training runs, associated logs, and the ability to navigate seamlessly through this information. AzureML will take data from a blob storage account, and the associated models will be saved to this account upon completion of the run.
Please refer to the Microsoft docs for additional information on AzureML pipelines and related capabilities: ['What are Azure Machine Learning pipelines?'](https://docs.microsoft.com/en-us/azure/machine-learning/concept-ml-pipelines)
## Files needed for this AzureML run
You will need the following files to complete a run in AzureML:
- [.azureml/config.json](../../../.azureml.example/config.json) This is used to import your subscription, resource group, and AzureML workspace
- [.env](../../../.env.example) This is used to import your environment variables including blob storage information and AzureML compute cluster specs
- [kickoff_train_pipeline.py](dev/kickoff_train_pipeline.py) This script shows how to run an AzureML train pipeline
- [cancel_run.py](dev/cancel_run.py) This script is used to cancel an AzureML train pipeline run
- [base_pipeline.py](base_pipeline.py) This script is used as a base class and train_pipeline.py inherits from it. This is intended to be a helpful abstraction that a future addition of an inference pipeline can leverage
- [train_pipeline.py](train_pipeline.py) This script inherits from base_pipeline.py and is used to construct the pipeline and its steps. The script kickoff_train_pipeline.py will call the function defined here along with the pipeline_config
- [pipeline_config.json](pipeline_config.json) This pipeline configuration specifies the steps of the pipeline, the location of the data, and any specific arguments. This is consumed once kickoff_train_pipeline.py is run
- [train.py](../../../experiments/interpretation/dutchf3_patch/train.py) This is the training script that is used to train the model
- [unet.yaml](../../../experiments/interpretation/dutchf3_patch/configs/unet.yaml) This config specifies the model configuration to be used in train.py and is referenced in the pipeline_config.json
- [azureml_requirements.txt](../../../experiments/interpretation/dutchf3_patch/azureml_requirements.txt) This file holds all dependencies for train.py so they can be installed on the compute in Azure ML
- [logging.config](../../../experiments/interpretation/dutchf3_patch/logging.config) This logging config is used to set up logging
- local environment with cv_lib and interpretation set up using guidance [here](../../../README.md)
## Running a Pipeline in AzureML
Go into the [Azure Portal](https://portal.azure.com) and create a blob storage account. Once you have created a [blob storage](https://azure.microsoft.com/en-us/services/storage/blobs/) account, you may use [Azure Storage Explorer](https://docs.microsoft.com/en-us/azure/vs-azure-tools-storage-manage-with-storage-explorer?tabs=windows) to manage your blob instance. You can either manually upload data through Azure Storage Explorer, or you can use [AzCopy](https://docs.microsoft.com/en-us/azure/storage/common/storage-use-azcopy-v10) to migrate the data to your blob storage. Once your blob storage is set up and the data migrated, you may begin to fill in the environment variables below. There is an example [.env file](../../../.env.example) that you may leverage. More information on how to activate these environment variables is below.
With your run you will need to specify the compute below. Once you populate these variables, AzureML will use run-based compute creation, which means that the compute will be created by AzureML at run time specifically for your run. The compute is deleted automatically once the run completes. With AzureML you also have the option of creating and attaching your own compute. For more information on run-based compute creation and persistent compute, please refer to the [Azure Machine Learning Compute](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-set-up-training-targets) section in the Microsoft docs.
`AML_COMPUTE_CLUSTER_SKU` refers to the VM family of the nodes created by Azure Machine Learning Compute. If not specified, it defaults to Standard_NC6. For compute options see [HardwareProfile object values](https://docs.microsoft.com/en-us/azure/templates/Microsoft.Compute/2019-07-01/virtualMachines?toc=%2Fen-us%2Fazure%2Fazure-resource-manager%2Ftoc.json&bc=%2Fen-us%2Fazure%2Fbread%2Ftoc.json#hardwareprofile-object).
`AML_COMPUTE_CLUSTER_MAX_NODES` refers to the max number of nodes to autoscale up to when you run a job on Azure Machine Learning Compute. This is not the max number of nodes for multi-node training; instead, it is the number of nodes available to process single-node jobs.
If you would like additional information regarding the AzureML compute provisioning class, please refer to the Microsoft docs on the [AzureML compute provisioning class](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.compute.amlcompute.amlcomputeprovisioningconfiguration?view=azure-ml-py).
Set the following environment variables:
```
BLOB_ACCOUNT_NAME
BLOB_CONTAINER_NAME
BLOB_ACCOUNT_KEY
BLOB_SUB_ID
AML_COMPUTE_CLUSTER_NAME
AML_COMPUTE_CLUSTER_MIN_NODES
AML_COMPUTE_CLUSTER_MAX_NODES
AML_COMPUTE_CLUSTER_SKU
```
On Linux:
`export VARIABLE=value`
Our code can pick up the environment variables from the .env file; alternatively, you can `source .env` to activate these variables in your environment. An example .env file is found at the ROOT of this repo [here](../../../.env.example). You can rename this to .env and use it as your .env file, but be sure to add it to your .gitignore to ensure you do not commit any secrets.
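For example, here is a minimal Python sketch of picking these variables up from a `.env` file using the `python-dotenv` package (whether the repo uses exactly this mechanism is an assumption; the fallback SKU mirrors the Standard_NC6 default mentioned above):
```python
import os
from dotenv import load_dotenv  # pip install python-dotenv

# load KEY=value pairs from a .env file in the current directory into os.environ
load_dotenv()

blob_account = os.environ["BLOB_ACCOUNT_NAME"]
cluster_sku = os.environ.get("AML_COMPUTE_CLUSTER_SKU", "Standard_NC6")
print(f"Using blob account '{blob_account}' and cluster SKU '{cluster_sku}'")
```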
You will be able to download a config.json that will already have your subscription id, resource group, and workspace name directly in the [Azure Portal](https://portal.azure.com). You will want to navigate to your AzureML workspace and then you can click the `Download config.json` option towards the top left of the browser. Once you do this you can rename the .azureml.example folder to .azureml and replace the config.json with your downloaded config.json. If you would prefer to migrate the information manually refer to the guidance below.
Create a .azureml/config.json file in the project's root directory that looks like so:
```json
{
"subscription_id": "<subscription id>",
"resource_group": "<resource group>",
"workspace_name": "<workspace name>"
}
```
At the ROOT of this repo you will find an example [here](../../../.azureml.example/config.json). Please rename the file to .azureml/config.json, enter your account information, and add it to your .gitignore.
## Training Pipeline
Here's an example of a possible pipeline configuration file:
```json
{
"step1":
{
"type": "MpiStep",
"name": "train step",
"script": "train.py",
"input_datareference_path": "normalized_data/",
"input_datareference_name": "normalized_data_conditioned",
"input_dataset_name": "normalizeddataconditioned",
"source_directory": "train/",
"arguments": ["--splits", "splits",
"--train_data_paths", "normalized_data/file.npy",
"--label_paths", "label.npy"],
"requirements": "train/requirements.txt",
"node_count": 1,
"processes_per_node": 1,
"base_image": "pytorch/pytorch"
}
}
```
If you want to create a train pipeline, make sure that:
1) All of your steps are isolated
- Your scripts will need to conform to the interface you define in the pipeline configuration file, i.e. if step1 is expected to output X and step2 is expecting X as an input, your scripts need to reflect that
- If one of your steps has pip package dependencies, make sure they are specified in a requirements.txt file
- If your script has local dependencies (i.e. it imports from another script), make sure that all dependencies fall underneath the source_directory
2) You have configured your pipeline configuration file to specify the steps needed (see the "Configuring a Pipeline" section below for guidance)
Note: the following arguments are automatically added to any script steps by AzureML:
```--input_data``` and ```--output``` (if output is specified in the pipeline_config.json)
Make sure to add these arguments in your scripts like so:
```python
parser.add_argument('--input_data', type=str, help='path to preprocessed data')
parser.add_argument('--output', type=str, help='output from training')
```
```input_data``` is the absolute path to the input_datareference_path on the blob you specified.
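As a rough sketch of how a step script might consume these arguments (the argument names match the snippet above; the file name file.npy is taken from the example configuration and is illustrative only):
```python
import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument('--input_data', type=str, help='path to preprocessed data')
parser.add_argument('--output', type=str, help='output from training')
args, _ = parser.parse_known_args()   # tolerate any extra arguments AzureML appends

# read inputs from the mounted data reference
train_data = os.path.join(args.input_data, 'file.npy')

# write artifacts under the AzureML-provided output location, if one was configured
if args.output is not None:
    os.makedirs(args.output, exist_ok=True)
```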
# Configuring a Pipeline
## Train Pipeline
Define parameters for the run in a pipeline configuration file. See an example in this repo [here](pipeline_config.json). For additional guidance on [pipeline steps](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-create-your-first-pipeline#steps) please refer to Microsoft docs.
```json
{
"step1":
{
"type": "<type of step. Supported types include PythonScriptStep and MpiStep>",
"name": "<name in AzureML for this step>",
"script": "<path to script for this step>",
"output": "<name of the output in AzureML for this step - optional>",
"input_datareference_path": "<path on the data reference for the input data - optional>",
"input_datareference_name": "<name of the data reference in AzureML where the input data lives - optional>",
"input_dataset_name": "<name of the datastore in AzureML - optional>",
"source_directory": "<source directory containing the files for this step>",
"arguments": "<arguments to pass to the script - optional>",
"requirements": "<path to the requirements.txt file for the step - optional>",
"node_count": "<number of nodes to run the script on - optional>",
"processes_per_node": "<number of processes to run on each node - optional>",
"base_image": "<name of an image registered on dockerhub that you want to use as your base image"
},
"step2":
{
.
.
.
}
}
```
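As an illustration of how steps can be chained (all names and paths below are hypothetical): if a step defines an `output` and the following step omits `input_datareference_name` and `input_datareference_path`, the pipeline wires the previous step's output in as the next step's input.
```json
{
    "step1":
    {
        "type": "PythonScriptStep",
        "name": "preprocess step",
        "script": "preprocess.py",
        "output": "preprocessed_data",
        "input_datareference_path": "raw_data/",
        "input_datareference_name": "raw_data",
        "input_dataset_name": "rawdatastore",
        "source_directory": "preprocess/",
        "requirements": "preprocess/requirements.txt"
    },
    "step2":
    {
        "type": "MpiStep",
        "name": "train step",
        "script": "train.py",
        "source_directory": "train/",
        "requirements": "train/requirements.txt",
        "node_count": 1,
        "processes_per_node": 1
    }
}
```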
## Kicking off a Pipeline
In order to kick off a pipeline, you will need to use the Azure CLI to log in to the subscription where your workspace resides. Once you successfully log in, all of the subscriptions you have access to are printed out. You can either get your subscription id this way, or you can go directly to the Azure portal, navigate to your subscriptions, and locate the right subscription id to pass into `az account set -s`:
```bash
az login
az account set -s <subscription id>
```
Kick off the training pipeline defined in your config from your Python environment of choice. First activate a local environment that has cv_lib and interpretation set up, using the guidance [here](../../../README.md). You will run the kickoff for the training pipeline from the ROOT directory. The code will look like this:
```python
from src.azml.train_pipeline.train_pipeline import TrainPipeline
orchestrator = TrainPipeline("<path to your pipeline configuration file>")
orchestrator.construct_pipeline()
run = orchestrator.run_pipeline(experiment_name="DEV-train-pipeline")
```
See an example in [dev/kickoff_train_pipeline.py](dev/kickoff_train_pipeline.py).
If you run into a subscription access error, you may find a workaround in the [Troubleshooting](#troubleshooting) section.
## Cancelling a Pipeline Run
If you kicked off a pipeline and want to cancel it, run the [cancel_run.py](dev/cancel_run.py) script with the corresponding run_id and step_id. Both ids are printed when the pipeline run is kicked off, and you can also find them when viewing your run in the [Azure Portal](https://portal.azure.com/). If you prefer, you can cancel your run directly in the portal instead.
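For example, assuming you run it from the directory containing this README, cancellation from the command line might look like this (both ids are placeholders):
```bash
python dev/cancel_run.py --run_id <run id> --step_id <step id>
```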
## Troubleshooting
If you run into issues gaining access to the Azure ML subscription, you may be able to connect by using a workaround:
Go to [base_pipeline.py](../base_pipeline.py) and add the following import:
```python
from azureml.core.authentication import AzureCliAuthentication
```
Then find the code where we connect to the workspace which looks like this:
```python
self.ws = Workspace.from_config(path=ws_config)
```
and replace it with this:
```python
cli_auth = AzureCliAuthentication()
self.ws = Workspace(subscription_id="<subscription id>", resource_group="<resource group>", workspace_name="<workspace name>", auth=cli_auth)
```
To get this to run, you will also need to `pip install azure-cli-core`.
Then you can go back and follow the instructions above, including `az login` and setting the subscription, and kick off the pipeline.


@ -0,0 +1,2 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.


@ -0,0 +1,388 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.
"""
base class for constructing and running an azureml pipeline and some of the
accompanying resources.
"""
from azureml.core import Datastore, Workspace, RunConfiguration
from azureml.core.model import Model
from azureml.core.compute import AmlCompute, ComputeTarget
from azureml.core.dataset import Dataset
from azureml.core.experiment import Experiment
from azureml.pipeline.steps import PythonScriptStep, MpiStep
from azureml.pipeline.core import Pipeline, PipelineData, StepSequence
from azureml.contrib.pipeline.steps import ParallelRunStep, ParallelRunConfig
from azureml.core.runconfig import DEFAULT_GPU_IMAGE
from azureml.core.conda_dependencies import CondaDependencies
from msrest.exceptions import HttpOperationError
from azureml.data.data_reference import DataReference
from azureml.core import Environment
from dotenv import load_dotenv
import os
import re
from abc import ABC, abstractmethod
import json
class DeepSeismicAzMLPipeline(ABC):
"""
Abstract base class for pipelines in AzureML
"""
def __init__(self, pipeline_config, ws_config=None):
"""
constructor for DeepSeismicAzMLPipeline class
:param str pipeline_config: [required] path to the pipeline config file
:param str ws_config: [optional] if not specified, will look for
.azureml/config.json. If you have multiple config files, you
can specify which workspace you want to use by passing the
relative path to the config file in this constructor.
"""
self.ws = Workspace.from_config(path=ws_config)
self._load_environment()
self._load_config(pipeline_config)
self.steps = []
self.pipeline_tags = None
self.last_output_data = None
def _load_config(self, config_path):
"""
helper function for loading in pipeline config file.
:param str config_path: path to the pipeline config file
"""
try:
with open(config_path, "r") as f:
self.config = json.load(f)
except Exception as e:
raise Exception("Was unable to load pipeline config file. {}".format(e))
@abstractmethod
def construct_pipeline(self):
"""
abstract method for constructing a pipeline. Must be implemented by classes
that inherit from this base class.
"""
raise NotImplementedError("construct_pipeline is not implemented")
@abstractmethod
def _setup_steps(self):
"""
abstract method for setting up pipeline steps. Must be implemented by classes
that inherit from this base class.
"""
raise NotImplementedError("setup_steps is not implemented")
def _load_environment(self):
"""
loads environment variables needed for the pipeline.
"""
load_dotenv()
self.account_name = os.getenv("BLOB_ACCOUNT_NAME")
self.container_name = os.getenv("BLOB_CONTAINER_NAME")
self.account_key = os.getenv("BLOB_ACCOUNT_KEY")
self.blob_sub_id = os.getenv("BLOB_SUB_ID")
self.comp_name = os.environ.get("AML_COMPUTE_CLUSTER_NAME")
self.comp_min_nodes = os.environ.get("AML_COMPUTE_CLUSTER_MIN_NODES")
self.comp_max_nodes = os.environ.get("AML_COMPUTE_CLUSTER_MAX_NODES")
self.comp_vm_size = os.environ.get("AML_COMPUTE_CLUSTER_SKU")
def _setup_model(self, model_name, model_path=None):
"""
sets up the model in azureml. Either retrieves an already registered model
or registers a local model.
:param str model_name: [required] name of the model that you want to retrieve
from the workspace or the name you want to give the local
model when you register it.
:param str model_path: [optional] If you do not have a model registered, pass
the relative path to the model locally and it will be
registered.
"""
self.model = None
models = Model.list(self.ws, name=model_name)
for model in models:
if model.name == model_name:
self.model = model
print("Found model: " + self.model.name)
break
if model_path is not None:
self.model = Model.register(model_path=model_path, model_name=model_name, workspace=self.ws)
if self.model is None:
raise Exception(
"""no model was found or registered. Ensure that you
have a model registered in this workspace or that
you passed the path of a local model"""
)
def _setup_datastore(self, blob_dataset_name, output_path=None):
"""
sets up the datastore in azureml. Either retrieves a pre-existing datastore
or registers a new one in the workspace.
:param str blob_dataset_name: [required] name of the datastore registered with the
workspace. If the datastore does not yet exist, the
name it will be registered under.
:param str output_path: [optional] if registering a datastore for inferencing,
the output path for writing back predictions.
"""
try:
self.blob_ds = Datastore.get(self.ws, blob_dataset_name)
print("Found Blob Datastore with name: %s" % blob_dataset_name)
except HttpOperationError:
self.blob_ds = Datastore.register_azure_blob_container(
workspace=self.ws,
datastore_name=blob_dataset_name,
account_name=self.account_name,
container_name=self.container_name,
account_key=self.account_key,
subscription_id=self.blob_sub_id,
)
print("Registered blob datastore with name: %s" % blob_dataset_name)
if output_path is not None:
self.output_dir = PipelineData(
name="output", datastore=self.ws.get_default_datastore(), output_path_on_compute=output_path
)
def _setup_dataset(self, ds_name, data_paths):
"""
registers datasets with azureml workspace
:param str ds_name: [required] name to give the dataset in azureml.
:param str data_paths: [required] list of paths to your data on the datastore.
"""
self.named_ds = []
count = 1
for data_path in data_paths:
curr_name = ds_name + str(count)
path_on_datastore = self.blob_ds.path(data_path)
input_ds = Dataset.File.from_files(path=path_on_datastore, validate=False)
try:
registered_ds = input_ds.register(workspace=self.ws, name=curr_name, create_new_version=True)
except Exception as e:
n, v = self._parse_exception(e)
registered_ds = Dataset.get_by_name(self.ws, name=n, version=v)
self.named_ds.append(registered_ds.as_named_input(curr_name))
count = count + 1
def _setup_datareference(self, name, path):
"""
helper function to setup a datareference object in AzureML.
:param str name: [required] name of the data reference
:param str path: [required] path on the datastore where the data lives.
:returns: input_data
:rtype: DataReference
"""
input_data = DataReference(datastore=self.blob_ds, data_reference_name=name, path_on_datastore=path)
return input_data
def _setup_pipelinedata(self, name, output_path=None):
"""
helper function to setup a PipelineData object in AzureML
:param str name: [required] name of the data object in AzureML
:param str output_path: path on output datastore to write data to
:returns: output_data
:rtype: PipelineData
"""
if output_path is not None:
output_data = PipelineData(
name=name,
datastore=self.blob_ds,
output_name=name,
output_mode="mount",
output_path_on_compute=output_path,
is_directory=True,
)
else:
output_data = PipelineData(name=name, datastore=self.ws.get_default_datastore(), output_name=name)
return output_data
def _setup_compute(self):
"""
sets up the compute in the azureml workspace. Either retrieves a
pre-existing compute target or creates one (uses environment variables).
:returns: compute_target
:rtype: ComputeTarget
"""
if self.comp_name in self.ws.compute_targets:
self.compute_target = self.ws.compute_targets[self.comp_name]
if self.compute_target and type(self.compute_target) is AmlCompute:
print("Found compute target: " + self.comp_name)
else:
print("creating a new compute target...")
p_cfg = AmlCompute.provisioning_configuration(
vm_size=self.comp_vm_size, min_nodes=self.comp_min_nodes, max_nodes=self.comp_max_nodes
)
self.compute_target = ComputeTarget.create(self.ws, self.comp_name, p_cfg)
self.compute_target.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)
print(self.compute_target.get_status().serialize())
return self.compute_target
def _get_conda_deps(self, step):
"""
converts requirements.txt from user into conda dependencies for AzML
:param dict step: step defined by user that we are currently building
:returns: conda_dependencies
:rtype: CondaDependencies
"""
with open(step["requirements"], "r") as f:
packages = [line.strip() for line in f]
return CondaDependencies.create(pip_packages=packages)
def _setup_env(self, step):
"""
sets up AzML env given requirements defined by the user
:param dict step: step defined by user that we are currently building
:returns: env
:rtype: Environment
"""
conda_deps = self._get_conda_deps(step)
env = Environment(name=step["name"] + "_environment")
env.docker.enabled = True
env.docker.base_image = DEFAULT_GPU_IMAGE
env.spark.precache_packages = False
env.python.conda_dependencies = conda_deps
env.python.conda_dependencies.add_conda_package("pip==20.0.2")
return env
def _generate_run_config(self, step):
"""
generates an AzML run config if the user gives specifics about requirements
:param dict step: step defined by user that we are currently building
:returns: run_config
:rtype: RunConfiguration
"""
try:
conda_deps = self._get_conda_deps(step)
conda_deps.add_conda_package("pip==20.0.2")
return RunConfiguration(script=step["script"], conda_dependencies=conda_deps)
except KeyError:
return None
def _generate_parallel_run_config(self, step):
"""
generates an AzML parallel run config if the user gives specifics about requirements
:param dict step: step defined by user that we are currently building
:returns: parallel_run_config
:rtype: ParallelRunConfig
"""
return ParallelRunConfig(
source_directory=step["source_directory"],
entry_script=step["script"],
mini_batch_size=str(step["mini_batch_size"]),
error_threshold=10,
output_action="summary_only",
environment=self._setup_env(step),
compute_target=self.compute_target,
node_count=step.get("node_count", 1),
process_count_per_node=step.get("processes_per_node", 1),
run_invocation_timeout=60,
)
def _create_pipeline_step(self, step, arguments, input_data, output=None, run_config=None):
"""
function to create an AzureML pipeline step and append it to the list of
steps that will make up the pipeline.
:param dict step: [required] dictionary containing the config parameters for this step.
:param list arguments: [required] list of arguments to be passed to the step.
:param DataReference input_data: [required] the input_data in AzureML for this step.
:param DataReference output: [optional] output location in AzureML
:param ParallelRunConfig run_config: [optional] the run configuration for a MpiStep
"""
if step["type"] == "PythonScriptStep":
run_config = self._generate_run_config(step)
pipeline_step = PythonScriptStep(
script_name=step["script"],
arguments=arguments,
inputs=[input_data],
outputs=output,
name=step["name"],
compute_target=self.compute_target,
source_directory=step["source_directory"],
allow_reuse=True,
runconfig=run_config,
)
elif step["type"] == "MpiStep":
pipeline_step = MpiStep(
name=step["name"],
source_directory=step["source_directory"],
arguments=arguments,
inputs=[input_data],
node_count=step.get("node_count", 1),
process_count_per_node=step.get("processes_per_node", 1),
compute_target=self.compute_target,
script_name=step["script"],
environment_definition=self._setup_env(step),
)
elif step["type"] == "ParallelRunStep":
run_config = self._generate_parallel_run_config(step)
pipeline_step = ParallelRunStep(
name=step["name"],
models=[self.model],
parallel_run_config=run_config,
inputs=input_data,
output=output,
arguments=arguments,
allow_reuse=False,
)
else:
raise Exception("Pipeline step type {} not supported".format(step["type"]))
self.steps.append(pipeline_step)
def run_pipeline(self, experiment_name, tags=None):
"""
submits the pipeline as an experiment run
:param str experiment_name: [required] name of the experiment in azureml
:param dict tags: [optional] dictionary of tags
:returns: run
:rtype: Run
"""
if tags is None:
tags = self.pipeline_tags
step_sequence = StepSequence(steps=self.steps)
pipeline = Pipeline(workspace=self.ws, steps=step_sequence)
run = Experiment(self.ws, experiment_name).submit(pipeline, tags=tags, continue_on_step_failure=False)
return run
def _parse_exception(self, e):
"""
helper function to parse exception thrown by azureml
:param Exception e: [required] the exception to be parsed
:returns: name, version
:rtype: str, str
"""
s = str(e)
result = re.search('name="(.*)"', s)
name = result.group(1)
version = s[s.find("version=") + 8 : s.find(")")]
return name, version


@ -0,0 +1,21 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.
"""
Cancel pipeline run
"""
from azureml.core.run import Run
from azureml.core import Workspace, Experiment
import argparse
parser = argparse.ArgumentParser()
parser.add_argument("--run_id", type=str, help="run id value", required=True)
parser.add_argument("--step_id", type=str, help="step id value", required=True)
args = parser.parse_args()
ws = Workspace.from_config()
experiment = Experiment(workspace=ws, name="DEV-train-pipeline", _id=args.run_id)
fetched_run = Run(experiment=experiment, run_id=args.step_id)
fetched_run.cancel()


@ -0,0 +1,32 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.
"""
Create pipeline and kickoff run
"""
from deepseismic_interpretation.azureml_pipelines.train_pipeline import TrainPipeline
import fire
def kickoff_pipeline(
experiment="DEV-train-pipeline",
orchestrator_config="interpretation/deepseismic_interpretation/azureml_pipelines/pipeline_config.json",
):
"""Kicks off pipeline run
Args:
experiment (str): name of experiment
orchestrator_config (str): path to pipeline configuration
"""
orchestrator = TrainPipeline(orchestrator_config)
orchestrator.construct_pipeline()
run = orchestrator.run_pipeline(experiment_name=experiment)
if __name__ == "__main__":
"""Example:
python interpretation/deepseismic_interpretation/azureml_pipelines/dev/kickoff_train_pipeline.py --experiment=DEV-train-pipeline-name --orchestrator_config="interpretation/deepseismic_interpretation/azureml_pipelines/pipeline_config.json"
or
python interpretation/deepseismic_interpretation/azureml_pipelines/dev/kickoff_train_pipeline.py
"""
fire.Fire(kickoff_pipeline)


@ -0,0 +1,25 @@
{
"step1": {
"type": "MpiStep",
"name": "train step",
"script": "train.py",
"input_datareference_path": "data/",
"input_datareference_name": "ds_test",
"input_dataset_name": "deepseismic_test_dataset",
"source_directory": "experiments/interpretation/dutchf3_patch",
"arguments": [
"--cfg",
"configs/unet.yaml",
"TRAIN.END_EPOCH",
"1",
"TRAIN.SNAPSHOTS",
"1",
"DATASET.ROOT",
"data"
],
"requirements": "experiments/interpretation/dutchf3_patch/azureml_requirements.txt",
"node_count": 1,
"processes_per_node": 1,
"base_image": "pytorch/pytorch"
}
}


@ -0,0 +1,55 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.
"""
TrainPipeline class for setting up a training pipeline in AzureML.
Inherits from DeepSeismicAzMLPipeline
"""
from deepseismic_interpretation.azureml_pipelines.base_pipeline import DeepSeismicAzMLPipeline
class TrainPipeline(DeepSeismicAzMLPipeline):
def construct_pipeline(self):
"""
implemented function from ABC. Sets up the pre-requisites for a pipeline.
"""
self._setup_compute()
self._setup_datastore(blob_dataset_name=self.config["step1"]["input_dataset_name"])
self._setup_steps()
def _setup_steps(self):
"""
iterates over all the steps in the config file and sets each one up along
with its accompanying objects.
"""
for _, step in self.config.items():
try:
input_data = self._setup_datareference(
name=step["input_datareference_name"], path=step["input_datareference_path"]
)
except KeyError:
# grab the last step's output as input for this step
if self.last_output_data is None:
raise KeyError(
"input_datareference_name and input_datareference_path can only be"
"omitted if there is a previous step in the pipeline"
)
else:
input_data = self.last_output_data
try:
self.last_output_data = self._setup_pipelinedata(
name=step["output"], output_path=step.get("output_path", None)
)
except KeyError:
self.last_output_data = None
script_params = step["arguments"] + ["--input", input_data]
if self.last_output_data is not None:
script_params = script_params + ["--output", self.last_output_data]
self._create_pipeline_step(
step=step, arguments=script_params, input_data=input_data, output=self.last_output_data
)


@ -0,0 +1,2 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.


@ -117,20 +117,21 @@ def read_labels(fname, data_info):
class SectionLoader(data.Dataset):
"""
Base class for section data loader
:param str data_dir: Root directory for training/test data
:param str n_classes: number of segmentation mask classes
:param config: configuration object to define other attributes in loaders
:param str split: split file to use for loading patches
:param bool is_transform: Transform patch to dimensions expected by PyTorch
:param list augmentations: Data augmentations to apply to patches
:param bool debug: enable debugging output
"""
def __init__(self, data_dir, n_classes, split="train", is_transform=True, augmentations=None, debug=False):
def __init__(self, config, split="train", is_transform=True, augmentations=None, debug=False):
self.data_dir = config.DATASET.ROOT
self.n_classes = config.DATASET.NUM_CLASSES
self.MIN = config.DATASET.MIN
self.MAX = config.DATASET.MAX
self.split = split
self.data_dir = data_dir
self.is_transform = is_transform
self.augmentations = augmentations
self.n_classes = n_classes
self.sections = list()
self.debug = debug
@ -152,10 +153,10 @@ class SectionLoader(data.Dataset):
im, lbl = _transform_WH_to_HW(im), _transform_WH_to_HW(lbl)
if self.debug and "test" in self.split:
outdir = f"debug/sectionLoader_{self.split}_raw"
outdir = f"debug/test/sectionLoader_{self.split}_raw"
generate_path(outdir)
path_prefix = f"{outdir}/index_{index}_section_{section_name}"
image_to_disk(im, path_prefix + "_img.png")
image_to_disk(im, path_prefix + "_img.png", self.MIN, self.MAX)
mask_to_disk(lbl, path_prefix + "_lbl.png", self.n_classes)
if self.augmentations is not None:
@ -166,10 +167,10 @@ class SectionLoader(data.Dataset):
im, lbl = self.transform(im, lbl)
if self.debug and "test" in self.split:
outdir = f"debug/sectionLoader_{self.split}_{'aug' if self.augmentations is not None else 'noaug'}"
outdir = f"debug/test/sectionLoader_{self.split}_{'aug' if self.augmentations is not None else 'noaug'}"
generate_path(outdir)
path_prefix = f"{outdir}/index_{index}_section_{section_name}"
image_to_disk(np.array(im[0]), path_prefix + "_img.png")
image_to_disk(np.array(im[0]), path_prefix + "_img.png", self.MIN, self.MAX)
mask_to_disk(np.array(lbl[0]), path_prefix + "_lbl.png", self.n_classes)
return im, lbl
@ -185,8 +186,7 @@ class SectionLoader(data.Dataset):
class TrainSectionLoader(SectionLoader):
"""
Training data loader for sections
:param str data_dir: Root directory for training/test data
:param str n_classes: number of segmentation mask classes
:param config: configuration object to define other attributes in loaders
:param str split: split file to use for loading patches
:param bool is_transform: Transform patch to dimensions expected by PyTorch
:param list augmentations: Data augmentations to apply to patches
@ -197,8 +197,7 @@ class TrainSectionLoader(SectionLoader):
def __init__(
self,
data_dir,
n_classes,
config,
split="train",
is_transform=True,
augmentations=None,
@ -207,8 +206,7 @@ class TrainSectionLoader(SectionLoader):
debug=False,
):
super(TrainSectionLoader, self).__init__(
data_dir,
n_classes,
config,
split=split,
is_transform=is_transform,
augmentations=augmentations,
@ -240,8 +238,7 @@ class TrainSectionLoader(SectionLoader):
class TrainSectionLoaderWithDepth(TrainSectionLoader):
"""
Section data loader that includes additional channel for depth
:param str data_dir: Root directory for training/test data
:param str n_classes: number of segmentation mask classes
:param config: configuration object to define other attributes in loaders
:param str split: split file to use for loading patches
:param bool is_transform: Transform patch to dimensions expected by PyTorch
:param list augmentations: Data augmentations to apply to patches
@ -252,8 +249,7 @@ class TrainSectionLoaderWithDepth(TrainSectionLoader):
def __init__(
self,
data_dir,
n_classes,
config,
split="train",
is_transform=True,
augmentations=None,
@ -262,8 +258,7 @@ class TrainSectionLoaderWithDepth(TrainSectionLoader):
debug=False,
):
super(TrainSectionLoaderWithDepth, self).__init__(
data_dir,
n_classes,
config,
split=split,
is_transform=is_transform,
augmentations=augmentations,
@ -304,8 +299,7 @@ class TrainSectionLoaderWithDepth(TrainSectionLoader):
class TestSectionLoader(SectionLoader):
"""
Test data loader for sections
:param str data_dir: Root directory for training/test data
:param str n_classes: number of segmentation mask classes
:param config: configuration object to define other attributes in loaders
:param str split: split file to use for loading patches
:param bool is_transform: Transform patch to dimensions expected by PyTorch
:param list augmentations: Data augmentations to apply to patches
@ -316,8 +310,7 @@ class TestSectionLoader(SectionLoader):
def __init__(
self,
data_dir,
n_classes,
config,
split="test1",
is_transform=True,
augmentations=None,
@ -326,7 +319,7 @@ class TestSectionLoader(SectionLoader):
debug=False,
):
super(TestSectionLoader, self).__init__(
data_dir, n_classes, split=split, is_transform=is_transform, augmentations=augmentations, debug=debug,
config, split=split, is_transform=is_transform, augmentations=augmentations, debug=debug,
)
if "test1" in self.split:
@ -356,8 +349,7 @@ class TestSectionLoader(SectionLoader):
class TestSectionLoaderWithDepth(TestSectionLoader):
"""
Test data loader for sections that includes additional channel for depth
:param str data_dir: Root directory for training/test data
:param str n_classes: number of segmentation mask classes
:param config: configuration object to define other attributes in loaders
:param str split: split file to use for loading patches
:param bool is_transform: Transform patch to dimensions expected by PyTorch
:param list augmentations: Data augmentations to apply to patches
@ -368,8 +360,7 @@ class TestSectionLoaderWithDepth(TestSectionLoader):
def __init__(
self,
data_dir,
n_classes,
config,
split="test1",
is_transform=True,
augmentations=None,
@ -378,8 +369,7 @@ class TestSectionLoaderWithDepth(TestSectionLoader):
debug=False,
):
super(TestSectionLoaderWithDepth, self).__init__(
data_dir,
n_classes,
config,
split=split,
is_transform=is_transform,
augmentations=augmentations,
@ -407,11 +397,11 @@ class TestSectionLoaderWithDepth(TestSectionLoader):
# dump images before augmentation
if self.debug:
outdir = f"debug/testSectionLoaderWithDepth_{self.split}_raw"
outdir = f"debug/test/testSectionLoaderWithDepth_{self.split}_raw"
generate_path(outdir)
# this needs to take the first dimension of image (no depth) but lbl only has 1 dim
path_prefix = f"{outdir}/index_{index}_section_{section_name}"
image_to_disk(im[0, :, :], path_prefix + "_img.png")
image_to_disk(im[0, :, :], path_prefix + "_img.png", self.MIN, self.MAX)
mask_to_disk(lbl, path_prefix + "_lbl.png", self.n_classes)
if self.augmentations is not None:
@ -425,12 +415,10 @@ class TestSectionLoaderWithDepth(TestSectionLoader):
# dump images and labels to disk after augmentation
if self.debug:
outdir = (
f"debug/testSectionLoaderWithDepth_{self.split}_{'aug' if self.augmentations is not None else 'noaug'}"
)
outdir = f"debug/test/testSectionLoaderWithDepth_{self.split}_{'aug' if self.augmentations is not None else 'noaug'}"
generate_path(outdir)
path_prefix = f"{outdir}/index_{index}_section_{section_name}"
image_to_disk(np.array(im[0, :, :]), path_prefix + "_img.png")
image_to_disk(np.array(im[0, :, :]), path_prefix + "_img.png", self.MIN, self.MAX)
mask_to_disk(np.array(lbl[0, :, :]), path_prefix + "_lbl.png", self.n_classes)
return im, lbl
@ -444,33 +432,41 @@ def _transform_WH_to_HW(numpy_array):
class PatchLoader(data.Dataset):
"""
Base Data loader for the patch-based deconvnet
:param str data_dir: Root directory for training/test data
:param str n_classes: number of segmentation mask classes
:param int stride: training data stride
:param int patch_size: Size of patch for training
:param config: configuration object to define other attributes in loaders
:param str split: split file to use for loading patches
:param bool is_transform: Transform patch to dimensions expected by PyTorch
:param list augmentations: Data augmentations to apply to patches
:param bool debug: enable debugging output
"""
def __init__(
self, data_dir, n_classes, stride=30, patch_size=99, is_transform=True, augmentations=None, debug=False,
):
self.data_dir = data_dir
def __init__(self, config, split="train", is_transform=True, augmentations=None, debug=False):
self.data_dir = config.DATASET.ROOT
self.n_classes = config.DATASET.NUM_CLASSES
self.split = split
self.MIN = config.DATASET.MIN
self.MAX = config.DATASET.MAX
self.patch_size = config.TRAIN.PATCH_SIZE
self.stride = config.TRAIN.STRIDE
self.is_transform = is_transform
self.augmentations = augmentations
self.n_classes = n_classes
self.patches = list()
self.patch_size = patch_size
self.stride = stride
self.debug = debug
def pad_volume(self, volume):
def pad_volume(self, volume, value):
"""
Only used for train/val!! Not test.
Pads a 3D numpy array with a constant value along the depth direction only.
Args:
volume (numpy ndarray): numpy array containing the seismic amplitude or labels.
value (int): value to pad the array with.
"""
return np.pad(volume, pad_width=self.patch_size, mode="constant", constant_values=255)
return np.pad(
volume,
pad_width=[(0, 0), (0, 0), (self.patch_size, self.patch_size)],
mode="constant",
constant_values=value,
)
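# Illustrative example of the padding above: with pad_width
# [(0, 0), (0, 0), (patch_size, patch_size)] only the last (depth) axis grows,
# e.g. a hypothetical volume of shape (401, 701, 255) with patch_size=99
# pads to (401, 701, 255 + 2 * 99) = (401, 701, 453).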
def __len__(self):
return len(self.patches)
@ -479,12 +475,7 @@ class PatchLoader(data.Dataset):
patch_name = self.patches[index]
direction, idx, xdx, ddx = patch_name.split(sep="_")
# Shift offsets the padding that is added in training
# shift = self.patch_size if "test" not in self.split else 0
# Remember we are cancelling the shift since we no longer pad
shift = 0
idx, xdx, ddx = int(idx) + shift, int(xdx) + shift, int(ddx) + shift
idx, xdx, ddx = int(idx), int(xdx), int(ddx)
if direction == "i":
im = self.seismic[idx, xdx : xdx + self.patch_size, ddx : ddx + self.patch_size]
@ -500,7 +491,7 @@ class PatchLoader(data.Dataset):
outdir = f"debug/patchLoader_{self.split}_raw"
generate_path(outdir)
path_prefix = f"{outdir}/index_{index}_section_{patch_name}"
image_to_disk(im, path_prefix + "_img.png")
image_to_disk(im, path_prefix + "_img.png", self.MIN, self.MAX)
mask_to_disk(lbl, path_prefix + "_lbl.png", self.n_classes)
if self.augmentations is not None:
@ -512,7 +503,7 @@ class PatchLoader(data.Dataset):
outdir = f"patchLoader_{self.split}_{'aug' if self.augmentations is not None else 'noaug'}"
generate_path(outdir)
path_prefix = f"{outdir}/{index}"
image_to_disk(im, path_prefix + "_img.png")
image_to_disk(im, path_prefix + "_img.png", self.MIN, self.MAX)
mask_to_disk(lbl, path_prefix + "_lbl.png", self.n_classes)
if self.is_transform:
@ -523,7 +514,7 @@ class PatchLoader(data.Dataset):
outdir = f"debug/patchLoader_{self.split}_{'aug' if self.augmentations is not None else 'noaug'}"
generate_path(outdir)
path_prefix = f"{outdir}/index_{index}_section_{patch_name}"
image_to_disk(np.array(im[0, :, :]), path_prefix + "_img.png")
image_to_disk(np.array(im[0, :, :]), path_prefix + "_img.png", self.MIN, self.MAX)
mask_to_disk(np.array(lbl[0, :, :]), path_prefix + "_lbl.png", self.n_classes)
return im, lbl
@ -536,46 +527,10 @@ class PatchLoader(data.Dataset):
return torch.from_numpy(img).float(), torch.from_numpy(lbl).long()
class TestPatchLoader(PatchLoader):
"""
Test Data loader for the patch-based deconvnet
:param str data_dir: Root directory for training/test data
:param str n_classes: number of segmentation mask classes
:param int stride: training data stride
:param int patch_size: Size of patch for training
:param bool is_transform: Transform patch to dimensions expected by PyTorch
:param list augmentations: Data augmentations to apply to patches
:param bool debug: enable debugging output
"""
def __init__(
self, data_dir, n_classes, stride=30, patch_size=99, is_transform=True, augmentations=None, debug=False
):
super(TestPatchLoader, self).__init__(
data_dir,
n_classes,
stride=stride,
patch_size=patch_size,
is_transform=is_transform,
augmentations=augmentations,
debug=debug,
)
## Warning: this is not used or tested
raise NotImplementedError("This class is not correctly implemented.")
self.seismic = np.load(_train_data_for(self.data_dir))
self.labels = np.load(_train_labels_for(self.data_dir))
patch_list = tuple(open(txt_path, "r"))
patch_list = [id_.rstrip() for id_ in patch_list]
self.patches = patch_list
class TrainPatchLoader(PatchLoader):
"""
Train data loader for the patch-based deconvnet
:param str data_dir: Root directory for training/test data
:param int stride: training data stride
:param int patch_size: Size of patch for training
:param config: configuration object to define other attributes in loaders
:param str split: split file to use for loading patches
:param bool is_transform: Transform patch to dimensions expected by PyTorch
:param list augmentations: Data augmentations to apply to patches
@ -584,11 +539,8 @@ class TrainPatchLoader(PatchLoader):
def __init__(
self,
data_dir,
n_classes,
config,
split="train",
stride=30,
patch_size=99,
is_transform=True,
augmentations=None,
seismic_path=None,
@ -596,16 +548,9 @@ class TrainPatchLoader(PatchLoader):
debug=False,
):
super(TrainPatchLoader, self).__init__(
data_dir,
n_classes,
stride=stride,
patch_size=patch_size,
is_transform=is_transform,
augmentations=augmentations,
debug=debug,
config, is_transform=is_transform, augmentations=augmentations, debug=debug,
)
warnings.warn("This no longer pads the volume")
if seismic_path is not None and label_path is not None:
# Load npy files (seismc and corresponding labels) from provided
# location (path)
@ -618,8 +563,11 @@ class TrainPatchLoader(PatchLoader):
else:
self.seismic = np.load(_train_data_for(self.data_dir))
self.labels = np.load(_train_labels_for(self.data_dir))
# We are in train/val mode. Most likely the test splits are not saved yet,
# so don't attempt to load them.
# pad the data:
self.seismic = self.pad_volume(self.seismic, value=0)
self.labels = self.pad_volume(self.labels, value=255)
self.split = split
# reading the file names for split
txt_path = path.join(self.data_dir, "splits", "patch_" + split + ".txt")
@ -631,9 +579,7 @@ class TrainPatchLoader(PatchLoader):
class TrainPatchLoaderWithDepth(TrainPatchLoader):
"""
Train data loader for the patch-based deconvnet with patch depth channel
:param str data_dir: Root directory for training/test data
:param int stride: training data stride
:param int patch_size: Size of patch for training
:param config: configuration object to define other attributes in loaders
:param str split: split file to use for loading patches
:param bool is_transform: Transform patch to dimensions expected by PyTorch
:param list augmentations: Data augmentations to apply to patches
@ -642,10 +588,8 @@ class TrainPatchLoaderWithDepth(TrainPatchLoader):
def __init__(
self,
data_dir,
config,
split="train",
stride=30,
patch_size=99,
is_transform=True,
augmentations=None,
seismic_path=None,
@ -653,10 +597,8 @@ class TrainPatchLoaderWithDepth(TrainPatchLoader):
debug=False,
):
super(TrainPatchLoaderWithDepth, self).__init__(
data_dir,
config,
split=split,
stride=stride,
patch_size=patch_size,
is_transform=is_transform,
augmentations=augmentations,
seismic_path=seismic_path,
@ -668,12 +610,7 @@ class TrainPatchLoaderWithDepth(TrainPatchLoader):
patch_name = self.patches[index]
direction, idx, xdx, ddx = patch_name.split(sep="_")
# Shift offsets the padding that is added in training
# shift = self.patch_size if "test" not in self.split else 0
# Remember we are cancelling the shift since we no longer pad
shift = 0
idx, xdx, ddx = int(idx) + shift, int(xdx) + shift, int(ddx) + shift
idx, xdx, ddx = int(idx), int(xdx), int(ddx)
if direction == "i":
im = self.seismic[idx, xdx : xdx + self.patch_size, ddx : ddx + self.patch_size]
@ -705,9 +642,7 @@ def _transform_HWC_to_CHW(numpy_array):
class TrainPatchLoaderWithSectionDepth(TrainPatchLoader):
"""
Train data loader for the patch-based deconvnet section depth channel
:param str data_dir: Root directory for training/test data
:param int stride: training data stride
:param int patch_size: Size of patch for training
:param config: configuration object to define other attributes in loaders
:param str split: split file to use for loading patches
:param bool is_transform: Transform patch to dimensions expected by PyTorch
:param list augmentations: Data augmentations to apply to patches
@ -718,11 +653,8 @@ class TrainPatchLoaderWithSectionDepth(TrainPatchLoader):
def __init__(
self,
data_dir,
n_classes,
config,
split="train",
stride=30,
patch_size=99,
is_transform=True,
augmentations=None,
seismic_path=None,
@ -730,11 +662,8 @@ class TrainPatchLoaderWithSectionDepth(TrainPatchLoader):
debug=False,
):
super(TrainPatchLoaderWithSectionDepth, self).__init__(
data_dir,
n_classes,
config,
split=split,
stride=stride,
patch_size=patch_size,
is_transform=is_transform,
augmentations=augmentations,
seismic_path=seismic_path,
@ -747,12 +676,7 @@ class TrainPatchLoaderWithSectionDepth(TrainPatchLoader):
patch_name = self.patches[index]
direction, idx, xdx, ddx = patch_name.split(sep="_")
# Shift offsets the padding that is added in training
# shift = self.patch_size if "test" not in self.split else 0
# Remember we are cancelling the shift since we no longer pad
shift = 0
idx, xdx, ddx = int(idx) + shift, int(xdx) + shift, int(ddx) + shift
idx, xdx, ddx = int(idx), int(xdx), int(ddx)
if direction == "i":
im = self.seismic[idx, :, xdx : xdx + self.patch_size, ddx : ddx + self.patch_size]
@ -769,7 +693,7 @@ class TrainPatchLoaderWithSectionDepth(TrainPatchLoader):
outdir = f"debug/patchLoaderWithSectionDepth_{self.split}_raw"
generate_path(outdir)
path_prefix = f"{outdir}/index_{index}_section_{patch_name}"
image_to_disk(im[0, :, :], path_prefix + "_img.png")
image_to_disk(im[0, :, :], path_prefix + "_img.png", self.MIN, self.MAX)
mask_to_disk(lbl, path_prefix + "_lbl.png", self.n_classes)
if self.augmentations is not None:
@ -783,7 +707,7 @@ class TrainPatchLoaderWithSectionDepth(TrainPatchLoader):
outdir = f"patchLoaderWithSectionDepth_{self.split}_{'aug' if self.augmentations is not None else 'noaug'}"
generate_path(outdir)
path_prefix = f"{outdir}/{index}"
image_to_disk(im[0, :, :], path_prefix + "_img.png")
image_to_disk(im[0, :, :], path_prefix + "_img.png", self.MIN, self.MAX)
mask_to_disk(lbl, path_prefix + "_lbl.png", self.n_classes)
if self.is_transform:
@ -796,7 +720,7 @@ class TrainPatchLoaderWithSectionDepth(TrainPatchLoader):
)
generate_path(outdir)
path_prefix = f"{outdir}/index_{index}_section_{patch_name}"
image_to_disk(np.array(im[0, :, :]), path_prefix + "_img.png")
image_to_disk(np.array(im[0, :, :]), path_prefix + "_img.png", self.MIN, self.MAX)
mask_to_disk(np.array(lbl[0, :, :]), path_prefix + "_lbl.png", self.n_classes)
return im, lbl
@ -812,8 +736,6 @@ _TRAIN_PATCH_LOADERS = {
"patch": TrainPatchLoaderWithDepth,
}
_TRAIN_SECTION_LOADERS = {"section": TrainSectionLoaderWithDepth}
def get_patch_loader(cfg):
assert str(cfg.TRAIN.DEPTH).lower() in [
@ -825,6 +747,9 @@ def get_patch_loader(cfg):
return _TRAIN_PATCH_LOADERS.get(cfg.TRAIN.DEPTH, TrainPatchLoader)
_TRAIN_SECTION_LOADERS = {"section": TrainSectionLoaderWithDepth}
def get_section_loader(cfg):
assert str(cfg.TRAIN.DEPTH).lower() in [
"section",


@ -6,7 +6,11 @@ Tests for TrainLoader and TestLoader classes when overriding the file names of t
import tempfile
import numpy as np
from interpretation.deepseismic_interpretation.dutchf3.data import get_test_loader, TrainPatchLoaderWithDepth, TrainSectionLoaderWithDepth
from deepseismic_interpretation.dutchf3.data import (
get_test_loader,
TrainPatchLoaderWithDepth,
TrainSectionLoaderWithDepth,
)
import pytest
import yacs.config
import os
@ -15,8 +19,10 @@ import os
IL = 5
XL = 10
D = 8
N_CLASSES = 2
CONFIG_FILE = "./experiments/interpretation/dutchf3_patch/configs/unet.yaml"
CONFIG_FILE = "./examples/interpretation/notebooks/configs/unet.yaml"
with open(CONFIG_FILE, "rt") as f_read:
config = yacs.config.load_cfg(f_read)
@ -52,10 +58,11 @@ def test_TestSectionLoader_should_load_data_from_test1_set():
generate_npy_files(os.path.join(data_dir, "test_once", "test1_labels.npy"), labels)
txt_path = os.path.join(data_dir, "splits", "section_test1.txt")
open(txt_path, 'a').close()
open(txt_path, "a").close()
TestSectionLoader = get_test_loader(config)
test_set = TestSectionLoader(data_dir = data_dir, split = 'test1')
config.merge_from_list(["DATASET.ROOT", data_dir])
test_set = TestSectionLoader(config, split="test1")
assert_dimensions(test_set)
@ -74,10 +81,11 @@ def test_TestSectionLoader_should_load_data_from_test2_set():
generate_npy_files(os.path.join(data_dir, "test_once", "test2_labels.npy"), labels)
txt_path = os.path.join(data_dir, "splits", "section_test2.txt")
open(txt_path, 'a').close()
open(txt_path, "a").close()
TestSectionLoader = get_test_loader(config)
test_set = TestSectionLoader(data_dir = data_dir, split = 'test2')
config.merge_from_list(["DATASET.ROOT", data_dir])
test_set = TestSectionLoader(config, split="test2")
assert_dimensions(test_set)
@ -94,143 +102,21 @@ def test_TestSectionLoader_should_load_data_from_path_override_data():
generate_npy_files(os.path.join(data_dir, "volume_name", "labels.npy"), labels)
txt_path = os.path.join(data_dir, "splits", "section_volume_name.txt")
open(txt_path, 'a').close()
open(txt_path, "a").close()
TestSectionLoader = get_test_loader(config)
test_set = TestSectionLoader(data_dir = data_dir,
split = "volume_name",
is_transform = True,
augmentations = None,
seismic_path = os.path.join(data_dir, "volume_name", "seismic.npy"),
label_path = os.path.join(data_dir, "volume_name", "labels.npy"))
config.merge_from_list(["DATASET.ROOT", data_dir])
test_set = TestSectionLoader(
config,
split="volume_name",
is_transform=True,
augmentations=None,
seismic_path=os.path.join(data_dir, "volume_name", "seismic.npy"),
label_path=os.path.join(data_dir, "volume_name", "labels.npy"),
)
assert_dimensions(test_set)
def test_TrainSectionLoaderWithDepth_should_fail_on_empty_file_names(tmpdir):
"""
Check for exception when files do not exist
"""
# Test
with pytest.raises(Exception) as excinfo:
_ = TrainSectionLoaderWithDepth(
data_dir = tmpdir,
split = "volume_name",
is_transform=True,
augmentations=None,
seismic_path = "",
label_path = ""
)
assert "does not exist" in str(excinfo.value)
def test_TrainSectionLoaderWithDepth_should_fail_on_missing_seismic_file(tmpdir):
"""
Check for exception when training param is empty
"""
# Setup
os.makedirs(os.path.join(tmpdir, "volume_name"))
os.makedirs(os.path.join(tmpdir, "splits"))
labels = np.ones([IL, XL, D])
generate_npy_files(os.path.join(tmpdir, "volume_name", "labels.npy"), labels)
txt_path = os.path.join(tmpdir, "splits", "patch_volume_name.txt")
open(txt_path, 'a').close()
# Test
with pytest.raises(Exception) as excinfo:
_ = TrainSectionLoaderWithDepth(
data_dir = tmpdir,
split = "volume_name",
is_transform=True,
augmentations=None,
seismic_path=os.path.join(tmpdir, "volume_name", "seismic.npy"),
label_path=os.path.join(tmpdir, "volume_name", "labels.npy")
)
assert "does not exist" in str(excinfo.value)
def test_TrainSectionLoaderWithDepth_should_fail_on_missing_label_file(tmpdir):
"""
Check for exception when training param is empty
"""
# Setup
os.makedirs(os.path.join(tmpdir, "volume_name"))
os.makedirs(os.path.join(tmpdir, "splits"))
labels = np.ones([IL, XL, D])
generate_npy_files(os.path.join(tmpdir, "volume_name", "labels.npy"), labels)
txt_path = os.path.join(tmpdir, "splits", "patch_volume_name.txt")
open(txt_path, 'a').close()
# Test
with pytest.raises(Exception) as excinfo:
_ = TrainSectionLoaderWithDepth(
data_dir = tmpdir,
split = "volume_name",
is_transform=True,
augmentations=None,
seismic_path=os.path.join(tmpdir, "volume_name", "seismic.npy"),
label_path=os.path.join(tmpdir, "volume_name", "labels.npy")
)
assert "does not exist" in str(excinfo.value)
def test_TrainSectionLoaderWithDepth_should_load_with_one_train_and_label_file(tmpdir):
"""
Check for successful class instantiation w/ single npy file for train & label
"""
# Setup
os.makedirs(os.path.join(tmpdir, "volume_name"))
os.makedirs(os.path.join(tmpdir, "splits"))
seimic = np.zeros([IL, XL, D])
generate_npy_files(os.path.join(tmpdir, "volume_name", "seismic.npy"), seimic)
labels = np.ones([IL, XL, D])
generate_npy_files(os.path.join(tmpdir, "volume_name", "labels.npy"), labels)
txt_path = os.path.join(tmpdir, "splits", "section_volume_name.txt")
open(txt_path, 'a').close()
# Test
train_set = TrainSectionLoaderWithDepth(
data_dir = tmpdir,
split = "volume_name",
is_transform=True,
augmentations=None,
seismic_path=os.path.join(tmpdir, "volume_name", "seismic.npy"),
label_path=os.path.join(tmpdir, "volume_name", "labels.npy")
)
assert train_set.labels.shape == (IL, XL, D)
assert train_set.seismic.shape == (IL, 3, XL, D)
def test_TrainPatchLoaderWithDepth_should_fail_on_empty_file_names(tmpdir):
"""
Check for exception when files do not exist
"""
# Test
with pytest.raises(Exception) as excinfo:
_ = TrainPatchLoaderWithDepth(
data_dir = tmpdir,
split = "volume_name",
is_transform=True,
stride=25,
patch_size=100,
augmentations=None,
seismic_path = "",
label_path = ""
)
assert "does not exist" in str(excinfo.value)
def test_TrainPatchLoaderWithDepth_should_fail_on_missing_seismic_file(tmpdir):
"""
@ -244,20 +130,20 @@ def test_TrainPatchLoaderWithDepth_should_fail_on_missing_seismic_file(tmpdir):
generate_npy_files(os.path.join(tmpdir, "volume_name", "labels.npy"), labels)
txt_path = os.path.join(tmpdir, "splits", "patch_volume_name.txt")
open(txt_path, 'a').close()
open(txt_path, "a").close()
config.merge_from_list(["DATASET.ROOT", str(tmpdir)])
# Test
with pytest.raises(Exception) as excinfo:
_ = TrainPatchLoaderWithDepth(
data_dir = tmpdir,
split = "volume_name",
config,
split="volume_name",
is_transform=True,
stride=25,
patch_size=100,
augmentations=None,
seismic_path=os.path.join(tmpdir, "volume_name", "seismic.npy"),
label_path=os.path.join(tmpdir, "volume_name", "labels.npy")
label_path=os.path.join(tmpdir, "volume_name", "labels.npy"),
)
assert "does not exist" in str(excinfo.value)
@ -274,20 +160,20 @@ def test_TrainPatchLoaderWithDepth_should_fail_on_missing_label_file(tmpdir):
generate_npy_files(os.path.join(tmpdir, "volume_name", "seismic.npy"), seimic)
txt_path = os.path.join(tmpdir, "splits", "patch_volume_name.txt")
open(txt_path, 'a').close()
open(txt_path, "a").close()
config.merge_from_list(["DATASET.ROOT", str(tmpdir)])
# Test
with pytest.raises(Exception) as excinfo:
_ = TrainPatchLoaderWithDepth(
data_dir = tmpdir,
split = "volume_name",
config,
split="volume_name",
is_transform=True,
stride=25,
patch_size=100,
augmentations=None,
seismic_path=os.path.join(tmpdir, "volume_name", "seismic.npy"),
label_path=os.path.join(tmpdir, "volume_name", "labels.npy")
label_path=os.path.join(tmpdir, "volume_name", "labels.npy"),
)
assert "does not exist" in str(excinfo.value)
@ -306,20 +192,21 @@ def test_TrainPatchLoaderWithDepth_should_load_with_one_train_and_label_file(tmp
labels = np.ones([IL, XL, D])
generate_npy_files(os.path.join(tmpdir, "volume_name", "labels.npy"), labels)
txt_path = os.path.join(tmpdir, "splits", "patch_volume_name.txt")
open(txt_path, 'a').close()
txt_dir = os.path.join(tmpdir, "splits")
txt_path = os.path.join(txt_dir, "patch_volume_name.txt")
open(txt_path, "a").close()
config.merge_from_list(["DATASET.ROOT", str(tmpdir)])
# Test
train_set = TrainPatchLoaderWithDepth(
data_dir = tmpdir,
split = "volume_name",
config,
split="volume_name",
is_transform=True,
stride=25,
patch_size=100,
augmentations=None,
seismic_path=os.path.join(tmpdir, "volume_name", "seismic.npy"),
label_path=os.path.join(tmpdir, "volume_name", "labels.npy")
label_path=os.path.join(tmpdir, "volume_name", "labels.npy"),
)
assert train_set.labels.shape == (IL, XL, D)
assert train_set.seismic.shape == (IL, XL, D)
assert train_set.labels.shape == (IL, XL, D + 2 * config.TRAIN.PATCH_SIZE)
assert train_set.seismic.shape == (IL, XL, D + 2 * config.TRAIN.PATCH_SIZE)


@ -0,0 +1,2 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.


@ -0,0 +1,2 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.


@ -0,0 +1,2 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.


@ -18,7 +18,7 @@ def _torch_hist(label_true, label_pred, n_class):
Returns:
[type]: [description]
"""
assert len(label_true.shape) == 1, "Labels need to be 1D"
assert len(label_pred.shape) == 1, "Predictions need to be 1D"
mask = (label_true >= 0) & (label_true < n_class)


@ -0,0 +1,148 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.
"""
Utility Script to convert segy files to blocks of numpy arrays and save to individual npy files
"""
import os
import timeit
import argparse
import numpy as np
from deepseismic_interpretation.segyconverter.utils import segyextract, dataprep
import json
K = 12
MIN_VAL = 0
MAX_VAL = 1
def filter_data(output_dir, stddev_file, k, min_range, max_range, clip, normalize):
"""
Normalization step on all files in output_dir. This function overwrites the existing
data file
:param str output_dir: Directory path of all npy files to normalize
:param str stddev_file: txt file containing standard deviation result
:param int k: number of standard deviation to be used in normalization
:param float min_range: minimum range value
:param float max_range: maximum range value
:param clip: flag to turn on/off clip
:param normalize: flag to turn on/off normalization.
"""
txt_file = os.path.join(output_dir, stddev_file)
if not os.path.isfile(txt_file):
raise Exception("Std Deviation file could not be found")
with open(os.path.join(txt_file), "r") as f:
metadatastr = f.read()
try:
metadata = json.loads(metadatastr)
stddev = float(metadata["stddev"])
mean = float(metadata["mean"])
except ValueError:
raise Exception("stddev value not valid: {}".format(metadatastr))
npy_files = list(f for f in os.listdir(output_dir) if f.endswith(".npy"))
for local_filename in npy_files:
cube = np.load(os.path.join(output_dir, local_filename))
if normalize or clip:
cube = dataprep.apply(cube, stddev, mean, k, min_range, max_range, clip=clip, normalize=normalize)
np.save(os.path.join(output_dir, local_filename), cube)
def main(
input_file,
output_dir,
prefix,
iline=189,
xline=193,
metadata_only=False,
stride=128,
cube_size=-1,
normalize=True,
clip=True,
):
"""
Select a single column out of the segy file and generate all cubes in the z(time)
direction. The column is indexed by the inline and xline. To use this command, you
should have already run the metadata extract to determine the
ranges of the inlines and xlines. It will error out if the range is incorrect
Sample call: python3 convert_segy.py --input_file
seismic_data.segy --prefix seismic --output_dir ./seismic
:param str input_file: input segy file path
:param str output_dir: output directory to save npy files
:param str prefix: file prefix for npy files
:param int iline: byte location for inlines
:param int xline: byte location for crosslines
:param bool metadata_only: Only return the metadata of the segy file
:param int stride: overlap between cubes; stride == cube_size means no overlap
:param int cube_size: size of cubes to generate
"""
if not os.path.exists(output_dir):
os.makedirs(output_dir)
fast_indexes, slow_indexes, trace_headers, sample_size = segyextract.get_segy_metadata(input_file, iline, xline)
print("\tFast Lines: {} to {} ({} lines)".format(np.min(fast_indexes), np.max(fast_indexes), len(fast_indexes)))
print("\tSlow Lines: {} to {} ({} lines)".format(np.min(slow_indexes), np.max(slow_indexes), len(slow_indexes)))
print("\tSample Size: {}".format(sample_size))
print("\tTrace Count: {}".format(len(trace_headers)))
print("\tFirst five distinct Fast Line Indexes: {}".format(fast_indexes[0:5]))
print("\tFirst five distinct Slow Line Indexes: {}".format(slow_indexes[0:5]))
print("\tFirst five fast trace ids: {}".format(trace_headers["fast"][0:5].values))
print("\tFirst five slow trace ids: {}".format(trace_headers["slow"][0:5].values))
if not metadata_only:
process_time_segy = 0
if cube_size == -1:
# only generate one npy file
wrapped_processor_segy = segyextract.timewrapper(
segyextract.process_segy_data_into_single_array, input_file, output_dir, prefix, iline, xline
)
process_time_segy = timeit.timeit(wrapped_processor_segy, number=1)
else:
wrapped_processor_segy = segyextract.timewrapper(
segyextract.process_segy_data, input_file, output_dir, prefix, stride=stride, n_points=cube_size
)
process_time_segy = timeit.timeit(wrapped_processor_segy, number=1)
print(f"Completed SEG-Y converstion in: {process_time_segy}")
# At this point, there should be npy files in the output directory + one file containing the std deviation found in the segy
print("Preparing File")
timed_filter_data = segyextract.timewrapper(
filter_data, output_dir, f"{prefix}_stats.json", K, MIN_VAL, MAX_VAL, clip=clip, normalize=normalize
)
process_time_normalize = timeit.timeit(timed_filter_data, number=1)
print(f"Completed file preparation in {process_time_normalize} seconds")
if __name__ == "__main__":
parser = argparse.ArgumentParser("train")
parser.add_argument("--prefix", type=str, help="prefix label for output files", required=True)
parser.add_argument("--input_file", type=str, help="segy file path", required=True)
parser.add_argument("--output_dir", type=str, help="Output files are written to this directory", default=".")
parser.add_argument("--metadata_only", action="store_true", help="Only produce inline,xline metadata")
parser.add_argument("--iline", type=int, default=189, help="segy file path")
parser.add_argument("--xline", type=int, default=193, help="segy file path")
parser.add_argument("--cube_size", type=int, default=-1, help="cube dimensions")
parser.add_argument("--stride", type=int, default=128, help="stride")
parser.add_argument("--normalize", action="store_true", help="Normalization flag - clip and normalize the data")
parser.add_argument("--clip", action="store_true", help="Clipping flag - only clip the data")
args = parser.parse_args()
localfile = args.input_file
main(
args.input_file,
args.output_dir,
args.prefix,
args.iline,
args.xline,
args.metadata_only,
args.stride,
args.cube_size,
args.normalize,
args.clip,
)


@ -0,0 +1,212 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.
"""
Test that the current scripts can run from the command line
"""
import os
import numpy as np
from deepseismic_interpretation.segyconverter import convert_segy
from deepseismic_interpretation.segyconverter.test import test_util
import pytest
import segyio
MAX_RANGE = 1
MIN_RANGE = 0
ERROR_EXIT_CODE = 99
@pytest.fixture(scope="class")
def segy_single_file(request):
# setup code
# create segy file
inlinefile = "./inlinesortsample.segy"
test_util.create_segy_file(
lambda il, xl: not ((il < 20 and xl < 125) or (il > 40 and xl > 250)),
inlinefile,
segyio.TraceSortingFormat.INLINE_SORTING,
)
# inject class variables
request.cls.testfile = inlinefile
yield
# teardown code
os.remove(inlinefile)
@pytest.mark.usefixtures("segy_single_file")
class TestConvertSEGY:
testfile = None # Set by segy_file fixture
def test_convert_segy_generates_single_npy(self, tmpdir):
# Setup
prefix = "volume1"
input_file = self.testfile
output_dir = tmpdir.strpath
metadata_only = False
iline = 189
xline = 193
cube_size = -1
stride = 128
normalize = True
clip = True
inputpath = ""
# Test
convert_segy.main(
input_file, output_dir, prefix, iline, xline, metadata_only, stride, cube_size, normalize, clip
)
# Validate
npy_files = test_util.get_npy_files(tmpdir.strpath)
assert len(npy_files) == 1
min_val, max_val = _get_min_max(tmpdir.strpath)
assert min_val >= MIN_RANGE
assert max_val <= MAX_RANGE
def test_convert_segy_generates_multiple_npy_files(self, tmpdir):
"""
Run convert_segy.main with a positive cube_size and check that multiple npy files are written
:param tmpdir: pytest fixture providing a temporary output directory
"""
# Setup
prefix = "volume1"
input_file = self.testfile
output_dir = tmpdir.strpath
metadata_only = False
iline = 189
xline = 193
cube_size = 128
stride = 128
normalize = True
inputpath = ""
clip = True
# Test
convert_segy.main(
input_file, output_dir, prefix, iline, xline, metadata_only, stride, cube_size, normalize, clip
)
# Validate
npy_files = test_util.get_npy_files(tmpdir.strpath)
assert len(npy_files) == 2
def test_convert_segy_normalizes_data(self, tmpdir):
"""
Run convert_segy.main with normalization enabled and check that output values fall within [MIN_RANGE, MAX_RANGE]
:param tmpdir: pytest fixture providing a temporary output directory
"""
# Setup
prefix = "volume1"
input_file = self.testfile
output_dir = tmpdir.strpath
metadata_only = False
iline = 189
xline = 193
cube_size = 128
stride = 128
normalize = True
inputpath = ""
clip = True
# Test
convert_segy.main(
input_file, output_dir, prefix, iline, xline, metadata_only, stride, cube_size, normalize, clip
)
# Validate
npy_files = test_util.get_npy_files(tmpdir.strpath)
assert len(npy_files) == 2
min_val, max_val = _get_min_max(tmpdir.strpath)
assert min_val >= MIN_RANGE
assert max_val <= MAX_RANGE
def test_convert_segy_clips_data(self, tmpdir):
"""
Run convert_segy.main with clipping only and check that output values are clipped to the expected range
:param tmpdir: pytest fixture providing a temporary output directory
"""
# Setup
prefix = "volume1"
input_file = self.testfile
output_dir = tmpdir.strpath
metadata_only = False
iline = 189
xline = 193
cube_size = 128
stride = 128
normalize = False
inputpath = ""
clip = True
# Test
convert_segy.main(
input_file, output_dir, prefix, iline, xline, metadata_only, stride, cube_size, normalize, clip
)
# Validate
expected_max = 35.59
expected_min = -35.59
npy_files = test_util.get_npy_files(tmpdir.strpath)
assert len(npy_files) == 2
min_val, max_val = _get_min_max(tmpdir.strpath)
assert expected_min == pytest.approx(min_val, rel=1e-3)
assert expected_max == pytest.approx(max_val, rel=1e-3)
def test_convert_segy_copies_exact_data_with_no_normalization(self, tmpdir):
"""
Run convert_segy.main without clipping or normalization and check that output values match the original data range
:param tmpdir: pytest fixture providing a temporary output directory
"""
# Setup
prefix = "volume1"
input_file = self.testfile
output_dir = tmpdir.strpath
metadata_only = False
iline = 189
xline = 193
cube_size = 128
stride = 128
normalize = False
inputpath = ""
clip = False
# Test
convert_segy.main(
input_file, output_dir, prefix, iline, xline, metadata_only, stride, cube_size, normalize, clip
)
# Validate
expected_max = 1039.8
expected_min = -1039.8
npy_files = test_util.get_npy_files(tmpdir.strpath)
assert len(npy_files) == 2
min_val, max_val = _get_min_max(tmpdir.strpath)
assert expected_min == pytest.approx(min_val, rel=1e-3)
assert expected_max == pytest.approx(max_val, rel=1e-3)
def _get_min_max(outputdir):
"""
Find the minimum and maximum values across all npy files in a directory
:param str outputdir: directory to check for npy files
:returns: min_val, max_val of values in npy files
:rtype: float, float
"""
min_val = 0
max_val = 0
npy_files = test_util.get_npy_files(outputdir)
for file in npy_files:
data = np.load(os.path.join(outputdir, file))
this_min = np.amin(data)
this_max = np.amax(data)
if this_min < min_val:
min_val = this_min
if this_max > max_val:
max_val = this_max
return min_val, max_val


@ -0,0 +1,166 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.
"""
Test data normalization
"""
import numpy as np
from deepseismic_interpretation.segyconverter.utils import dataprep
import pytest
INPUT_FOLDER = "./contrib/segyconverter/test/test_data"
MAX_RANGE = 1
MIN_RANGE = 0
K = 12
class TestNormalizeCube:
testcube = None # Set by npy_files fixture
def test_normalize_cube_returns_normalized_values(self):
"""
Test method that normalize one cube by checking if normalized
values are within [min, max] range.
"""
trace = np.linspace(-1, 1, 100, True, dtype=np.single)
cube = np.ones((100, 50, 100)) * trace * 500
# Add values to clip
cube[40, 25, 50] = 700
cube[70, 30, 70] = -700
mean = np.mean(cube)
variance = np.var(cube)
stddev = np.sqrt(variance)
min_clip, max_clip, scale = dataprep.compute_statistics(stddev, mean, MAX_RANGE, K)
norm_block = dataprep.normalize_cube(cube, min_clip, max_clip, scale, MIN_RANGE, MAX_RANGE)
assert np.amax(norm_block) <= MAX_RANGE
assert np.amin(norm_block) >= MIN_RANGE
def test_clip_cube_returns_clipped_values(self):
"""
Test method that clip one cube by checking if clipped
values are within [min_clip, max_clip] range.
"""
trace = np.linspace(-1, 1, 100, True, dtype=np.single)
cube = np.ones((100, 50, 100)) * trace * 500
# Add values to clip
cube[40, 25, 50] = 700
cube[70, 30, 70] = -700
mean = np.mean(cube)
variance = np.var(cube)
stddev = np.sqrt(variance)
min_clip, max_clip, scale = dataprep.compute_statistics(stddev, mean, MAX_RANGE, K)
clipped_block = dataprep.clip_cube(cube, min_clip, max_clip)
assert np.amax(clipped_block) <= max_clip
assert np.amin(clipped_block) >= min_clip
def test_norm_value_is_correct(self):
# Check if normalized value is calculated correctly
min_clip = -18469.875210304104
max_clip = 18469.875210304104
scale = 2.707110872741882e-05
input_value = 2019
expected_norm_value = 0.5546565685206586
norm_v = dataprep.norm_value(input_value, min_clip, max_clip, MIN_RANGE, MAX_RANGE, scale)
assert norm_v == pytest.approx(expected_norm_value, rel=1e-3)
def test_clip_value_is_correct(self):
# Check if normalized value is calculated correctly
min_clip = -18469.875210304104
max_clip = 18469.875210304104
input_value = 2019
expected_clipped_value = 2019
clipped_v = dataprep.clip_value(input_value, min_clip, max_clip)
assert clipped_v == pytest.approx(expected_clipped_value, rel=1e-3)
def test_norm_value_on_cube_is_within_range(self):
# Check if normalized value is within [MIN_RANGE, MAX_RANGE]
trace = np.linspace(-1, 1, 100, True, dtype=np.single)
cube = np.ones((100, 50, 100)) * trace * 500
cube[40, 25, 50] = 7000
cube[70, 30, 70] = -7000
variance = np.var(cube)
stddev = np.sqrt(variance)
mean = np.mean(cube)
v = cube[10, 40, 5]
min_clip, max_clip, scale = dataprep.compute_statistics(stddev, mean, MAX_RANGE, K)
norm_v = dataprep.norm_value(v, min_clip, max_clip, MIN_RANGE, MAX_RANGE, scale)
assert norm_v <= MAX_RANGE
assert norm_v >= MIN_RANGE
pytest.raises(Exception, dataprep.norm_value, v, min_clip * 10, max_clip * 10, MIN_RANGE, MAX_RANGE, scale * 10)
def test_clipped_value_on_cube_is_within_range(self):
# Check if clipped value is within [min_clip, max_clip]
trace = np.linspace(-1, 1, 100, True, dtype=np.single)
cube = np.ones((100, 50, 100)) * trace * 500
cube[40, 25, 50] = 7000
cube[70, 30, 70] = -7000
variance = np.var(cube)
mean = np.mean(cube)
stddev = np.sqrt(variance)
v = cube[10, 40, 5]
min_clip, max_clip, scale = dataprep.compute_statistics(stddev, mean, MAX_RANGE, K)
clipped_v = dataprep.clip_value(v, min_clip, max_clip)
assert clipped_v <= max_clip
assert clipped_v >= min_clip
def test_compute_statistics(self):
# Check if statistics are calculated correctly for provided stddev, max_range and k values
expected_min_clip = -138.693888
expected_max_clip = 138.693888
expected_scale = 0.003605061529459755
mean = 0
stddev = 11.557824
min_clip, max_clip, scale = dataprep.compute_statistics(stddev, mean, MAX_RANGE, K)
assert expected_min_clip == pytest.approx(min_clip, rel=1e-3)
assert expected_max_clip == pytest.approx(max_clip, rel=1e-3)
assert expected_scale == pytest.approx(scale, rel=1e-3)
# Testing division by zero
pytest.raises(Exception, dataprep.compute_statistics, stddev, MAX_RANGE, 0)
pytest.raises(Exception, dataprep.compute_statistics, 0, MAX_RANGE, 0)
def test_apply_should_clip_and_normalize_data(self):
# Check that apply method will clip and normalize the data
trace = np.linspace(-1, 1, 100, True, dtype=np.single)
cube = np.ones((100, 50, 100)) * trace * 500
cube[40, 25, 50] = 7000
cube[70, 30, 70] = -7000
variance = np.var(cube)
stddev = np.sqrt(variance)
mean = np.mean(cube)
norm_block = dataprep.apply(cube, stddev, mean, K, MIN_RANGE, MAX_RANGE)
assert np.amax(norm_block) <= MAX_RANGE
assert np.amin(norm_block) >= MIN_RANGE
norm_block = dataprep.apply(cube, stddev, mean, K, MIN_RANGE, MAX_RANGE, clip=False)
assert np.amax(norm_block) <= MAX_RANGE
assert np.amin(norm_block) >= MIN_RANGE
pytest.raises(Exception, dataprep.apply, cube, stddev, 0, MIN_RANGE, MAX_RANGE)
pytest.raises(Exception, dataprep.apply, cube, 0, K, MIN_RANGE, MAX_RANGE)
invalid_cube = np.empty_like(cube)
invalid_cube[:] = np.nan
pytest.raises(Exception, dataprep.apply, invalid_cube, stddev, 0, MIN_RANGE, MAX_RANGE)
def test_apply_should_clip_data(self):
# Check that apply method will clip the data
trace = np.linspace(-1, 1, 100, True, dtype=np.single)
cube = np.ones((100, 50, 100)) * trace * 500
cube[40, 25, 50] = 7000
cube[70, 30, 70] = -7000
variance = np.var(cube)
stddev = np.sqrt(variance)
mean = np.mean(cube)
min_clip, max_clip, _ = dataprep.compute_statistics(stddev, mean, MAX_RANGE, K)
norm_block = dataprep.apply(cube, stddev, mean, K, MIN_RANGE, MAX_RANGE, clip=True, normalize=False)
assert np.amax(norm_block) <= max_clip
assert np.amin(norm_block) >= min_clip
invalid_cube = np.empty_like(cube)
invalid_cube[:] = np.nan
pytest.raises(
Exception, dataprep.apply, invalid_cube, stddev, 0, MIN_RANGE, MAX_RANGE, clip=True, normalize=False
)
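As a quick sanity check of the numbers used in test_compute_statistics above, the expected clip bounds and scale follow directly from the formulas in dataprep.compute_statistics; the values are reproduced here for illustration only:
stddev, mean, k, max_range = 11.557824, 0.0, 12, 1.0
min_clip = mean - k * stddev               # -138.693888
max_clip = mean + k * stddev               #  138.693888
scale = max_range / (max_clip - min_clip)  # ~0.003605
print(min_clip, max_clip, scale)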


@ -0,0 +1,317 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.
"""
Test the extract functions against a variety of SEGY files and trace_header scenarios
"""
import os
import pytest
import numpy as np
import pandas as pd
from deepseismic_interpretation.segyconverter.utils import segyextract
from deepseismic_interpretation.segyconverter.test import test_util
import segyio
import json
FILENAME = "./normalsegy.segy"
PREFIX = "normal"
@pytest.fixture(scope="class")
def segy_all_files(request):
# setup code
# create segy file
normal_filename = "./normalsegy.segy"
test_util.create_segy_file(lambda il, xl: True, normal_filename)
request.cls.control_file = normal_filename # Set by segy_file fixture
inline_filename = "./inlineerror.segy"
test_util.create_segy_file(
lambda il, xl: not ((il < 20 and xl < 125) or (il > 40 and xl > 250)),
inline_filename,
segyio.TraceSortingFormat.INLINE_SORTING,
)
request.cls.inline_sort_file = inline_filename # Set by segy_file fixture
xline_filename = "./xlineerror.segy"
test_util.create_segy_file(
lambda il, xl: not ((il < 20 and xl < 125) or (il > 40 and xl > 250)),
xline_filename,
segyio.TraceSortingFormat.CROSSLINE_SORTING,
)
request.cls.crossline_sort_file = xline_filename # Set by segy_file fixture
hole_filename = "./hole.segy"
test_util.create_segy_file(
lambda il, xl: not ((20 < il < 30) and (150 < xl < 250)),
hole_filename,
segyio.TraceSortingFormat.INLINE_SORTING,
)
request.cls.hole_file = hole_filename # Set by segy_file fixture
yield
# teardown code
os.remove(normal_filename)
os.remove(inline_filename)
os.remove(xline_filename)
os.remove(hole_filename)
@pytest.mark.usefixtures("segy_all_files")
class TestSEGYExtract:
control_file = None # Set by segy_file fixture
inline_sort_file = None # Set by segy_file fixture
crossline_sort_file = None # Set by segy_file fixture
hole_file = None # Set by segy_file fixture
@pytest.mark.parametrize(
"filename, trace_count, first_inline, inline_count, first_xline, xline_count, depth",
[
("./normalsegy.segy", 8000, 10, 40, 100, 200, 10),
("./inlineerror.segy", 7309, 10, 40, 125, 200, 10),
("./xlineerror.segy", 7309, 10, 40, 125, 200, 10),
("./hole.segy", 7109, 10, 40, 100, 200, 10),
],
)
def test_get_segy_metadata_should_return_correct_metadata(
self, filename, trace_count, first_inline, inline_count, first_xline, xline_count, depth
):
"""
Check that get_segy_metadata can correctly identify the sorting from the trace headers
:param dict tmpdir: pytest fixture for local test directory cleanup
:param str filename: SEG-Y filename
:param int inline: byte location for inline
:param int xline: byte location for crossline
:param int depth: number of samples
"""
# setup
inline_byte_loc = 189
xline_byte_loc = 193
# test
fast_indexes, slow_indexes, trace_headers, sample_size = segyextract.get_segy_metadata(
filename, inline_byte_loc, xline_byte_loc
)
# validate
assert sample_size == depth
assert len(trace_headers) == trace_count
assert len(fast_indexes) == inline_count
assert len(slow_indexes) == xline_count
# Check fast direction
assert trace_headers["slow"][0] == first_xline
assert trace_headers["fast"][0] == first_inline
@pytest.mark.parametrize(
"filename,inline,xline,depth",
[
("./normalsegy.segy", 40, 200, 10),
("./inlineerror.segy", 40, 200, 10),
("./xlineerror.segy", 40, 200, 10),
("./hole.segy", 40, 200, 10),
],
)
def test_process_segy_data_should_create_cube_size_equal_to_segy(self, tmpdir, filename, inline, xline, depth):
"""
Create single npy file for segy and validate size
:param dict tmpdir: pytest fixture for local test directory cleanup
:param str filename: SEG-Y filename
:param int inline: byte location for inline
:param int xline: byte location for crossline
:param int depth: number of samples
"""
segyextract.process_segy_data_into_single_array(filename, tmpdir.strpath, PREFIX)
npy_files = test_util.get_npy_files(tmpdir.strpath)
assert len(npy_files) == 1
data = np.load(os.path.join(tmpdir.strpath, npy_files[0]))
assert len(data.shape) == 3
assert data.shape[0] == inline
assert data.shape[1] == xline
assert data.shape[2] == depth
def test_process_segy_data_should_write_npy_files_for_n_equals_128_stride_64(self, tmpdir):
"""
Break data up into size n=128 size blocks and validate against original segy
file. This size of block causes the code to write 1 x 4 npy files
:param function tmpdir: pytest fixture for local test directory cleanup
"""
# setup
n_points = 128
stride = 64
# test
segyextract.process_segy_data(FILENAME, tmpdir.strpath, PREFIX, n_points=n_points, stride=stride)
# validate
_output_npy_files_are_correct_for_cube_size(4, 128, tmpdir.strpath)
def test_process_segy_data_should_write_npy_files_for_n_equals_128(self, tmpdir):
"""
Break data up into size n=128 size blocks and validate against original segy
file. This size of block causes the code to write 1 x 4 npy files
:param function tmpdir: pytest fixture for local test directory cleanup
"""
# setup
n_points = 128
# test
segyextract.process_segy_data(FILENAME, tmpdir.strpath, PREFIX)
# validate
npy_files = _output_npy_files_are_correct_for_cube_size(2, 128, tmpdir.strpath)
full_volume_from_file = test_util.build_volume(n_points, npy_files, tmpdir.strpath)
# Validate contents of volume
_compare_variance(FILENAME, PREFIX, full_volume_from_file, tmpdir.strpath)
def test_process_segy_data_should_write_npy_files_for_n_equals_64(self, tmpdir):
"""
Break data up into size n=64 size blocks and validate against original segy
file. This size of block causes the code to write 1 x 8 npy files
:param function tmpdir: pytest fixture for local test directory cleanup
"""
# setup
n_points = 64
expected_file_count = 4
# test
segyextract.process_segy_data(FILENAME, tmpdir.strpath, PREFIX, n_points=n_points, stride=n_points)
# validate
npy_files = _output_npy_files_are_correct_for_cube_size(expected_file_count, n_points, tmpdir.strpath)
full_volume_from_file = test_util.build_volume(n_points, npy_files, tmpdir.strpath)
# Validate contents of volume
_compare_variance(FILENAME, PREFIX, full_volume_from_file, tmpdir.strpath)
def test_process_segy_data_should_write_npy_files_for_n_equals_16(self, tmpdir):
"""
Break data up into size n=16 size blocks and validate against original segy
file. This size of block causes the code to write 2 x 4 x 32 npy files.
:param function tmpdir: pytest fixture for local test directory cleanup
"""
# setup
n_points = 16
# test
segyextract.process_segy_data(FILENAME, tmpdir.strpath, PREFIX, n_points=n_points, stride=n_points)
# validate
npy_files = _output_npy_files_are_correct_for_cube_size(39, 16, tmpdir.strpath)
full_volume_from_file = test_util.build_volume(n_points, npy_files, tmpdir.strpath)
_compare_variance(FILENAME, PREFIX, full_volume_from_file, tmpdir.strpath)
def test_process_npy_file_should_have_same_content_as_segy(self, tmpdir):
"""
Check the actual content of a npy file generated from the segy
:param function tmpdir: pytest fixture for local test directory cleanup
"""
segyextract.process_segy_data_into_single_array(FILENAME, tmpdir.strpath, PREFIX)
npy_files = test_util.get_npy_files(tmpdir.strpath)
assert len(npy_files) == 1
data = np.load(os.path.join(tmpdir.strpath, npy_files[0]))
_compare_output_to_segy(FILENAME, data, 40, 200, 10)
def test_remove_duplicates_should_keep_order(self):
# setup
list_with_dups = [1, 2, 3, 3, 5, 8, 4, 2]
# test
result = segyextract._remove_duplicates(list_with_dups)
# validate
expected_result = [1, 2, 3, 5, 8, 4]
assert all([a == b for a, b in zip(result, expected_result)])
def test_identify_fast_direction_should_handle_xline_sequence_1(self):
# setup
df = pd.DataFrame({"i": [101, 102, 102, 102, 103, 103], "j": [301, 301, 302, 303, 301, 302]})
# test
segyextract._identify_fast_direction(df, "fast", "slow")
# validate
assert df.keys()[0] == "fast"
assert df.keys()[1] == "slow"
def test_identify_fast_direction_should_handle_xline_sequence_2(self):
# setup
df = pd.DataFrame({"i": [101, 102, 102, 102, 102, 102], "j": [301, 301, 302, 303, 304, 305]})
# test
segyextract._identify_fast_direction(df, "fast", "slow")
# validate
assert df.keys()[0] == "fast"
assert df.keys()[1] == "slow"
def _output_npy_files_are_correct_for_cube_size(expected_count, cube_size, outputdir):
"""
Check # of npy files in directory
:param int expected_count: expected # of npy files
:param str outputdir: directory to check for npy files
:param int cube_size: size of cube array
:returns: npy_files in outputdir
:rtype: list
"""
npy_files = test_util.get_npy_files(outputdir)
assert len(npy_files) == expected_count
data = np.load(os.path.join(outputdir, npy_files[0]))
assert len(data.shape) == 3
assert data.shape.count(cube_size) == 3
return npy_files
def _compare_output_to_segy(filename, data, fast_size, slow_size, depth):
"""
Compares each trace in the segy file to the data volume that
was generated from the npy file. This only works when a single npy
is created from a cuboid SEGY; if the dimensions are not aligned, the trace-by-trace comparison will not match.
:param str filename: path to segy file
:param nparray data: data read in from npy files
"""
with segyio.open(filename, ignore_geometry=True) as segy_file:
segy_file.mmap()
segy_sum = np.float32(0.0)
npy_sum = np.float32(0.0)
# Validate that each trace in the segy file is represented in the npy files
# Sum traces in segy and npy to ensure they are correct
for j in range(0, fast_size): # Fast
for i in range(0, slow_size): # Slow
trace = segy_file.trace[i + (j * slow_size)]
data_trace = data[j, i, :]
assert all([a == b for a, b in zip(trace, data_trace)]), f"Unmatched trace at {j}:{i}"
segy_sum += np.sum(trace, dtype=np.float32)
npy_sum += np.sum(data_trace, dtype=np.float32)
assert segy_sum == npy_sum
def _compare_variance(filename, prefix, data, outputdir):
"""
Compares the standard deviation calculated from the full volume to the
standard deviation calculated while creating the npy files
:param str filename: path to segy file
:param str prefix: prefix used to find files
:param nparray data: data read in from npy files
:param str outputdir: location of npy files
"""
with segyio.open(filename, ignore_geometry=True) as segy_file:
segy_file.mmap()
segy_stddev = np.sqrt(np.var(data))
# Check statistics file generated from segy
with open(os.path.join(outputdir, prefix + "_stats.json"), "r") as f:
metadatastr = f.read()
metadata = json.loads(metadatastr)
stddev = float(metadata["stddev"])
assert round(stddev) == round(segy_stddev)


@ -0,0 +1,110 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.
"""
Utility functions for pytest
"""
import numpy as np
import os
import segyio
def is_npy(s):
"""
Filter check for npy files
:param str s: file path
:returns: True if npy
:rtype: bool
"""
if s.find(".npy") == -1:
return False
else:
return True
def get_npy_files(outputdir):
"""
List npy files
:param str outputdir: location of npy files
:returns: npy_files
:rtype: list
"""
npy_files = os.listdir(outputdir)
npy_files = list(filter(is_npy, npy_files))
npy_files.sort()
return npy_files
def build_volume(n_points, npy_files, file_location):
"""
Rebuild volume from npy files. This only works for a vertical column of
npy files. If there is a cube of files, then a new algorithm will be required to
stitch them back together
:param int n_points: size of cube expected in npy_files
:param list npy_files: list of files to load into vertical volume
:param str file_location: directory for npy files to add to array
:returns: numpy array created by stacking the npy_file arrays vertically (third axis)
:rtype: numpy.array
"""
full_volume_from_file = np.zeros((n_points, n_points, n_points * len(npy_files)), dtype=np.float32)
for i, file in enumerate(npy_files):
data = np.load(os.path.join(file_location, file))
full_volume_from_file[:, :, n_points * i : n_points * (i + 1)] = data
return full_volume_from_file
def create_segy_file(
masklambda, filename, sorting=segyio.TraceSortingFormat.INLINE_SORTING, ilinerange=[10, 50], xlinerange=[100, 300]
):
# segyio.spec is the minimum set of values for a valid segy file.
spec = segyio.spec()
spec.sorting = 2
spec.format = 1
spec.samples = range(int(10))
spec.ilines = range(*map(int, ilinerange))
spec.xlines = range(*map(int, xlinerange))
print(f"Written to {filename}")
print(f"\tinlines: {len(spec.ilines)}")
print(f"\tcrosslines: {len(spec.xlines)}")
with segyio.create(filename, spec) as f:
# each inline consists of len(spec.xlines) traces,
# each of which has len(spec.samples) samples
step = 0.00001
start = step * len(spec.samples)
# fill a trace with predictable values: left-of-comma is the inline
# number. Immediately right of comma is the crossline number
# the rightmost digits is the index of the sample in that trace meaning
# looking up an inline's i's jth crosslines' k should be roughly equal
# to i.j0k
trace = np.linspace(-1, 1, len(spec.samples), True, dtype=np.single)
if sorting == segyio.TraceSortingFormat.INLINE_SORTING:
# Write the file trace-by-trace and update headers with iline, xline
# and offset
tr = 0
for il in spec.ilines:
for xl in spec.xlines:
if masklambda(il, xl):
f.header[tr] = {segyio.su.offset: 1, segyio.su.iline: il, segyio.su.xline: xl}
f.trace[tr] = trace * ((xl / 100.0) + il)
tr += 1
f.bin.update(tsort=segyio.TraceSortingFormat.INLINE_SORTING)
else:
# Write the file trace-by-trace and update headers with iline, xline
# and offset
tr = 0
for il in spec.ilines:
for xl in spec.xlines:
if masklambda(il, xl):
f.header[tr] = {segyio.su.offset: 1, segyio.su.iline: il, segyio.su.xline: xl}
f.trace[tr] = trace * (xl / 100.0) + il
tr += 1
f.bin.update(tsort=segyio.TraceSortingFormat.CROSSLINE_SORTING)
# Add some noise for clipping and normalization tests
f.trace[tr // 2] = trace * ((max(spec.xlines) / 100.0) + max(spec.ilines)) * 20
f.trace[tr // 3] = trace * ((min(spec.xlines) / 100.0) + min(spec.ilines)) * 20
print(f"\ttraces: {tr}")


@ -0,0 +1,130 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.
import segyio
import numpy as np
from glob import glob
from os import listdir
import os
import pandas as pd
import re
import matplotlib.pyplot as pyplot
def parse_trace_headers(segyfile, n_traces):
"""
Parse the segy file trace headers into a pandas dataframe.
Column names are defined from segyio internal tracefield
One row per trace
"""
# Get all header keys
headers = segyio.tracefield.keys
# Initialize dataframe with trace id as index and headers as columns
df = pd.DataFrame(index=range(1, n_traces + 1), columns=headers.keys())
# Fill dataframe with all header values
for k, v in headers.items():
df[k] = segyfile.attributes(v)[:]
return df
def parse_text_header(segyfile):
"""
Format segy text header into a readable, clean dict
"""
raw_header = segyio.tools.wrap(segyfile.text[0])
# Cut on C*int pattern
cut_header = re.split(r"C ", raw_header)[1::]
# Remove end of line return
text_header = [x.replace("\n", " ") for x in cut_header]
text_header[-1] = text_header[-1][:-2]
# Format in dict
clean_header = {}
i = 1
for item in text_header:
key = "C" + str(i).rjust(2, "0")
i += 1
clean_header[key] = item
return clean_header
def show_segy_details(segyfile):
with segyio.open(segyfile, ignore_geometry=True) as segy:
segydf = parse_trace_headers(segy, segy.tracecount)
print(f"Loaded from file {segyfile}")
print(f"\tTracecount: {segy.tracecount}")
print(f"\tData Shape: {segydf.shape}")
print(f"\tSample length: {len(segy.samples)}")
pyplot.figure(figsize=(10, 6))
pyplot.scatter(segydf[["INLINE_3D"]], segydf[["CROSSLINE_3D"]], marker=",")
pyplot.xlabel("inline")
pyplot.ylabel("crossline")
pyplot.show()
def load_segy_with_geometry(segyfile):
try:
segy = segyio.open(segyfile, ignore_geometry=False)
segy.mmap()
print(f"Loaded with geometry: {segyfile} :")
print(f"\tNum samples per trace: {len(segy.samples)}")
print(f"\tNum traces in file: {segy.tracecount}")
except ValueError as ex:
print(f"Load failed with geometry: {segyfile} :")
print(ex)
def create_segy_file(
masklambda, filename, sorting=segyio.TraceSortingFormat.INLINE_SORTING, ilinerange=[10, 50], xlinerange=[100, 300]
):
spec = segyio.spec()
# to create a file from nothing, we need to tell segyio about the structure of
# the file, i.e. its inline numbers, crossline numbers, etc. You can also add
# more structural information, but offsets etc. have sensible defaults. This is
# the absolute minimal specification for an N-by-M volume
spec.sorting = 2
spec.format = 1
spec.samples = range(int(10))
spec.ilines = range(*map(int, ilinerange))
spec.xlines = range(*map(int, xlinerange))
print(f"Written to {filename}")
print(f"\tinlines: {len(spec.ilines)}")
print(f"\tcrosslines: {len(spec.xlines)}")
with segyio.create(filename, spec) as f:
# each inline consists of len(spec.xlines) traces,
# each of which has len(spec.samples) samples
step = 0.00001
start = step * len(spec.samples)
# fill a trace with predictable values: left-of-comma is the inline
# number. Immediately right of comma is the crossline number
# the rightmost digits is the index of the sample in that trace meaning
# looking up an inline's i's jth crosslines' k should be roughly equal
# to i.j0k
trace = np.linspace(-1, 1, len(spec.samples), True, dtype=np.single)
if sorting == segyio.TraceSortingFormat.INLINE_SORTING:
# Write the file trace-by-trace and update headers with iline, xline
# and offset
tr = 0
for il in spec.ilines:
for xl in spec.xlines:
if masklambda(il, xl):
f.header[tr] = {segyio.su.offset: 1, segyio.su.iline: il, segyio.su.xline: xl}
f.trace[tr] = trace * ((xl / 100.0) + il)
tr += 1
f.bin.update(tsort=segyio.TraceSortingFormat.CROSSLINE_SORTING)
else:
# Write the file trace-by-trace and update headers with iline, xline
# and offset
tr = 0
for il in spec.ilines:
for xl in spec.xlines:
if masklambda(il, xl):
f.header[tr] = {segyio.su.offset: 1, segyio.su.iline: il, segyio.su.xline: xl}
f.trace[tr] = trace + (xl / 100.0) + il
tr += 1
f.bin.update(tsort=segyio.TraceSortingFormat.INLINE_SORTING)
print(f"\ttraces: {tr}")


@ -0,0 +1,140 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.
"""
Utility Script to normalize one cube
"""
import numpy as np
def compute_statistics(stddev: float, mean: float, max_range: float, k: int):
"""
Compute min_clip, max_clip and scale values based on the provided stddev, mean, max_range and k values
:param stddev: standard deviation value
:param mean: mean value of the data
:param max_range: maximum value range
:param k: number of standard deviations to be used in normalization
:returns: min_clip, max_clip, scale: computed values
:rtype: float, float, float
"""
min_clip = mean - k * stddev
max_clip = mean + k * stddev
scale = max_range / (max_clip - min_clip)
return min_clip, max_clip, scale
def clip_value(v: float, min_clip: float, max_clip: float):
"""
Clip seismic voxel value
:param min_clip: minimum value used for clipping
:param max_clip: maximum value used for clipping
:returns: clipped value, must be within [min_clip, max_clip]
:rtype: float
"""
# Clip value
if v > max_clip:
v = max_clip
if v < min_clip:
v = min_clip
return v
def norm_value(v: float, min_clip: float, max_clip: float, min_range: float, max_range: float, scale: float):
"""
Normalize seismic voxel value to be within [min_range, max_range] according to
statistics computed previously
:param v: value to be normalized
:param min_clip: minimum value used for clipping
:param max_clip: maximum value used for clipping
:param min_range: minimum range value
:param max_range: maximum range value
:param scale: scale value to be used for normalization
:returns: normalized value, must be within [min_range, max_range]
:rtype: float
"""
offset = -1 * min_clip # Normalizing - set values between 0 and 1
# Clip value
v = clip_value(v, min_clip, max_clip)
# Scale value
v = (v + offset) * scale
# This value should ALWAYS be between min_range and max_range here
if v > max_range or v < min_range:
raise Exception(
"normalized value should be within [{0},{1}].\
The value was: {2}".format(
min_range, max_range, v
)
)
return v
def normalize_cube(cube: np.array, min_clip: float, max_clip: float, scale: float, min_range: float, max_range: float):
"""
Normalize cube according to statistics. Normalization implies clipping the cube before scaling.
:param cube: 3D array to be normalized
:param min_clip: minimum value used for clipping
:param max_clip: maximum value used for clipping
:param min_range: minimum range value
:param max_range: maximum range value
:param scale: scale value to be used for normalization
:returns: normalized 3D array
:rtype: numpy array
"""
# Define function for normalization
vfunc = np.vectorize(norm_value)
# Normalize cube
norm_cube = vfunc(cube, min_clip=min_clip, max_clip=max_clip, min_range=min_range, max_range=max_range, scale=scale)
return norm_cube
def clip_cube(cube: np.array, min_clip: float, max_clip: float):
"""
Clip cube values according to statistics
:param min_clip: minimum value used for clipping
:param max_clip: maximum value used for clipping
:returns: clipped 3D array
:rtype: numpy array
"""
# Define function for clipping
vfunc = np.vectorize(clip_value)
clip_cube = vfunc(cube, min_clip=min_clip, max_clip=max_clip)
return clip_cube
def apply(
cube: np.array, stddev: float, mean: float, k: float, min_range: float, max_range: float, clip=True, normalize=True
):
"""
Prepare data according to provided parameters. This method will compute statistics and can
normalize and clip, just clip, or leave the data as is.
:param cube: 3D array to be normalized
:param stddev: standard deviation value
:param mean: mean value of the data
:param k: number of standard deviations to be used in normalization
:param min_range: minimum range value
:param max_range: maximum range value
:param clip: flag to turn on/off clip
:param normalize: flag to turn on/off normalization.
:returns: processed 3D array
:rtype: numpy array
"""
if np.isnan(np.min(cube)):
raise Exception("Cube has NaN value")
if stddev == 0.0:
raise Exception("Standard deviation must not be zero")
if k == 0:
raise Exception("k must not be zero")
# Compute statistics
min_clip, max_clip, scale = compute_statistics(stddev=stddev, mean=mean, k=k, max_range=max_range)
if normalize:
# Normalize and clip cube. Note that it is not possible to normalize data without
# applying the clip operation
print("Normalizing and Clipping File")
return normalize_cube(
cube=cube, min_clip=min_clip, max_clip=max_clip, scale=scale, min_range=min_range, max_range=max_range
)
elif clip:
# Only clip values
print("Clipping File")
return clip_cube(cube=cube, min_clip=min_clip, max_clip=max_clip)
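A minimal usage sketch of the helpers above; the random cube and k value are illustrative. apply() derives clip bounds from the provided statistics and returns values in [min_range, max_range]:
import numpy as np
from deepseismic_interpretation.segyconverter.utils import dataprep

cube = np.random.default_rng(0).normal(0.0, 10.0, size=(32, 32, 32)).astype(np.float32)
stddev, mean = float(np.std(cube)), float(np.mean(cube))
normalized = dataprep.apply(cube, stddev, mean, k=12, min_range=0.0, max_range=1.0)
assert 0.0 <= normalized.min() and normalized.max() <= 1.0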


@ -0,0 +1,344 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.
"""
Methods for processing segy files that do not include well formed geometry. In these cases, segyio
cannot infer the 3D volume of data from the traces so this module needs to do that manually
"""
import os
import math
import segyio
import pandas as pd
import numpy as np
import json
# strings which indicate which slice direction has fewer datapoints, i.e. faster to iterate through
FAST = "fast"
SLOW = "slow"
DEFAULT_VALUE = 255
def get_segy_metadata(input_file, iline, xline):
"""
Loads segy file and uses the input inline and crossline byte values to load
the trace headers. It determines which inline or crossline the traces
start with. SEGY files can be non standard and use other byte location for
these values. In that case, the data from this method will be erroneous. It
is up to the user to figure out which numbers to use by reading the SEGY text
header and finding the byte offsets visually.
:param str input_file: path to segy file
:param int iline: inline byte position
:param int xline: crossline byte position
:returns: fast_distinct, slow_distinct, trace_headers, samplesize
:rtype: list, DataFrame, DataFrame, int
"""
with segyio.open(input_file, ignore_geometry=True) as segy_file:
segy_file.mmap()
# Initialize df with trace id as index and headers as columns
trace_headers = pd.DataFrame(index=range(0, segy_file.tracecount), columns=["i", "j"])
# Fill dataframe with all trace headers values
trace_headers["i"] = segy_file.attributes(iline)
trace_headers["j"] = segy_file.attributes(xline)
_identify_fast_direction(trace_headers, FAST, SLOW)
samplesize = len(segy_file.samples)
fast_distinct = _remove_duplicates(trace_headers[FAST])
slow_distinct = np.unique(trace_headers[SLOW])
return fast_distinct, slow_distinct, trace_headers, samplesize
def process_segy_data_into_single_array(input_file, output_dir, prefix, iline=189, xline=193):
"""
Open segy file and write all data to single npy array
:param str input_file: path to segyfile
:param str output_dir: path to directory where npy files will be written/rewritten
:param str prefix: prefix to use when writing npy files
:param int iline: iline header byte location
:param int xline: crossline header byte location
:returns: 3-dimensional numpy array of SEGY data
:rtype: nparray
"""
fast_distinct, slow_distinct, trace_headers, sampledepth = get_segy_metadata(input_file, iline, xline)
with segyio.open(input_file, ignore_geometry=True) as segy_file:
segy_file.mmap()
fast_line_space = abs(fast_distinct[1] - fast_distinct[0])
slow_line_space = abs(slow_distinct[0] - slow_distinct[1])
sample_size = len(segy_file.samples)
layer_fastmax = max(fast_distinct)
layer_fastmin = min(fast_distinct)
layer_slowmax = max(slow_distinct)
layer_slowmin = min(slow_distinct)
layer_trace_ids = trace_headers[
(trace_headers.fast >= layer_fastmin)
& (trace_headers.fast <= layer_fastmax)
& (trace_headers.slow >= layer_slowmin)
& (trace_headers.slow <= layer_slowmax)
]
block = np.full((len(fast_distinct), len(slow_distinct), sampledepth), DEFAULT_VALUE, dtype=np.float32)
for _, row in layer_trace_ids.iterrows():
block[
(row[FAST] - layer_fastmin) // fast_line_space,
(row[SLOW] - layer_slowmin) // slow_line_space,
0:sample_size,
] = segy_file.trace[row.name]
np.save(
os.path.join(output_dir, "{}_{}_{}_{:05d}".format(prefix, fast_distinct[0], slow_distinct[0], 0)), block
)
variance = np.var(block)
stddev = np.sqrt(variance)
mean = np.mean(block)
with open(os.path.join(output_dir, prefix + "_stats.json"), "w") as f:
f.write(json.dumps({"stddev": str(stddev), "mean": str(mean)}))
print("Npy files written: 1")
return block
def process_segy_data(input_file, output_dir, prefix, iline=189, xline=193, n_points=128, stride=128):
"""
Open segy file and write all numpy array files to disk
:param str input_file: path to segyfile
:param str output_dir: path to directory where npy files will be written/rewritten
:param str prefix: prefix to use when writing npy files
:param int iline: iline header byte location
:param int xline: crossline header byte location
:param int n_points: output cube size
:param int stride: stride when writing data
"""
fast_indexes, slow_indexes, trace_headers, _ = get_segy_metadata(input_file, iline, xline)
with segyio.open(input_file, ignore_geometry=True) as segy_file:
segy_file.mmap()
# Global variance of segy data
variance = 0
mean = 0
sample_count = 0
filecount = 0
block_size = n_points ** 3
for block, i, j, k in _generate_all_blocks(
segy_file, n_points, stride, fast_indexes, slow_indexes, trace_headers
):
# Getting global variance as sum of local variance
if variance == 0:
# init
variance = np.var(block)
mean = np.mean(block)
sample_count = block_size
else:
new_avg = np.mean(block)
new_variance = np.var(block)
variance = _parallel_variance(mean, sample_count, variance, new_avg, block_size, new_variance)
mean = ((mean * sample_count) + np.sum(block)) / (sample_count + block_size)
sample_count += block_size
np.save(os.path.join(output_dir, "{}_{}_{}_{:05d}".format(prefix, i, j, k)), block)
filecount += 1
stddev = np.sqrt(variance)
with open(os.path.join(output_dir, prefix + "_stats.json"), "w") as f:
f.write(json.dumps({"stddev": stddev, "mean": mean}))
print("Npy files written: {}".format(filecount))
def process_segy_data_column(input_file, output_dir, prefix, i, j, iline=189, xline=193, n_points=128, stride=128):
"""
Open segy file and write one column of npy files to disk
:param str input_file: segy file path
:param str output_dir: local output directory for npy files
:param str prefix: naming prefix for npy files
:param int i: fast index anchor for the column to extract
:param int j: slow index anchor for the column to extract
:param int iline: header byte location for inline
:param int xline: header byte location for crossline
:param int n_points: size of cube
:param int stride: stride for generating cubes
"""
fast_indexes, slow_indexes, trace_headers, _ = get_segy_metadata(input_file, iline, xline)
with segyio.open(input_file, ignore_geometry=True) as segy_file:
segy_file.mmap()
filecount = 0
for block, i, j, k in _generate_column_blocks(
segy_file, n_points, stride, i, j, fast_indexes, slow_indexes, trace_headers
):
np.save(os.path.join(output_dir, "{}_{}_{}_{}".format(prefix, i, j, k)), block)
filecount += 1
print("Files written: {}".format(filecount))
def _parallel_variance(avg_a, count_a, var_a, avg_b, count_b, var_b):
"""
Calculate the combined variance based on the previously calculated variance
:param float avg_a: running average
:param float count_a: running sample count
:param float var_a: running variance
:param float avg_b: new block average
:param float count_b: new block sample count
:param float var_b: new block variance
:returns: new variance
:rtype: float
"""
delta = avg_b - avg_a
m_a = var_a * (count_a - 1)
m_b = var_b * (count_b - 1)
M2 = m_a + m_b + delta ** 2 * count_a * count_b / (count_a + count_b)
return M2 / (count_a + count_b - 1)
def _identify_fast_direction(trace_headers, fastlabel, slowlabel):
"""
Relabels the dataframe columns in place as 'fast' and 'slow'
Uses the count of changes in indexes for both columns to determine which one is the fast index
:param DataFrame trace_headers: dataframe with two columns
:param str fastlabel: key label for the fast index
:param str slowlabel: key label for the slow index
"""
j_count = 0
i_count = 0
last_trace = 0
slope_run = 5
for trace in trace_headers["j"][0:slope_run]:
if not last_trace == trace:
j_count += 1
last_trace = trace
last_trace = 0
for trace in trace_headers["i"][0:slope_run]:
if not last_trace == trace:
i_count += 1
last_trace = trace
if i_count < j_count:
trace_headers.columns = [fastlabel, slowlabel]
else:
trace_headers.columns = [slowlabel, fastlabel]
def _remove_duplicates(list_of_elements):
"""
Remove duplicates from a list but maintain the order
:param list list_of_elements: list to be deduped
:returns: list containing a distinct list of elements
:rtype: list
"""
seen = set()
return [x for x in list_of_elements if not (x in seen or seen.add(x))]
def _get_trace_column(n_lines, i, j, trace_headers, fast_distinct, slow_distinct, segyfile):
"""
:param int n_lines: number of voxels to extract in each dimension
:param int i: fast index anchor for origin of column
:param int j: slow index anchor for origin of column
:param DataFrame trace_headers: DataFrame of all trace headers
:param list fast_distinct: list of distinct fast headers
:param list slow_distinct: list of distinct slow headers
:param segyio.file segyfile: segy file object previously opened using segyio
:returns: thiscolumn, layer_fastmin, layer_slowmin
:rtype: nparray, int, int
"""
layer_fastidxs = fast_distinct[i : i + n_lines]
fast_line_space = abs(fast_distinct[1] - fast_distinct[0])
layer_slowidxs = slow_distinct[j : j + n_lines]
slow_line_space = abs(slow_distinct[0] - slow_distinct[1])
sample_size = len(segyfile.samples)
sample_chunck_count = math.ceil(sample_size / n_lines)
layer_fastmax = max(layer_fastidxs)
layer_fastmin = min(layer_fastidxs)
layer_slowmax = max(layer_slowidxs)
layer_slowmin = min(layer_slowidxs)
layer_trace_ids = trace_headers[
(trace_headers.fast >= layer_fastmin)
& (trace_headers.fast <= layer_fastmax)
& (trace_headers.slow >= layer_slowmin)
& (trace_headers.slow <= layer_slowmax)
]
thiscolumn = np.zeros((n_lines, n_lines, sample_chunck_count * n_lines), dtype=np.float32)
for _, row in layer_trace_ids.iterrows():
thiscolumn[
(row[FAST] - layer_fastmin) // fast_line_space,
(row[SLOW] - layer_slowmin) // slow_line_space,
0:sample_size,
] = segyfile.trace[row.name]
return thiscolumn, layer_fastmin, layer_slowmin
def _generate_column_blocks(segy_file, n_points, stride, i, j, fast_indexes, slow_indexes, trace_headers):
"""
Generate arrays for an open segy file (via segyio)
:param segyio.file segy_file: input segy file previously opened using segyio
:param int n_points: number of voxels to extract in each dimension
:param int stride: overlap for output cubes
:param int i: fast index anchor for origin of column
:param int j: slow index anchor for origin of column
:param list fast_indexes: list of distinct fast headers
:param list slow_indexes: list of distinct slow headers
:param DataFrame trace_headers: trace headers including fast and slow indexes
:returns: thiscolumn, fast_anchor, slow_anchor, k
:rtype: nparray, int, int, int
"""
sample_size = len(segy_file.samples)
thiscolumn, fast_anchor, slow_anchor = _get_trace_column(
n_points, i, j, trace_headers, fast_indexes, slow_indexes, segy_file
)
for k in range(0, sample_size - stride, stride):
yield thiscolumn[i : (i + n_points), j : (j + n_points), k : (k + n_points)], fast_anchor, slow_anchor, k
def _generate_all_blocks(segy_file, n_points, stride, fast_indexes, slow_indexes, trace_headers):
"""
Generate arrays for an open segy file (via segyio)
:param segyio.file segy_file: input segy file previously opened using segyio
:param int n_points: number of voxels to extract in each dimension
:param int stride: overlap for output cubes
:param list fast_indexes: list of distinct fast headers
:param list slow_indexes: list of distinct slow headers
:param DataFrame trace_headers: trace headers including fast and slow indexes
:returns: thiscolumn, fast_anchor, slow_anchor, k
:rtype: nparray, int, int, int
"""
slow_size = len(slow_indexes)
fast_size = len(fast_indexes)
sample_size = len(segy_file.samples)
# Handle edge case when stride is larger than slow_size and fast_size
fast_lim = fast_size
slow_lim = slow_size
for i in range(0, fast_lim, stride):
for j in range(0, slow_lim, stride):
thiscolumn, fast_anchor, slow_anchor = _get_trace_column(
n_points, i, j, trace_headers, fast_indexes, slow_indexes, segy_file
)
for k in range(0, sample_size, stride):
yield thiscolumn[:, :, k : (k + n_points)], fast_anchor, slow_anchor, k
def timewrapper(func, *args, **kwargs):
"""
Utility function to pass arguments while using the timeit module
:param function func: function to wrap
:param args: parameters accepted by func
:param kwargs: optional parameters accepted by func
:returns: wrapped
:rtype: function
"""
def wrapped():
"""
Wrapper function that takes no arguments
:returns: result of func(*args, **kwargs)
"""
return func(*args, **kwargs)
return wrapped
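An illustrative sanity check of the running-variance combination used in process_segy_data: merging two blocks' statistics with the private helper _parallel_variance should approximately agree with the variance of the concatenated data. The arrays here are made up:
import numpy as np
from deepseismic_interpretation.segyconverter.utils import segyextract

a = np.arange(1000, dtype=np.float32)
b = np.arange(1000, 3000, dtype=np.float32)
combined = segyextract._parallel_variance(a.mean(), a.size, np.var(a), b.mean(), b.size, np.var(b))
assert np.isclose(combined, np.var(np.concatenate([a, b])), rtol=1e-2)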


@ -1,3 +1,6 @@
numpy>=1.17.0
azure-cli-core
azureml-sdk==1.0.74
azureml-sdk==1.0.83
azureml-contrib-pipeline-steps==1.0.83
azureml-contrib-services==1.0.83
python-dotenv==0.10.5


@ -0,0 +1,25 @@
{
"step1": {
"type": "MpiStep",
"name": "train step",
"script": "train.py",
"input_datareference_path": "data/",
"input_datareference_name": "ds_test",
"input_dataset_name": "deepseismic_test_dataset",
"source_directory": "experiments/interpretation/dutchf3_patch",
"arguments": [
"--cfg",
"configs/unet.yaml",
"TRAIN.END_EPOCH",
"1",
"TRAIN.SNAPSHOTS",
"1",
"DATASET.ROOT",
"data"
],
"requirements": "experiments/interpretation/dutchf3_patch/azureml_requirements.txt",
"node_count": 1,
"processes_per_node": 1,
"base_image": "pytorch/pytorch"
}
}
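A minimal sketch, mirroring the integration test below, of how a config like this might drive the AzureML training pipeline; the config path and experiment name are illustrative:
from deepseismic_interpretation.azureml_pipelines.train_pipeline import TrainPipeline

orchestrator = TrainPipeline("experiments/interpretation/dutchf3_patch/pipeline_config.json")  # assumed path
orchestrator.construct_pipeline()
run = orchestrator.run_pipeline(experiment_name="DEV-train-pipeline")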


@ -0,0 +1,97 @@
"""
Integration tests for the train pipeline
"""
import pytest
from deepseismic_interpretation.azureml_pipelines.train_pipeline import TrainPipeline
import json
import os
TEMP_CONFIG_FILE = "test_batch_config.json"
test_data = None
class TestTrainPipelineIntegration:
"""
Class used for testing the training pipeline
"""
global test_data
test_data = {
"step1": {
"type": "MpiStep",
"name": "train step",
"script": "train.py",
"input_datareference_path": "data/",
"input_datareference_name": "training_data",
"input_dataset_name": "f3_data",
"source_directory": "experiments/interpretation/dutchf3_patch",
"arguments": ["TRAIN.END_EPOCH", "1"],
"requirements": "requirements.txt",
"node_count": 1,
"processes_per_node": 1,
}
}
@pytest.fixture(scope="function", autouse=True)
def teardown(self):
yield
if hasattr(self, "run"):
self.run.cancel()
os.remove(TEMP_CONFIG_FILE)
def test_train_pipeline_expected_inputs_submits_correctly(self):
# arrange
self._setup_test_config()
orchestrator = TrainPipeline(
"interpretation/tests/example_config.json"
) # updated this to be an example of our config
# act
orchestrator.construct_pipeline()
self.run = orchestrator.run_pipeline(experiment_name="TEST-train-pipeline")
# assert
assert self.run.get_status() in ("Running", "NotStarted")
@pytest.mark.parametrize(
"step,missing_dependency",
[
("step1", "name"),
("step1", "type"),
("step1", "input_datareference_name"),
("step1", "input_datareference_path"),
("step1", "input_dataset_name"),
("step1", "source_directory"),
("step1", "script"),
("step1", "arguments"),
],
)
def test_missing_dependency_in_config_throws_error(self, step, missing_dependency):
# iterates through all config dependencies, leaving each one out in turn
# arrange
self.data = test_data
self._create_config_without(step, missing_dependency)
self._setup_test_config()
orchestrator = TrainPipeline(self.test_config)
# act / assert
with pytest.raises(KeyError):
orchestrator.construct_pipeline()
def _create_config_without(self, step, dependency_to_omit):
"""
helper function that removes dependencies from config file
:param str step: name of the step with omitted dependency
:param str dependency_to_omit: the dependency you want to omit from the config
"""
self.data[step].pop(dependency_to_omit, None)
def _setup_test_config(self):
"""
helper function that saves the test data in a temp config file
"""
self.data = test_data
self.test_config = TEMP_CONFIG_FILE
with open(self.test_config, "w") as data_file:
json.dump(self.data, data_file)


@ -1,4 +1,6 @@
#!/bin/bash
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.
# autoformats all files in the repo to black


@ -0,0 +1,151 @@
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.
"""
Run example:
python byod_penobscot.py --filename <input HDF5 file> --outdir <where to output data>
python prepare_dutchf3.py split_train_val patch --data_dir=<outdir from the previous step> --label_file=train/train_labels.npy --output_dir=splits --stride=50 --patch_size=100 --split_direction=both --section_stride=100
"""
import sklearn
""" libraries """
import h5py
import numpy as np
import os
np.set_printoptions(linewidth=200)
import logging
# toggle to WARNING when running in production, or use CLI
logging.getLogger().setLevel(logging.DEBUG)
# logging.getLogger().setLevel(logging.WARNING)
import argparse
parser = argparse.ArgumentParser()
""" useful information when running from a GIT folder."""
myname = os.path.realpath(__file__)
mypath = os.path.dirname(myname)
myname = os.path.basename(myname)
def main(args):
"""
Transforms Penobscot HDF5 dataset into DeepSeismic Tensor Format
"""
logging.info("loading data")
f = h5py.File(args.filename, "r")
data = f["features"][:, :, :, 0]
labels = f["label"][:, :, :]
assert labels.min() == 0
n_classes = labels.max() + 1
assert n_classes == N_CLASSES
# inline x depth x crossline, make it inline x crossline x depth
data = np.swapaxes(data, 1, 2)
labels = np.swapaxes(labels, 1, 2)
# Make data cube fast to access
data = np.ascontiguousarray(data, "float32")
labels = np.ascontiguousarray(labels, "uint8")
# combine classes 4 and 5 (index 3 and 4)- shift others down
labels[labels > 3] -= 1
# rescale to be within a certain range
range_min, range_max = -1.0, 1.0
data_std = (data - data.min()) / (data.max() - data.min())
data = data_std * (range_max - range_min) + range_min
"""
# cut off a buffer zone around the volume (to avoid mislabeled data):
buffer = 25
data = data[:, buffer:-buffer, buffer:-buffer]
labels = labels[:, buffer:-buffer, buffer:-buffer]
"""
# data is now inlines by crosslines by depth
n_inlines = data.shape[0]
n_crosslines = data.shape[1]
inline_cut = int(np.floor(n_inlines * INLINE_FRACTION))
crossline_cut = int(np.floor(n_crosslines * CROSSLINE_FRACTION))
data_train = data[0:inline_cut, 0:crossline_cut, :]
data_test1 = data[inline_cut:n_inlines, :, :]
data_test2 = data[:, crossline_cut:n_crosslines, :]
labels_train = labels[0:inline_cut, 0:crossline_cut, :]
labels_test1 = labels[inline_cut:n_inlines, :, :]
labels_test2 = labels[:, crossline_cut:n_crosslines, :]
def mkdir(dirname):
if os.path.isdir(dirname) and os.path.exists(dirname):
return
if not os.path.isdir(dirname) and os.path.exists(dirname):
logging.info("remote file", dirname, "and run this script again")
os.mkdir(dirname)
mkdir(args.outdir)
mkdir(os.path.join(args.outdir, "splits"))
mkdir(os.path.join(args.outdir, "train"))
mkdir(os.path.join(args.outdir, "test_once"))
np.save(os.path.join(args.outdir, "train", "train_seismic.npy"), data_train)
np.save(os.path.join(args.outdir, "train", "train_labels.npy"), labels_train)
np.save(os.path.join(args.outdir, "test_once", "test1_seismic.npy"), data_test1)
np.save(os.path.join(args.outdir, "test_once", "test1_labels.npy"), labels_test1)
np.save(os.path.join(args.outdir, "test_once", "test2_seismic.npy"), data_test2)
np.save(os.path.join(args.outdir, "test_once", "test2_labels.npy"), labels_test2)
# Compute class weights:
num_classes, class_count = np.unique(labels[:], return_counts=True)
# class_probabilities = np.histogram(labels[:], bins= , density=True)
class_weights = 1 - class_count / np.sum(class_count)
logging.info("CLASS WEIGHTS TO USE")
logging.info(class_weights)
logging.info("MEAN")
logging.info(data.mean())
logging.info("STANDARD DEVIATION")
logging.info(data.std())
""" GLOBAL VARIABLES """
INLINE_FRACTION = 0.7
CROSSLINE_FRACTION = 1.0
N_CLASSES = 8
parser.add_argument("--filename", help="Name of HDF5 data", type=str, required=True)
parser.add_argument("--outdir", help="Output data directory location", type=str, required=True)
""" main wrapper with profiler """
if __name__ == "__main__":
main(parser.parse_args())
# pretty printing of the stack
"""
try:
logging.info('before main')
main(parser.parse_args())
logging.info('after main')
except:
for frame in traceback.extract_tb(sys.exc_info()[2]):
fname,lineno,fn,text = frame
print ("Error in %s on line %d" % (fname, lineno))
"""
# optionally enable profiling information
# import cProfile
# name = <insert_name_here>
# cProfile.run('main.run()', name + '.prof')
# import pstats
# p = pstats.Stats(name + '.prof')
# p.sort_stats('cumulative').print_stats(10)
# p.sort_stats('time').print_stats()
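The rescaling step in main() above is plain min-max scaling into [range_min, range_max]; a standalone sketch for illustration:
import numpy as np

def rescale(data: np.ndarray, range_min: float = -1.0, range_max: float = 1.0) -> np.ndarray:
    # min-max scale into [range_min, range_max], as done for the Penobscot volume above
    data_std = (data - data.min()) / (data.max() - data.min())
    return data_std * (range_max - range_min) + range_min

assert np.allclose(rescale(np.array([0.0, 5.0, 10.0])), [-1.0, 0.0, 1.0])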


@ -1,4 +1,6 @@
#!/usr/bin/env python3
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.
""" Please see the def main() function for code description."""
import time
@ -10,6 +12,8 @@ import sys
import yaml
import subprocess
from datetime import datetime
np.set_printoptions(linewidth=200)
import logging
@ -32,6 +36,8 @@ def main(args):
add --setup to run it (destroys existing environment and creates a new one, along with all the data)
"""
beg = datetime.now()
logging.info("loading data")
@ -41,7 +47,7 @@ def main(args):
logging.info(f"Loaded {file}")
# run single job
job_names = [x["job"] for x in list["jobs"]] if not args.job else args.job.split(',')
job_names = [x["job"] for x in list["jobs"]] if not args.job else args.job.split(",")
if not args.setup and "setup" in job_names:
job_names.remove("setup")
@ -52,7 +58,7 @@ def main(args):
current_env = os.environ.copy()
# modify for conda to work
# TODO: not sure why on DS VM this does not get picked up from the standard environment
current_env["PATH"] = PATH_PREFIX+":"+current_env["PATH"]
current_env["PATH"] = PATH_PREFIX + ":" + current_env["PATH"]
for job in job_list:
job_name = job["job"]
@ -74,16 +80,16 @@ def main(args):
stderr=subprocess.STDOUT,
executable=current_env["SHELL"],
env=current_env,
cwd=os.getcwd()
cwd=os.getcwd(),
)
toc = time.perf_counter()
print(f"Job time took {(toc-tic)/60:0.2f} minutes")
except subprocess.CalledProcessError as err:
logging.info(f'ERROR: \n{err}')
decoded_stdout = err.stdout.decode('utf-8')
logging.info(f"ERROR: \n{err}")
decoded_stdout = err.stdout.decode("utf-8")
log_file = "dev_build.latest_error.log"
logging.info(f"Have {len(err.stdout)} output bytes in {log_file}")
with open(log_file, 'w') as log_file:
with open(log_file, "w") as log_file:
log_file.write(decoded_stdout)
sys.exit()
else:
@ -92,8 +98,13 @@ def main(args):
logging.info(f"Everything ran! You can try running the same jobs {job_names} on the build VM now")
end = datetime.now()
print("time elapsed in seconds", (end - beg).total_seconds())
""" GLOBAL VARIABLES """
PATH_PREFIX = "/data/anaconda/envs/seismic-interpretation/bin:/data/anaconda/bin"
PATH_PREFIX = "/anaconda/envs/seismic-interpretation/bin:/anaconda/bin"
parser.add_argument(
"--file", help="Which yaml file you'd like to read which specifies build info", type=str, required=True


@ -1,4 +1,6 @@
#!/bin/bash
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.
conda env remove -n seismic-interpretation
yes | conda env create -f environment/anaconda/local/environment.yml


@ -1,4 +1,7 @@
#!/usr/bin/env python3
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.
""" Please see the def main() function for code description."""
""" libraries """
@ -83,14 +86,14 @@ def make_gradient(n_inlines, n_crosslines, n_depth, box_size, dir="inline"):
:return: numpy array
"""
orthogonal_dir = dir # for depth case
if dir=='inline':
orthogonal_dir = 'crossline'
elif dir=='crossline':
orthogonal_dir = 'inline'
orthogonal_dir = dir # for depth case
if dir == "inline":
orthogonal_dir = "crossline"
elif dir == "crossline":
orthogonal_dir = "inline"
axis = GRADIENT_DIR.index(orthogonal_dir)
n_points = (n_inlines, n_crosslines, n_depth)[axis]
n_classes = int(np.ceil(float(n_points) / box_size))
logging.info(f"GRADIENT: we will output {n_classes} classes in the {dir} direction")
@ -127,35 +130,57 @@ def main(args):
logging.info("loading data")
train_seismic = np.load(os.path.join(args.dataroot, "train", "train_seismic.npy"))
train_labels = np.load(os.path.join(args.dataroot, "train", "train_labels.npy"))
test1_seismic = np.load(os.path.join(args.dataroot, "test_once", "test1_seismic.npy"))
test1_labels = np.load(os.path.join(args.dataroot, "test_once", "test1_labels.npy"))
test2_seismic = np.load(os.path.join(args.dataroot, "test_once", "test2_seismic.npy"))
test2_labels = np.load(os.path.join(args.dataroot, "test_once", "test2_labels.npy"))
# TODO: extend this to binary and gradient
if args.type != "checkerboard":
assert args.based_on == "dutch_f3"
assert train_seismic.shape == train_labels.shape
assert train_seismic.min() == WHITE
assert train_seismic.max() == BLACK
assert train_labels.min() == 0
# this is the number of classes in Alaudah's Dutch F3 dataset
assert train_labels.max() == 5
logging.info(f"synthetic data generation based on {args.based_on}")
assert test1_seismic.shape == test1_labels.shape
assert test1_seismic.min() == WHITE
assert test1_seismic.max() == BLACK
assert test1_labels.min() == 0
# this is the number of classes in Alaudah's Dutch F3 dataset
assert test1_labels.max() == 5
if args.based_on == "dutch_f3":
assert test2_seismic.shape == test2_labels.shape
assert test2_seismic.min() == WHITE
assert test2_seismic.max() == BLACK
assert test2_labels.min() == 0
# this is the number of classes in Alaudah's Dutch F3 dataset
assert test2_labels.max() == 5
train_seismic = np.load(os.path.join(args.dataroot, "train", "train_seismic.npy"))
train_labels = np.load(os.path.join(args.dataroot, "train", "train_labels.npy"))
test1_seismic = np.load(os.path.join(args.dataroot, "test_once", "test1_seismic.npy"))
test1_labels = np.load(os.path.join(args.dataroot, "test_once", "test1_labels.npy"))
test2_seismic = np.load(os.path.join(args.dataroot, "test_once", "test2_seismic.npy"))
test2_labels = np.load(os.path.join(args.dataroot, "test_once", "test2_labels.npy"))
assert train_seismic.shape == train_labels.shape
assert train_seismic.min() == WHITE
assert train_seismic.max() == BLACK
assert train_labels.min() == 0
# this is the number of classes in Alaudah's Dutch F3 dataset
assert train_labels.max() == 5
assert test1_seismic.shape == test1_labels.shape
assert test1_seismic.min() == WHITE
assert test1_seismic.max() == BLACK
assert test1_labels.min() == 0
# this is the number of classes in Alaudah's Dutch F3 dataset
assert test1_labels.max() == 5
assert test2_seismic.shape == test2_labels.shape
assert test2_seismic.min() == WHITE
assert test2_seismic.max() == BLACK
assert test2_labels.min() == 0
# this is the number of classes in Alaudah's Dutch F3 dataset
assert test2_labels.max() == 5
elif args.based_on == "fixed_box_number":
logging.info(f"box_number is {args.box_number}")
logging.info(f"box_size is {args.box_size}")
# Note: this assumes the data is 3D; to support higher dimensions, this (and other parts of this script)
# must be refactored
synthetic_shape = (int(args.box_number * args.box_size),) * 3
train_seismic = np.ones(synthetic_shape, dtype=float)
train_labels = np.ones(synthetic_shape, dtype=int)
test1_seismic = train_seismic
test1_labels = train_labels
test2_seismic = train_seismic
test2_labels = train_labels
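A quick worked example of the `fixed_box_number` branch above: the synthetic cube is sized so that exactly `box_number` boxes of `box_size` voxels fit along each of the three axes. The values below are the argparse defaults from this diff.

```python
# Hedged sketch: synthetic cube shape for --based_on fixed_box_number.
import numpy as np

box_number, box_size = 2, 100                       # defaults from the argparse section
synthetic_shape = (int(box_number * box_size),) * 3
train_seismic = np.ones(synthetic_shape, dtype=float)
train_labels = np.ones(synthetic_shape, dtype=int)
print(synthetic_shape)                              # (200, 200, 200)
```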
if args.type == "checkerboard":
logging.info("train checkerbox")
n_inlines, n_crosslines, n_depth = train_seismic.shape
checkerboard_train_seismic = make_box(n_inlines, n_crosslines, n_depth, args.box_size)
@ -163,23 +188,26 @@ def main(args):
checkerboard_train_labels = checkerboard_train_seismic.astype(train_labels.dtype)
# labels are integers and start from zero
checkerboard_train_labels[checkerboard_train_seismic < WHITE_LABEL] = WHITE_LABEL
logging.info(f"training data shape {checkerboard_train_seismic.shape}")
# create checkerboard
logging.info("test1 checkerboard")
n_inlines, n_crosslines, n_depth = test1_seismic.shape
checkerboard_test1_seismic = make_box(n_inlines, n_crosslines, n_depth, args.box_size)
checkerboard_test1_seismic = checkerboard_test1_seismic.astype(test1_seismic.dtype)
checkerboard_test1_labels = checkerboard_test1_seismic.astype(test1_labels.dtype)
# labels are integers and start from zero
checkerboard_test1_labels[checkerboard_test1_seismic < WHITE_LABEL] = WHITE_LABEL
logging.info(f"test1 data shape {checkerboard_test1_seismic.shape}")
logging.info("test2 checkerbox")
n_inlines, n_crosslines, n_depth = test2_seismic.shape
checkerboard_test2_seismic = make_box(n_inlines, n_crosslines, n_depth, args.box_size)
checkerboard_test2_seismic = checkerboard_test2_seismic.astype(test2_seismic.dtype)
checkerboard_test2_labels = checkerboard_test2_seismic.astype(test2_labels.dtype)
# labels are integers and start from zero
checkerboard_test2_labels[checkerboard_test2_seismic < WHITE_LABEL] = WHITE_LABEL
logging.info(f"test2 data shape {checkerboard_test2_seismic.shape}")
# substitute gradient dataset instead of checkerboard
elif args.type == "gradient":
@ -257,10 +285,20 @@ WHITE_LABEL = 0
BLACK_LABEL = BLACK
TYPES = ["checkerboard", "gradient", "binary"]
GRADIENT_DIR = ["inline", "crossline", "depth"]
METHODS = ["dutch_f3", "fixed_box_number"]
parser.add_argument("--dataroot", help="Root location of the input data", type=str, required=True)
parser.add_argument("--dataout", help="Root location of the output data", type=str, required=True)
parser.add_argument("--box_size", help="Size of the bounding box", type=int, required=False, default=100)
parser.add_argument(
"--based_on",
help="This determines the shape of synthetic data array",
type=str,
required=False,
choices=METHODS,
default="dutch_f3",
)
parser.add_argument("--box_number", help="Number of boxes", type=int, required=False, default=2)
parser.add_argument(
"--type", help="Type of data to generate", type=str, required=False, choices=TYPES, default="checkerboard",
)
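To make the checkerboard labelling step above concrete, here is a small hedged sketch: the seismic cube alternates between WHITE and BLACK amplitudes, and clamping negative values to `WHITE_LABEL` after an integer cast yields a two-class {0, 1} label volume. The WHITE/BLACK values and the 2-D toy array are assumptions for illustration; the real data is 3-D.

```python
# Hedged sketch of the checkerboard -> labels step (WHITE/BLACK values assumed).
import numpy as np

WHITE, BLACK, WHITE_LABEL = -1.0, 1.0, 0
checkerboard_seismic = np.tile(np.array([[WHITE, BLACK], [BLACK, WHITE]]), (2, 2))
checkerboard_labels = checkerboard_seismic.astype(int)
checkerboard_labels[checkerboard_seismic < WHITE_LABEL] = WHITE_LABEL  # labels start at zero
print(np.unique(checkerboard_labels))  # [0 1]
```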


@ -148,6 +148,10 @@ def split_patch_train_val(
iline, xline, depth = labels.shape
# Since the locations we will save reference the padded volume, we will increase
# the depth of the volume by the padding amount (2*patch_size).
depth += 2 * patch_size
split_direction = split_direction.lower()
if split_direction == "inline":
num_sections, section_length = iline, xline
@ -157,8 +161,10 @@ def split_patch_train_val(
raise ValueError(f"Unknown split_direction: {split_direction}")
train_range, val_range = _get_aline_range(num_sections, per_val, section_stride)
vert_locations = range(0, depth, patch_stride)
buffer = patch_size // 2
vert_locations = range(buffer, depth - patch_size - buffer, patch_stride)
horz_locations = range(0, section_length, patch_stride)
logger.debug(vert_locations)
logger.debug(horz_locations)
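A hedged worked example of the patch-origin ranges after this change: the depth is grown by the padding amount and vertical patch origins are kept a half-patch away from the padded edges. The volume dimensions below are illustrative, not taken from a real split.

```python
# Hedged worked example of the new vertical/horizontal patch-origin ranges.
patch_size, patch_stride = 100, 50
iline, xline, depth = 400, 701, 255

depth += 2 * patch_size                  # padded volume depth (2 * patch_size added)
buffer = patch_size // 2
vert_locations = range(buffer, depth - patch_size - buffer, patch_stride)
horz_locations = range(0, xline, patch_stride)
print(list(vert_locations))              # [50, 100, 150, 200, 250, 300]
print(len(list(horz_locations)))         # 15
```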


@ -41,6 +41,8 @@ def _copy_files(files_iter, new_dir):
def _split_train_val_test(partition, val_ratio, test_ratio):
logger = logging.getLogger("__name__")
logger.warning(f"prepare_penobscot.py does not support padding. Results might be incorrect. ")
total_samples = len(partition)
val_samples = math.floor(val_ratio * total_samples)
test_samples = math.floor(test_ratio * total_samples)
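A tiny worked example of the split-size arithmetic above, with an illustrative partition size (not from the dataset):

```python
# Hedged sketch of the val/test sample counts used by _split_train_val_test.
import math

total_samples = 1000
val_ratio, test_ratio = 0.1, 0.2
val_samples = math.floor(val_ratio * total_samples)          # 100
test_samples = math.floor(test_ratio * total_samples)        # 200
train_samples = total_samples - val_samples - test_samples   # 700
print(train_samples, val_samples, test_samples)
```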


@ -0,0 +1,2 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.


@ -1,54 +0,0 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.
# Pull request against these branches will trigger this build
pr:
- master
- staging
- contrib
# Any commit to this branch will trigger the build.
trigger:
- master
- staging
- contrib
jobs:
# partially disable setup for now - done manually on build VM
- job: setup
timeoutInMinutes: 10
displayName: Setup
pool:
name: deepseismicagentpool
steps:
- bash: |
# terminate as soon as any internal script fails
set -e
echo "Running setup..."
pwd
ls
git branch
uname -ra
# ENABLE ALL FOLLOWING CODE WHEN YOU'RE READY TO ADD AML BUILD - disabled right now
# ./scripts/env_reinstall.sh
# use hardcoded root for now because not sure how env changes under ADO policy
# DATA_ROOT="/home/alfred/data_dynamic"
# ./tests/cicd/src/scripts/get_data_for_builds.sh ${DATA_ROOT}
# copy your model files like so - using dummy file to illustrate
# azcopy --quiet --source:https://$(storagename).blob.core.windows.net/models/model --source-key $(storagekey) --destination /home/alfred/models/your_model_name
- job: AML_job_placeholder
dependsOn: setup
timeoutInMinutes: 5
displayName: AML job placeholder
pool:
name: deepseismicagentpool
steps:
- bash: |
# UNCOMMENT THIS WHEN YOU HAVE UNCOMMENTED THE SETUP JOB
# source activate seismic-interpretation
echo "TADA!!"


@ -52,10 +52,17 @@ jobs:
./tests/cicd/src/scripts/get_data_for_builds.sh ${DATA_ROOT}
# taken from https://zenodo.org/record/3924682
# paper https://arxiv.org/abs/1905.04307
# TODO: enable when Penobscot is ready to be provided in the repo - rough sequence of steps below
# cd scripts
# wget -o /dev/null -O dataset.h5 https://zenodo.org/record/3924682/files/dataset.h5?download=1
# python byod_penobscot.py --filename dataset.h5 --outdir <where to output data>
# python prepare_dutchf3.py split_train_val patch --data_dir=<outdir from the previous step> --label_file=train/train_labels.npy --output_dir=splits --stride=50 --patch_size=100 --split_direction=both --section_stride=100
# copy your model files like so - using dummy file to illustrate
azcopy --quiet --source:https://$(storagename).blob.core.windows.net/models/model --source-key $(storagekey) --destination /home/alfred/models/your_model_name
###################################################################################################
# Stage 2: fast unit tests
###################################################################################################
@ -63,7 +70,7 @@ jobs:
- job: scripts_unit_tests_job
dependsOn: setup
timeoutInMinutes: 5
displayName: Unit Tests
displayName: Generic Unit Tests
pool:
name: deepseismicagentpool
steps:
@ -72,10 +79,35 @@ jobs:
echo "Starting scripts unit tests"
source activate seismic-interpretation
pytest --durations=0 tests/
echo "Script unit test job passed"
- job: data_loaders_unit_tests_job
dependsOn: scripts_unit_tests_job
timeoutInMinutes: 5
displayName: Data Loaders Unit Tests
pool:
name: deepseismicagentpool
steps:
- bash: |
set -e
echo "Starting scripts unit tests"
source activate seismic-interpretation
pytest --durations=0 interpretation/deepseismic_interpretation/dutchf3/tests/
- job: segy_utils_unit_test_job
dependsOn: data_loaders_unit_tests_job
timeoutInMinutes: 5
displayName: SEGY Converter Unit Tests
pool:
name: deepseismicagentpool
steps:
- bash: |
set -e
echo "Starting scripts unit tests"
source activate seismic-interpretation
pytest --durations=0 interpretation/deepseismic_interpretation/segyconverter/test
- job: cv_lib_unit_tests_job
dependsOn: scripts_unit_tests_job
dependsOn: segy_utils_unit_test_job
timeoutInMinutes: 5
displayName: cv_lib Unit Tests
pool:
@ -89,15 +121,15 @@ jobs:
echo "cv_lib unit test job passed"
###################################################################################################
# Stage 3: Dutch F3 patch models on checkerboard test set:
# Stage 3: Patch models on checkerboard test set:
# deconvnet, unet, HRNet patch depth, HRNet section depth
# CAUTION: these builds were reverted to single-GPU; the new multi-GPU code is left in so the revert can be undone later
###################################################################################################
- job: checkerboard_dutchf3_patch
- job: checkerboard_patch
dependsOn: cv_lib_unit_tests_job
timeoutInMinutes: 30
displayName: Checkerboard Dutch F3 patch local
timeoutInMinutes: 15
displayName: Checkerboard patch local
pool:
name: deepseismicagentpool
steps:
@ -108,7 +140,7 @@ jobs:
# disable auto error handling as we flag it manually
set +e
cd experiments/interpretation/dutchf3_patch/local
cd experiments/interpretation/dutchf3_patch
# Create a temporary directory to store the statuses
dir=$(mktemp -d)
@ -119,36 +151,44 @@ jobs:
pids=
# export CUDA_VISIBLE_DEVICES=0
{ python train.py 'DATASET.ROOT' '/home/alfred/data_dynamic/checkerboard/data' \
'NUM_DEBUG_BATCHES' 50 'TRAIN.END_EPOCH' 1 'TRAIN.SNAPSHOTS' 1 \
'NUM_DEBUG_BATCHES' 64 \
'TRAIN.END_EPOCH' 13 'TRAIN.SNAPSHOTS' 1 \
'DATASET.NUM_CLASSES' 2 'DATASET.CLASS_WEIGHTS' '[1.0, 1.0]' \
'TRAIN.DEPTH' 'none' \
'TRAIN.BATCH_SIZE_PER_GPU' 16 'VALIDATION.BATCH_SIZE_PER_GPU' 32 \
'OUTPUT_DIR' 'output' 'TRAIN.MODEL_DIR' 'no_depth' \
'WORKERS' 1 \
--cfg=configs/patch_deconvnet.yaml --debug ; echo "$?" > "$dir/$BASHPID"; }
pids+=" $!"
# export CUDA_VISIBLE_DEVICES=1
{ python train.py 'DATASET.ROOT' '/home/alfred/data_dynamic/checkerboard/data' \
'NUM_DEBUG_BATCHES' 10 'TRAIN.END_EPOCH' 1 'TRAIN.SNAPSHOTS' 1 \
'NUM_DEBUG_BATCHES' 64 \
'TRAIN.END_EPOCH' 1 'TRAIN.SNAPSHOTS' 1 \
'DATASET.NUM_CLASSES' 2 'DATASET.CLASS_WEIGHTS' '[1.0, 1.0]' \
'TRAIN.DEPTH' 'section' \
'TRAIN.BATCH_SIZE_PER_GPU' 16 'VALIDATION.BATCH_SIZE_PER_GPU' 32 \
'OUTPUT_DIR' 'output' 'TRAIN.MODEL_DIR' 'section_depth' \
'WORKERS' 1 \
--cfg=configs/unet.yaml --debug ; echo "$?" > "$dir/$BASHPID"; }
pids+=" $!"
# export CUDA_VISIBLE_DEVICES=2
{ python train.py 'DATASET.ROOT' '/home/alfred/data_dynamic/checkerboard/data' \
'NUM_DEBUG_BATCHES' 50 'TRAIN.END_EPOCH' 1 'TRAIN.SNAPSHOTS' 1 \
'NUM_DEBUG_BATCHES' 64 \
'TRAIN.END_EPOCH' 1 'TRAIN.SNAPSHOTS' 1 \
'DATASET.NUM_CLASSES' 2 'DATASET.CLASS_WEIGHTS' '[1.0, 1.0]' \
'TRAIN.DEPTH' 'section' \
'TRAIN.BATCH_SIZE_PER_GPU' 16 'VALIDATION.BATCH_SIZE_PER_GPU' 32 \
'OUTPUT_DIR' 'output' 'TRAIN.MODEL_DIR' 'section_depth' \
'WORKERS' 1 \
--cfg=configs/seresnet_unet.yaml --debug ; echo "$?" > "$dir/$BASHPID"; }
pids+=" $!"
# export CUDA_VISIBLE_DEVICES=3
{ python train.py 'DATASET.ROOT' '/home/alfred/data_dynamic/checkerboard/data' \
'NUM_DEBUG_BATCHES' 5 'TRAIN.END_EPOCH' 1 'TRAIN.SNAPSHOTS' 1 \
'NUM_DEBUG_BATCHES' 64 \
'TRAIN.END_EPOCH' 2 'TRAIN.SNAPSHOTS' 1 \
'DATASET.NUM_CLASSES' 2 'DATASET.CLASS_WEIGHTS' '[1.0, 1.0]' \
'TRAIN.DEPTH' 'section' \
'TRAIN.BATCH_SIZE_PER_GPU' 16 'VALIDATION.BATCH_SIZE_PER_GPU' 32 \
'MODEL.PRETRAINED' '/home/alfred/models/hrnetv2_w48_imagenet_pretrained.pth' \
'OUTPUT_DIR' 'output' 'TRAIN.MODEL_DIR' 'section_depth' \
'WORKERS' 1 \
@ -166,14 +206,21 @@ jobs:
# Remove the temporary directory
rm -r "$dir"
set -e
python ../../../tests/cicd/src/check_data_flow.py --infile data_flow_train_patch_deconvnet_no_depth.json --step train --train_depth none
python ../../../tests/cicd/src/check_data_flow.py --infile data_flow_train_unet_section_depth.json --step train --train_depth section
python ../../../tests/cicd/src/check_data_flow.py --infile data_flow_train_seresnet_unet_section_depth.json --step train --train_depth section
python ../../../tests/cicd/src/check_data_flow.py --infile data_flow_train_hrnet_section_depth.json --step train --train_depth section
set +e
# check validation set performance
set -e
python ../../../../tests/cicd/src/check_performance.py --infile metrics_patch_deconvnet_no_depth.json
python ../../../../tests/cicd/src/check_performance.py --infile metrics_unet_section_depth.json
python ../../../../tests/cicd/src/check_performance.py --infile metrics_seresnet_unet_section_depth.json
python ../../../tests/cicd/src/check_performance.py --infile metrics_patch_deconvnet_no_depth.json
python ../../../tests/cicd/src/check_performance.py --infile metrics_unet_section_depth.json
python ../../../tests/cicd/src/check_performance.py --infile metrics_seresnet_unet_section_depth.json
# TODO: enable HRNet test set metrics when we debug HRNet
# python ../../../../tests/cicd/src/check_performance.py --infile metrics_hrnet_section_depth.json
# python ../../../tests/cicd/src/check_performance.py --infile metrics_hrnet_section_depth.json
set +e
echo "All models finished training - start scoring"
@ -256,173 +303,32 @@ jobs:
# Remove the temporary directory
rm -r "$dir"
# check data flow for test
set -e
python ../../../tests/cicd/src/check_data_flow.py --infile data_flow_test_patch_deconvnet_no_depth.json --step test --train_depth none
python ../../../tests/cicd/src/check_data_flow.py --infile data_flow_test_unet_section_depth.json --step test --train_depth section
python ../../../tests/cicd/src/check_data_flow.py --infile data_flow_test_seresnet_unet_section_depth.json --step test --train_depth section
python ../../../tests/cicd/src/check_data_flow.py --infile data_flow_test_hrnet_section_depth.json --step test --train_depth section
set +e
# check test set performance
set -e
python ../../../../tests/cicd/src/check_performance.py --infile metrics_test_patch_deconvnet_no_depth.json --test
python ../../../../tests/cicd/src/check_performance.py --infile metrics_test_unet_section_depth.json --test
python ../../../../tests/cicd/src/check_performance.py --infile metrics_test_seresnet_unet_section_depth.json --test
python ../../../tests/cicd/src/check_performance.py --infile metrics_test_patch_deconvnet_no_depth.json --test
python ../../../tests/cicd/src/check_performance.py --infile metrics_test_unet_section_depth.json --test
python ../../../tests/cicd/src/check_performance.py --infile metrics_test_seresnet_unet_section_depth.json --test
# TODO: enable HRNet test set metrics when we debug HRNet
# python ../../../../tests/cicd/src/check_performance.py --infile metrics_test_hrnet_section_depth.json --test
# python ../../../tests/cicd/src/check_performance.py --infile metrics_test_hrnet_section_depth.json --test
echo "PASSED"
###################################################################################################
# Stage 3: Dutch F3 patch models: deconvnet, unet, HRNet patch depth, HRNet section depth
# CAUTION: these builds were reverted to single-GPU; the new multi-GPU code is left in so the revert can be undone later
###################################################################################################
- job: dutchf3_patch
dependsOn: checkerboard_dutchf3_patch
timeoutInMinutes: 60
displayName: Dutch F3 patch local
pool:
name: deepseismicagentpool
steps:
- bash: |
source activate seismic-interpretation
# disable auto error handling as we flag it manually
set +e
cd experiments/interpretation/dutchf3_patch/local
# Create a temporary directory to store the statuses
dir=$(mktemp -d)
pids=
# export CUDA_VISIBLE_DEVICES=0
{ python train.py 'DATASET.ROOT' '/home/alfred/data_dynamic/dutch_f3/data' 'TRAIN.END_EPOCH' 1 'TRAIN.SNAPSHOTS' 1 \
'TRAIN.DEPTH' 'none' \
'TRAIN.BATCH_SIZE_PER_GPU' 2 'VALIDATION.BATCH_SIZE_PER_GPU' 2 \
'OUTPUT_DIR' 'output' 'TRAIN.MODEL_DIR' 'no_depth' \
'WORKERS' 1 \
--cfg=configs/patch_deconvnet.yaml --debug ; echo "$?" > "$dir/$BASHPID"; }
pids+=" $!"
# export CUDA_VISIBLE_DEVICES=1
{ python train.py 'DATASET.ROOT' '/home/alfred/data_dynamic/dutch_f3/data' 'TRAIN.END_EPOCH' 1 'TRAIN.SNAPSHOTS' 1 \
'TRAIN.DEPTH' 'section' \
'TRAIN.BATCH_SIZE_PER_GPU' 2 'VALIDATION.BATCH_SIZE_PER_GPU' 2 \
'OUTPUT_DIR' 'output' 'TRAIN.MODEL_DIR' 'section_depth' \
'WORKERS' 1 \
--cfg=configs/unet.yaml --debug ; echo "$?" > "$dir/$BASHPID"; }
pids+=" $!"
# export CUDA_VISIBLE_DEVICES=2
{ python train.py 'DATASET.ROOT' '/home/alfred/data_dynamic/dutch_f3/data' 'TRAIN.END_EPOCH' 1 'TRAIN.SNAPSHOTS' 1 \
'TRAIN.DEPTH' 'section' \
'TRAIN.BATCH_SIZE_PER_GPU' 2 'VALIDATION.BATCH_SIZE_PER_GPU' 2 \
'OUTPUT_DIR' 'output' 'TRAIN.MODEL_DIR' 'section_depth' \
'WORKERS' 1 \
--cfg=configs/seresnet_unet.yaml --debug ; echo "$?" > "$dir/$BASHPID"; }
pids+=" $!"
# export CUDA_VISIBLE_DEVICES=3
{ python train.py 'DATASET.ROOT' '/home/alfred/data_dynamic/dutch_f3/data' 'TRAIN.END_EPOCH' 1 'TRAIN.SNAPSHOTS' 1 \
'TRAIN.DEPTH' 'section' \
'TRAIN.BATCH_SIZE_PER_GPU' 2 'VALIDATION.BATCH_SIZE_PER_GPU' 2 \
'MODEL.PRETRAINED' '/home/alfred/models/hrnetv2_w48_imagenet_pretrained.pth' \
'OUTPUT_DIR' 'output' 'TRAIN.MODEL_DIR' 'section_depth' \
'WORKERS' 1 \
--cfg=configs/hrnet.yaml --debug ; echo "$?" > "$dir/$BASHPID"; }
pids+=" $!"
wait $pids || exit 1
# check if any of the models had an error during execution
# Get return information for each pid
for file in "$dir"/*; do
printf 'PID %d returned %d\n' "${file##*/}" "$(<"$file")"
[[ "$(<"$file")" -ne "0" ]] && exit 1 || echo "pass"
done
# Remove the temporary directory
rm -r "$dir"
echo "All models finished training - start scoring"
# Create a temporary directory to store the statuses
dir=$(mktemp -d)
pids=
# export CUDA_VISIBLE_DEVICES=0
# find the latest model which we just trained
# if we're running on a build VM
model_dir=$(ls -td output/patch_deconvnet/no_depth/* | head -1)
# if we're running in a checked out git repo
[[ -z ${model_dir} ]] && model_dir=$(ls -td output/$(git rev-parse --abbrev-ref HEAD)/*/patch_deconvnet/no_depth/* | head -1)
model=$(ls -t ${model_dir}/*.pth | head -1)
# try running the test script
{ python test.py 'DATASET.ROOT' '/home/alfred/data_dynamic/dutch_f3/data' \
'TEST.SPLIT' 'Both' 'TRAIN.MODEL_DIR' 'no_depth' \
'TEST.MODEL_PATH' ${model} \
'WORKERS' 1 \
--cfg=configs/patch_deconvnet.yaml --debug ; echo "$?" > "$dir/$BASHPID"; }
pids+=" $!"
# export CUDA_VISIBLE_DEVICES=1
# find the latest model which we just trained
# if we're running on a build VM
model_dir=$(ls -td output/unet/section_depth/* | head -1)
# if we're running in a checked out git repo
[[ -z ${model_dir} ]] && model_dir=$(ls -td output/$(git rev-parse --abbrev-ref HEAD)/*/unet/section_depth* | head -1)
model=$(ls -t ${model_dir}/*.pth | head -1)
# try running the test script
{ python test.py 'DATASET.ROOT' '/home/alfred/data_dynamic/dutch_f3/data' \
'TEST.SPLIT' 'Both' 'TRAIN.MODEL_DIR' 'section_depth' \
'TEST.MODEL_PATH' ${model} \
'WORKERS' 1 \
--cfg=configs/unet.yaml --debug ; echo "$?" > "$dir/$BASHPID"; }
pids+=" $!"
# export CUDA_VISIBLE_DEVICES=2
# find the latest model which we just trained
# if we're running on a build VM
model_dir=$(ls -td output/seresnet_unet/section_depth/* | head -1)
# if we're running in a checked out git repo
[[ -z ${model_dir} ]] && model_dir=$(ls -td output/$(git rev-parse --abbrev-ref HEAD)/*/seresnet_unet/section_depth/* | head -1)
model=$(ls -t ${model_dir}/*.pth | head -1)
# try running the test script
{ python test.py 'DATASET.ROOT' '/home/alfred/data_dynamic/dutch_f3/data' \
'TEST.SPLIT' 'Both' 'TRAIN.MODEL_DIR' 'section_depth' \
'TEST.MODEL_PATH' ${model} \
'WORKERS' 1 \
--cfg=configs/seresnet_unet.yaml --debug ; echo "$?" > "$dir/$BASHPID"; }
pids+=" $!"
# export CUDA_VISIBLE_DEVICES=3
# find the latest model which we just trained
# if we're running on a build VM
model_dir=$(ls -td output/hrnet/section_depth/* | head -1)
# if we're running in a checked out git repo
[[ -z ${model_dir} ]] && model_dir=$(ls -td output/$(git rev-parse --abbrev-ref HEAD)/*/hrnet/section_depth/* | head -1)
model=$(ls -t ${model_dir}/*.pth | head -1)
# try running the test script
{ python test.py 'DATASET.ROOT' '/home/alfred/data_dynamic/dutch_f3/data' \
'TEST.SPLIT' 'Both' 'TRAIN.MODEL_DIR' 'section_depth' \
'MODEL.PRETRAINED' '/home/alfred/models/hrnetv2_w48_imagenet_pretrained.pth' \
'TEST.MODEL_PATH' ${model} \
'WORKERS' 1 \
--cfg=configs/hrnet.yaml --debug ; echo "$?" > "$dir/$BASHPID"; }
pids+=" $!"
# wait for completion
wait $pids || exit 1
# check if any of the models had an error during execution
# Get return information for each pid
for file in "$dir"/*; do
printf 'PID %d returned %d\n' "${file##*/}" "$(<"$file")"
[[ "$(<"$file")" -ne "0" ]] && exit 1 || echo "pass"
done
# Remove the temporary directory
rm -r "$dir"
echo "PASSED"
###################################################################################################
# Stage 5: Notebook tests
# Stage 4: Notebook tests
###################################################################################################
- job: F3_block_training_and_evaluation_local_notebook
dependsOn: dutchf3_patch
dependsOn: checkerboard_patch
timeoutInMinutes: 5
displayName: F3 block training and evaluation local notebook
pool:
@ -434,3 +340,40 @@ jobs:
--nbname examples/interpretation/notebooks/Dutch_F3_patch_model_training_and_evaluation.ipynb \
--dataset_root /home/alfred/data_dynamic/dutch_f3/data \
--model_pretrained download
- job: segyconverter_notebooks
dependsOn: F3_block_training_and_evaluation_local_notebook
timeoutInMinutes: 5
displayName: SEGY converter notebooks
pool:
name: deepseismicagentpool
steps:
- bash: |
source activate seismic-interpretation
pytest -s tests/cicd/src/notebook_integration_tests.py \
--nbname examples/interpretation/segyconverter/01_segy_sample_files.ipynb \
--cwd examples/interpretation/segyconverter
pytest -s tests/cicd/src/notebook_integration_tests.py \
--nbname examples/interpretation/segyconverter/02_segy_convert_sample.ipynb \
--cwd examples/interpretation/segyconverter
###################################################################################################
# Stage 5: Docker tests
###################################################################################################
- job: docker_build_test
dependsOn: segyconverter_notebooks
timeoutInMinutes: 30
displayName: Docker build test
pool:
name: deepseismicagentpool
steps:
- bash: |
set -e
echo "build docker"
cd docker
pwd
docker images | grep "seismic-deeplearning" | awk '{print $1 ":" $2}' | xargs docker rmi || echo "pass if no seismic-deeplearning image is found"
docker images | grep "<none>" | awk '{print $1 ":" $2}' | xargs docker rmi || echo "pass if no non-tagged images is found"
docker build -t seismic-deeplearning .
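The checkerboard training job earlier in this pipeline runs several `train.py` invocations in parallel, writes each run's exit code into a temp directory, and fails the build if any of them is non-zero. Below is a hedged Python sketch of the same pattern (the build itself does this in bash); the commands are placeholders, not the build's real invocations.

```python
# Hedged Python sketch of the parallel-job / exit-status pattern used in the build.
import subprocess
from concurrent.futures import ThreadPoolExecutor

commands = [
    "python -c \"print('deconvnet ok')\"",
    "python -c \"print('unet ok')\"",
]

def run(cmd: str) -> int:
    return subprocess.run(cmd, shell=True).returncode

with ThreadPoolExecutor() as pool:
    return_codes = list(pool.map(run, commands))

for cmd, code in zip(commands, return_codes):
    print(f"{cmd!r} returned {code}")

if any(code != 0 for code in return_codes):
    raise SystemExit(1)  # mirror the build's `exit 1` when any training run fails
```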


@ -0,0 +1,183 @@
#!/usr/bin/env python3
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.
""" Please see the def main() function for code description."""
import json
""" libraries """
import numpy as np
import os
np.set_printoptions(linewidth=200)
import logging
# toggle to WARNING when running in production, or use CLI
logging.getLogger().setLevel(logging.DEBUG)
# logging.getLogger().setLevel(logging.WARNING)
import argparse
parser = argparse.ArgumentParser()
""" useful information when running from a GIT folder."""
myname = os.path.realpath(__file__)
mypath = os.path.dirname(myname)
myname = os.path.basename(myname)
def main(args):
"""
Tests to ensure proper data flow throughout the experiments.
"""
logging.info("loading data")
with open(args.infile, "r") as fp:
data = json.load(fp)
# Note: these are specific to the setup in
# main_build.yml for train.py
# and get_data_for_builds.sh and prepare_dutchf3.py
if args.step == "test":
for test_key in data.keys():
if args.train_depth == "none":
expected_test_input_shape = (200, 200, 200)
expected_img = (1, 1, 200, 200)
elif args.train_depth == "section":
expected_test_input_shape = (200, 3, 200, 200)
expected_img = (1, 3, 200, 200)
elif args.train_depth == "patch":
expected_test_input_shape = "TBD"
expected_img = "TBD"
raise Exception("Must be added")
msg = f"Expected {expected_test_input_shape} for shape, received {tuple(data[test_key]['test_input_shape'])} instead, in {args.infile.split('.')[0]}"
assert tuple(data[test_key]["test_input_shape"]) == expected_test_input_shape, msg
expected_test_label_shape = (200, 200, 200)
msg = f"Expected {expected_test_label_shape} for shape, received {tuple(data[test_key]['test_label_shape'])} instead, in {args.infile.split('.')[0]}"
assert tuple(data[test_key]["test_label_shape"]) == expected_test_label_shape, msg
for img in data[test_key]["img_shape"]:
msg = (
f"Expected {expected_img} for shape, received {tuple(img)} instead, in {args.infile.split('.')[0]}"
)
assert tuple(img) == expected_img, msg
# -----------------------------------------------
exp_n_section = data[test_key]["take_n_sections"]
pred_shape_len = len(data[test_key]["pred_shape"])
msg = f"Expected {exp_n_section} number of items, received {pred_shape_len} instead, in {args.infile.split('.')[0]}"
assert pred_shape_len == exp_n_section, msg
gt_shape_len = len(data[test_key]["gt_shape"])
msg = f"Expected {exp_n_section} number of items, received {gt_shape_len} instead, in {args.infile.split('.')[0]}"
assert gt_shape_len == exp_n_section, msg
img_shape_len = len(data[test_key]["img_shape"])
msg = f"Expected {exp_n_section} number of items, received {img_shape_len} instead, in {args.infile.split('.')[0]}"
assert img_shape_len == exp_n_section, msg
expected_len = 400
lhs_assertion = data[test_key]["test_section_loader_length"]
msg = f"Expected {expected_len} for test section loader length, received {lhs_assertion} instead, in {args.infile.split('.')[0]}"
assert lhs_assertion == expected_len, msg
lhs_assertion = data[test_key]["test_loader_length"]
msg = f"Expected {expected_len} for test loader length, received {lhs_assertion} instead, in {args.infile.split('.')[0]}"
assert lhs_assertion == expected_len, msg
expected_n_classes = 2
lhs_assertion = data[test_key]["n_classes"]
msg = f"Expected {expected_n_classes} for test loader length, received {lhs_assertion} instead, in {args.infile.split('.')[0]}"
assert lhs_assertion == expected_n_classes, msg
expected_pred = (1, 200, 200)
expected_gt = (1, 1, 200, 200)
for pred, gt in zip(data[test_key]["pred_shape"], data[test_key]["gt_shape"]):
# dimension
msg = f"Expected {expected_pred} for prediction shape, received {tuple(pred[0])} instead, in {args.infile.split('.')[0]}"
assert tuple(pred[0]) == expected_pred, msg
# unique classes
msg = f"Expected up to {expected_n_classes} unique prediction classes, received {pred[1]} instead, in {args.infile.split('.')[0]}"
assert pred[1] <= expected_n_classes, msg
# dimension
msg = f"Expected {expected_gt} for ground truth mask shape, received {tuple(gt[0])} instead, in {args.infile.split('.')[0]}"
assert tuple(gt[0]) == expected_gt, msg
# unique classes
msg = f"Expected up to {expected_n_classes} unique ground truth classes, received {gt[1]} instead, in {args.infile.split('.')[0]}"
assert gt[1] <= expected_n_classes, msg
elif args.step == "train":
if args.train_depth == "none":
expected_shape_in = (200, 200, 400)
elif args.train_depth == "section":
expected_shape_in = (200, 3, 200, 400)
elif args.train_depth == "patch":
expected_shape_in = "TBD"
raise Exception("Must be added")
msg = f"Expected {expected_shape_in} for shape, received {tuple(data['train_input_shape'])} instead, in {args.infile.split('.')[0]}"
assert tuple(data["train_input_shape"]) == expected_shape_in, msg
expected_shape_label = (200, 200, 400)
msg = f"Expected {expected_shape_label} for shape, received {tuple(data['train_label_shape'])} instead, in {args.infile.split('.')[0]}"
assert tuple(data["train_label_shape"]) == expected_shape_label, msg
expected_len = 64
msg = f"Expected {expected_len} for train patch loader length, received {data['train_patch_loader_length']} instead, in {args.infile.split('.')[0]}"
assert data["train_patch_loader_length"] == expected_len, msg
expected_len = 1280
msg = f"Expected {expected_len} for validation patch loader length, received {data['validation_patch_loader_length']} instead, in {args.infile.split('.')[0]}"
assert data["validation_patch_loader_length"] == expected_len, msg
expected_len = 64
msg = f"Expected {expected_len} for train subset length, received {data['train_length_subset']} instead, in {args.infile.split('.')[0]}"
assert data["train_length_subset"] == expected_len, msg
expected_len = 32
msg = f"Expected {expected_len} for validation subset length, received {data['validation_length_subset']} instead, in {args.infile.split('.')[0]}"
assert data["validation_length_subset"] == expected_len, msg
expected_len = 4
msg = f"Expected {expected_len} for train loader length, received {data['train_loader_length']} instead, in {args.infile.split('.')[0]}"
assert data["train_loader_length"] == expected_len, msg
expected_len = 1
msg = f"Expected {expected_len} for train loader length, received {data['train_loader_length']} instead, in {args.infile.split('.')[0]}"
assert data["validation_loader_length"] == expected_len, msg
expected_n_classes = 2
msg = f"Expected {expected_n_classes} for number of classes, received {data['n_classes']} instead, in {args.infile.split('.')[0]}"
assert data["n_classes"] == expected_n_classes, msg
logging.info("all done")
""" cmd-line arguments """
STEPS = ["test", "train"]
TRAIN_DEPTH = ["none", "patch", "section"]
parser.add_argument("--infile", help="Location of the file which has the metrics", type=str, required=True)
parser.add_argument(
"--step", choices=STEPS, type=str, required=True, help="Data flow checks for test or training pipeline"
)
parser.add_argument(
"--train_depth", choices=TRAIN_DEPTH, type=str, required=True, help="Train depth flag, to check the dimensions"
)
""" main wrapper with profiler """
if __name__ == "__main__":
main(parser.parse_args())
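To see the train-step checks above end to end, here is a hedged sketch that writes a minimal JSON file containing exactly the keys and values the script asserts on (taken from the expectations in the diff); the filename is hypothetical. It could then be checked with something like `python check_data_flow.py --infile data_flow_train_example.json --step train --train_depth section`.

```python
# Hedged sketch: minimal train-step data-flow JSON matching the assertions above.
import json

example = {
    "train_input_shape": [200, 3, 200, 400],       # section depth
    "train_label_shape": [200, 200, 400],
    "train_patch_loader_length": 64,
    "validation_patch_loader_length": 1280,
    "train_length_subset": 64,
    "validation_length_subset": 32,
    "train_loader_length": 4,
    "validation_loader_length": 1,
    "n_classes": 2,
}

with open("data_flow_train_example.json", "w") as fp:
    json.dump(example, fp, indent=2)
```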


@ -1,4 +1,7 @@
#!/usr/bin/env python3
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.
""" Please see the def main() function for code description."""
import json
import math
@ -43,27 +46,20 @@ def main(args):
if args.test:
metrics_dict["Pixel Accuracy"] = "Pixel Acc: "
metrics_dict["Mean IoU"] = "Mean IoU: "
else:
else: # validation
metrics_dict["Pixel Accuracy"] = "pixacc"
metrics_dict["Mean IoU"] = "mIoU"
# process training set results
assert data[metrics_dict["Pixel Accuracy"]] > 0.0
assert data[metrics_dict["Pixel Accuracy"]] <= 1.0
assert data[metrics_dict["Mean IoU"]] > 0.0
assert data[metrics_dict["Mean IoU"]] <= 1.0
# check for actual values
math.isclose(data[metrics_dict["Pixel Accuracy"]], 1.0, abs_tol=ABS_TOL)
math.isclose(data[metrics_dict["Mean IoU"]], 1.0, abs_tol=ABS_TOL)
assert data[metrics_dict["Pixel Accuracy"]] > 0.97
assert data[metrics_dict["Mean IoU"]] > 0.97
assert data[metrics_dict["Pixel Accuracy"]] <= 1.0
assert data[metrics_dict["Mean IoU"]] <= 1.0
logging.info("all done")
""" GLOBAL VARIABLES """
# tolerance within which values are compared
ABS_TOL = 1e-3
""" cmd-line arguments """
parser.add_argument("--infile", help="Location of the file which has the metrics", type=str, required=True)
parser.add_argument(
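In plain terms, the change above replaces soft `math.isclose` comparisons with hard thresholds: on the synthetic checkerboard data the models are expected to nearly solve the task, so metrics must exceed 0.97. A tiny hedged sketch of that check, with illustrative metric values:

```python
# Hedged sketch of the tightened metric check (values are illustrative).
metrics = {"pixacc": 0.998, "mIoU": 0.991}

for name, value in metrics.items():
    assert 0.97 < value <= 1.0, f"{name}={value} is below the 0.97 threshold"
```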


@ -20,14 +20,17 @@ def nbname(request):
def dataset_root(request):
return request.config.getoption("--dataset_root")
@pytest.fixture
def model_pretrained(request):
return request.config.getoption("--model_pretrained")
@pytest.fixture
def cwd(request):
return request.config.getoption("--cwd")
"""
def pytest_generate_tests(metafunc):
# This is called for every test. Only get/set command line arguments
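The new `cwd` fixture above reads `--cwd` from the pytest config; for that to work, the option also has to be registered in a `pytest_addoption` hook. The hook itself is outside this hunk, so the sketch below is an assumption about its shape, using only standard pytest API.

```python
# Hedged sketch (assumed): registering the --cwd option the fixture above relies on.
def pytest_addoption(parser):
    parser.addoption(
        "--cwd",
        action="store",
        type=str,
        default=None,
        help="Working directory to run the notebook from",
    )
```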


@ -38,7 +38,7 @@ DATA_F3="${DATA_F3}/data"
cd scripts
python gen_checkerboard.py --dataroot ${DATA_F3} --dataout ${DATA_CHECKERBOARD}
python gen_synthetic_data.py --dataroot ${DATA_F3} --dataout ${DATA_CHECKERBOARD} --type checkerboard --based_on fixed_box_number
# finished data download and generation
@ -50,4 +50,6 @@ python prepare_dutchf3.py split_train_val patch --data_dir=${DATA_F3} --label_
DATA_CHECKERBOARD="${DATA_CHECKERBOARD}/data"
# repeat for checkerboard dataset
python prepare_dutchf3.py split_train_val section --data_dir=${DATA_CHECKERBOARD} --label_file=train/train_labels.npy --output_dir=splits --split_direction=both
python prepare_dutchf3.py split_train_val patch --data_dir=${DATA_CHECKERBOARD} --label_file=train/train_labels.npy --output_dir=splits --stride=50 --patch_size=100 --split_direction=both
python prepare_dutchf3.py split_train_val patch --data_dir=${DATA_CHECKERBOARD} --label_file=train/train_labels.npy --output_dir=splits --stride=50 --patch_size=100 --split_direction=both --section_stride=100