TorchGeo: datasets, samplers, transforms, and pre-trained models for geospatial data
Adam J. Stewart 29edfe8adb
gitattributes: allow diff of test data (#470)
* gitattributes: allow diff of data.py files

* Allow diffs of all text files, not just data.py
2022-03-19 10:31:03 -05:00
.github Fix PyTorch + setuptools bug (#357) 2022-01-12 15:57:49 -06:00
conf extract_archive: support deflate64-compressed zip files (#282) 2022-01-14 23:14:51 -06:00
docs Add OpenBuildings dataset (#402) 2022-02-27 20:33:39 +00:00
experiments Move DataModules to torchgeo.datamodules (#321) 2021-12-23 20:10:50 -06:00
logo Add favicon to ReadTheDocs 2021-09-08 16:08:04 -05:00
tests Fix integration tests on macOS/Windows (#468) 2022-03-19 10:30:20 -05:00
torchgeo VectorDataset: fix issue with empty query (#467) 2022-03-19 10:30:02 -05:00
.codecov.yml Remove Codecov annotations from PRs 2021-09-19 11:07:39 -05:00
.gitattributes gitattributes: allow diff of test data (#470) 2022-03-19 10:31:03 -05:00
.gitignore Ignore PDF figures 2021-10-09 11:58:19 -05:00
.pre-commit-config.yaml Update hook (#464) 2022-03-15 08:36:51 -05:00
.readthedocs.yaml Remove type ignores for PyTorch (#460) 2022-03-14 20:35:37 +00:00
CITATION.cff Use bibtex format auto-generated by GitHub 2021-11-17 22:39:54 -06:00
CODE_OF_CONDUCT.md Add Microsoft open-source template 2021-05-21 11:35:58 -05:00
LICENSE Add Microsoft open-source template 2021-05-21 11:35:58 -05:00
README.md Remove sphinx CI test (#292) 2021-12-17 16:55:20 -08:00
SECURITY.md Add Microsoft open-source template 2021-05-21 11:35:58 -05:00
SUPPORT.md Add Microsoft open-source template 2021-05-21 11:35:58 -05:00
benchmark.py Remove type ignores for PyTorch (#460) 2022-03-14 20:35:37 +00:00
environment.yml torchmetrics: IoU -> JaccardIndex (#361) 2022-01-18 20:34:15 +00:00
evaluate.py Remove type ignores for PyTorch (#460) 2022-03-14 20:35:37 +00:00
pyproject.toml Run linters on tests/data (#356) 2022-01-13 19:16:10 +00:00
setup.cfg Move flake8 configuration to setup.cfg (#398) 2022-02-15 10:53:56 -06:00
train.py Reorganize configuration files (#352) 2022-01-08 10:11:49 -06:00

README.md

TorchGeo

TorchGeo is a PyTorch domain library, similar to torchvision, that provides datasets, transforms, samplers, and pre-trained models specific to geospatial data.

The goal of this library is to make it simple:

  1. for machine learning experts to use geospatial data in their workflows, and
  2. for remote sensing experts to use their data in machine learning workflows.

See our installation instructions, documentation, and examples to learn how to use TorchGeo.

External links: docs codecov pypi conda spack

Tests: style tests

Installation

The recommended way to install TorchGeo is with pip:

$ pip install torchgeo

For conda and spack installation instructions, see the documentation.

Documentation

You can find the documentation for TorchGeo on ReadTheDocs.

Example Usage

The following sections give basic examples of what you can do with TorchGeo. For more examples, check out our tutorials.

First we'll import various classes and functions used in the following sections:

from torch.utils.data import DataLoader
from torchgeo.datasets import CDL, COWCDetection, Landsat7, Landsat8, stack_samples
from torchgeo.samplers import RandomGeoSampler

Benchmark datasets

TorchGeo includes a number of benchmark datasets, that is, datasets that include both input images and target labels. These cover tasks such as image classification, regression, semantic segmentation, object detection, instance segmentation, and change detection.

If you've used torchvision before, these datasets should seem very familiar. In this example, we'll create a dataset for the Cars Overhead With Context (COWC) car detection dataset. This dataset can be automatically downloaded, checksummed, and extracted, just like with torchvision.

dataset = COWCDetection(root="...", split="train", download=True, checksum=True)

This dataset can then be passed to a PyTorch data loader.

dataloader = DataLoader(dataset, batch_size=128, shuffle=True, num_workers=4)

The only difference between a benchmark dataset in TorchGeo and a similar dataset in torchvision is that each dataset returns a dictionary with keys for each PyTorch Tensor.

for batch in dataloader:
    image = batch["image"]
    label = batch["label"]

    # train a model, or make predictions using a pre-trained model
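Code written for torchvision-style datasets often expects positional `(image, label)` tuples rather than dictionaries. The following is a minimal sketch of one way to bridge the two conventions. The helper name `sample_to_tuple` is hypothetical and not part of TorchGeo, and plain Python lists stand in for tensors so that the example runs without PyTorch.

```python
from typing import Any, Dict, Tuple


def sample_to_tuple(
    sample: Dict[str, Any], keys: Tuple[str, ...] = ("image", "label")
) -> Tuple[Any, ...]:
    """Return the values of ``sample`` in the order given by ``keys``.

    Hypothetical helper for illustration; not part of TorchGeo.
    """
    missing = [key for key in keys if key not in sample]
    if missing:
        raise KeyError(f"sample is missing keys: {missing}")
    return tuple(sample[key] for key in keys)


# Stand-in batch; in practice the values would be PyTorch tensors.
batch = {"image": [[0.1, 0.2], [0.3, 0.4]], "label": [1, 0]}
image, label = sample_to_tuple(batch)
```

Keeping the dictionary form inside the data pipeline and converting only at the boundary means that extra keys a dataset may provide are not lost earlier than necessary.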

Geospatial datasets

Many remote sensing applications involve working with generic geospatial data. This data can be challenging to work with due to the sheer variety of data. Geospatial imagery is often multispectral with a different number of spectral bands and spatial resolution for every satellite. In addition, each file may be in a different coordinate reference system (CRS), requiring the data to be reprojected into a matching CRS.

In this example, we show how easy it is to work with geospatial data and to sample small image patches from a combination of Landsat and Cropland Data Layer (CDL) data using TorchGeo. First, we assume that the user has Landsat 7 and 8 imagery downloaded. Since Landsat 8 has more spectral bands than Landsat 7, we'll only use the bands that both satellites have in common. We'll create a single dataset including all images from both Landsat 7 and 8 data by taking the union between these two datasets.

landsat7 = Landsat7(root="...")
landsat8 = Landsat8(root="...", bands=["B2", "B3", "B4", "B5", "B6", "B7", "B8", "B9"])
landsat = landsat7 | landsat8

Next, we take the intersection between this dataset and the Cropland Data Layer (CDL) dataset. We take the intersection instead of the union to ensure that we only sample from regions that have both Landsat and CDL data. Note that we can automatically download and checksum CDL data. Also note that each of these datasets may contain files in different coordinate reference systems (CRS) or resolutions, but TorchGeo automatically ensures that a matching CRS and resolution are used.

cdl = CDL(root="...", download=True, checksum=True)
dataset = landsat & cdl
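To make the difference between union and intersection concrete, here is a toy sketch that models each dataset as a list of axis-aligned bounding boxes. It is not TorchGeo's implementation: the `Box` type, the tile coordinates, and the helper functions are all made up for illustration, and CRS and resolution handling are ignored entirely.

```python
from typing import List, NamedTuple


class Box(NamedTuple):
    """Axis-aligned bounding box in arbitrary map units (illustrative only)."""

    minx: float
    maxx: float
    miny: float
    maxy: float


def overlaps(a: Box, b: Box) -> bool:
    """True if the two boxes share an area (touching edges do not count)."""
    return a.minx < b.maxx and b.minx < a.maxx and a.miny < b.maxy and b.miny < a.maxy


def covered(tiles: List[Box], query: Box) -> bool:
    """True if at least one tile of the dataset overlaps the query region."""
    return any(overlaps(tile, query) for tile in tiles)


# Made-up footprints for three datasets.
landsat7_tiles = [Box(0, 10, 0, 10)]
landsat8_tiles = [Box(10, 20, 0, 10)]
cdl_tiles = [Box(5, 15, 0, 10)]


def in_landsat_union(query: Box) -> bool:
    # Union: data from either Landsat 7 or Landsat 8 is enough.
    return covered(landsat7_tiles, query) or covered(landsat8_tiles, query)


def in_landsat_and_cdl(query: Box) -> bool:
    # Intersection: both imagery and labels must be available.
    return in_landsat_union(query) and covered(cdl_tiles, query)


west = Box(1, 2, 1, 2)      # Landsat 7 only, no CDL
middle = Box(6, 7, 1, 2)    # Landsat 7 and CDL
east = Box(16, 17, 1, 2)    # Landsat 8 only, no CDL
```

In this toy setup only `middle` would be a valid place to sample an image and mask pair, which is the behavior the intersection is meant to guarantee.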

This dataset can now be used with a PyTorch data loader. Unlike benchmark datasets, geospatial datasets often include very large images. For example, the CDL dataset consists of a single image covering the entire continental United States. In order to sample from these datasets using geospatial coordinates, TorchGeo defines a number of samplers. In this example, we'll use a random sampler that returns 256x256 pixel images and an epoch length of 10,000 images. We also use a custom collation function to combine each sample dictionary into a mini-batch of samples.

sampler = RandomGeoSampler(dataset, size=256, length=10000)
dataloader = DataLoader(dataset, batch_size=128, sampler=sampler, collate_fn=stack_samples)
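The idea behind random geospatial sampling can be sketched in a few lines of plain Python: draw a fixed number of square patches uniformly at random from inside a larger region. This is only an illustration under simplifying assumptions (a single rectangular region, abstract map units, a hypothetical function name `random_patches`) and should not be read as how `RandomGeoSampler` is implemented.

```python
import random
from typing import Iterator, Tuple

# (minx, maxx, miny, maxy) in arbitrary map units
Bounds = Tuple[float, float, float, float]


def random_patches(
    bounds: Bounds, size: float, length: int, seed: int = 0
) -> Iterator[Bounds]:
    """Yield ``length`` random square patches of side ``size`` inside ``bounds``."""
    minx, maxx, miny, maxy = bounds
    if maxx - minx < size or maxy - miny < size:
        raise ValueError("region is smaller than the requested patch size")
    rng = random.Random(seed)  # seeded so that runs are reproducible
    for _ in range(length):
        # Pick the lower-left corner so the whole patch stays inside the region.
        x = rng.uniform(minx, maxx - size)
        y = rng.uniform(miny, maxy - size)
        yield (x, x + size, y, y + size)


region = (0.0, 1000.0, 0.0, 500.0)
patches = list(random_patches(region, size=256.0, length=5))
```

Here `length` plays the same role as the epoch length above: it fixes how many patches one pass over the sampler produces, independent of how large the underlying imagery is.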

This data loader can now be used in your normal training/evaluation pipeline.

for batch in dataloader:
    image = batch["image"]
    mask = batch["mask"]

    # train a model, or make predictions using a pre-trained model
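The collation step mentioned above turns a list of per-sample dictionaries into one dictionary per mini-batch. The sketch below shows only the grouping-by-key part, using strings as stand-ins for tensors so that it runs without PyTorch. The function name `group_samples` is hypothetical; a real collation function such as `stack_samples` would presumably also stack each group of tensors into a single batched tensor.

```python
from typing import Any, Dict, Iterable, List


def group_samples(samples: Iterable[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Group a sequence of sample dictionaries into one dictionary of lists."""
    batch: Dict[str, List[Any]] = {}
    for sample in samples:
        for key, value in sample.items():
            # Collect the values that share a key, preserving sample order.
            batch.setdefault(key, []).append(value)
    return batch


# Stand-in samples; real samples would hold image and mask tensors.
samples = [
    {"image": "patch_0", "mask": "mask_0"},
    {"image": "patch_1", "mask": "mask_1"},
]
batch = group_samples(samples)
```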

Train and test models using our PyTorch Lightning-based training script

We provide a script, train.py, for training models using a subset of the datasets. We do this with the PyTorch Lightning LightningModules implemented under the torchgeo.trainers namespace and the LightningDataModules implemented under the torchgeo.datamodules namespace. The train.py script is configurable via the command line and/or via YAML configuration files. See the conf/ directory for example configuration files that can be customized for different training runs.

$ python train.py config_file=conf/landcoverai.yaml
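As a rough illustration of how `key=value` command-line arguments can be layered on top of defaults loaded from a configuration file, consider the sketch below. It is not the logic used by train.py: the function names and every configuration key other than `config_file` (for example `trainer.max_epochs`) are hypothetical, and all values are kept as strings for simplicity.

```python
from typing import Any, Dict, List


def parse_overrides(args: List[str]) -> Dict[str, Any]:
    """Turn ["a.b=1", "c=2"] into {"a": {"b": "1"}, "c": "2"}."""
    overrides: Dict[str, Any] = {}
    for arg in args:
        key, sep, value = arg.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {arg!r}")
        *parents, leaf = key.split(".")
        node = overrides
        for part in parents:
            # Walk (or create) the nested dictionaries named by the dotted key.
            node = node.setdefault(part, {})
        node[leaf] = value
    return overrides


def merge(base: Dict[str, Any], override: Dict[str, Any]) -> Dict[str, Any]:
    """Recursively merge ``override`` into a copy of ``base``."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge(merged[key], value)
        else:
            merged[key] = value
    return merged


# Hypothetical defaults, standing in for values read from a YAML file.
defaults = {"trainer": {"max_epochs": "10", "gpus": "1"}, "seed": "0"}
config = merge(defaults, parse_overrides(["trainer.max_epochs=50", "seed=42"]))
```

With this layering, values given on the command line take precedence, while anything not mentioned there keeps its default from the file.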

Citation

If you use this software in your work, please cite our paper:

@article{Stewart_TorchGeo_deep_learning_2021,
    author = {Stewart, Adam J. and Robinson, Caleb and Corley, Isaac A. and Ortiz, Anthony and Lavista Ferres, Juan M. and Banerjee, Arindam},
    journal = {arXiv preprint arXiv:2111.08872},
    month = {11},
    title = {{TorchGeo: deep learning with geospatial data}},
    url = {https://github.com/microsoft/torchgeo},
    year = {2021}
}

Contributing

This project welcomes contributions and suggestions. If you would like to submit a pull request, see our Contribution Guide for more information.

This project has adopted the Microsoft Open Source Code of Conduct. For more information, see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.