SSL4EO-L: add reproducibility instructions (#1416)

* Create landsat subdirectory

* Python scripts in superdirectory now

* Add README
Adam J. Stewart 2023-06-15 11:58:43 -05:00 committed by GitHub
Parent 2ba9207647
Commit 1ad8caa754
No key found matching this signature
GPG key ID: 4AEE18F83AFDEB23
20 changed files: 145 additions and 47 deletions

@@ -0,0 +1,99 @@
# SSL4EO-L Instructions
This README describes the steps to recreate the datasets and reproduce the results of the SSL4EO-L project.
## Sampling
The first step in creating the SSL4EO-L pre-training and benchmarking datasets is to choose locations from which to sample. The following scripts can be run to choose non-overlapping locations to sample.
```console
$ bash sample_30.sh # for TM, ETM+, OLI/TIRS
$ bash sample_60.sh # only for MSS
$ bash sample_conus.sh # for benchmark datasets
```
The first section of these scripts includes user-specific parameters that can be modified to change the behavior of the scripts. Of particular importance are:
* `SAVE_PATH`: where the CSV of sampled locations is saved
* `START_INDEX`: index to start from (usually 0, can be increased to append more locations)
* `END_INDEX`: index to stop at (start with ~500K)
These scripts will download world city data and write `sampled_locations.csv` files to be used for downloading.
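As a rough illustration of what "non-overlapping" means here, the sketch below rejection-samples square footprints on a flat plane and keeps only those that do not intersect an earlier one. It is a toy under stated assumptions, not the project's sampling code: the function names, the planar `extent`, and the `half_width` parameter are all hypothetical, and the real scripts work with geographic coordinates around world cities.

```python
import random


def boxes_overlap(a, b):
    """Return True if two (minx, miny, maxx, maxy) boxes intersect."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def sample_non_overlapping(num_locations, half_width, extent, seed=0, max_tries=100_000):
    """Rejection-sample box centers inside `extent` so that no two boxes overlap.

    Hypothetical helper: a linear scan over accepted boxes is fine for a toy
    example but would be too slow for hundreds of thousands of locations.
    """
    rng = random.Random(seed)
    minx, miny, maxx, maxy = extent
    centers, boxes = [], []
    for _ in range(max_tries):
        if len(centers) == num_locations:
            break
        x = rng.uniform(minx, maxx)
        y = rng.uniform(miny, maxy)
        box = (x - half_width, y - half_width, x + half_width, y + half_width)
        # Keep the candidate only if it is disjoint from every accepted box
        if not any(boxes_overlap(box, other) for other in boxes):
            centers.append((x, y))
            boxes.append(box)
    return centers


locations = sample_non_overlapping(
    num_locations=20, half_width=0.5, extent=(0.0, 0.0, 100.0, 100.0)
)
print(len(locations))
```

At the scale suggested above (~500K locations), a spatial index would be a more realistic choice than the pairwise check used in this sketch.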
## Downloading
Next, download the imagery for the sampled locations.
```console
$ bash download_mss_raw.sh
$ bash download_tm_toa.sh
$ bash download_etm_toa.sh
$ bash download_etm_sr.sh
$ bash download_oli_tirs_toa.sh
$ bash download_oli_sr.sh
```
These scripts contain the following variables you may want to modify:
* `ROOT_DIR`: root directory containing all subdirectories
* `SAVE_PATH`: where the downloaded data is saved
* `MATCH_FILE`: the CSV created in the previous step
* `NUM_WORKERS`: number of parallel workers
* `START_INDEX`: index from which to start downloading
* `END_INDEX`: index at which to stop downloading
These scripts are designed for downloading the pre-training datasets. Each script can be easily modified to download the benchmarking datasets instead by changing the `MATCH_FILE`, `YEAR`, and `--dates` passed to the download script. For ETM+ TOA, you'll also want to set a `--default-value`, because images affected by the Landsat 7 Scan Line Corrector failure (SLC-off) contain nodata pixels that must be kept.
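To make the roles of `START_INDEX`, `END_INDEX`, and `NUM_WORKERS` concrete, here is a minimal sketch of selecting a row range from a locations CSV and dividing it among workers. The column names, the half-open `[start, end)` convention, and both helper functions are assumptions made for illustration; check the download script itself for the actual behavior.

```python
import csv
import io


def select_rows(csv_text, start_index, end_index):
    """Return rows whose position i satisfies start_index <= i < end_index.

    Assumption: the end index is exclusive. The real scripts may differ.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for i, row in enumerate(reader) if start_index <= i < end_index]


def split_among_workers(rows, num_workers):
    """Round-robin split so each worker gets a similar number of rows."""
    return [rows[w::num_workers] for w in range(num_workers)]


# Hypothetical CSV layout, for illustration only
SAMPLE_CSV = """index,lon,lat
0,10.0,50.0
1,11.0,51.0
2,12.0,52.0
3,13.0,53.0
4,14.0,54.0
"""

rows = select_rows(SAMPLE_CSV, start_index=1, end_index=4)
chunks = split_among_workers(rows, num_workers=2)
for worker_id, chunk in enumerate(chunks):
    print(worker_id, [row["index"] for row in chunk])
```

Downloading in index ranges like this also makes it straightforward to resume an interrupted run by raising the start index.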
## Parallel corpus
For each TOA and SR product, we want to create a parallel corpus. This can be done by running:
```console
$ bash delete_mismatch.sh
```
You may want to modify `ROOT_DIR`.
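The idea behind a parallel corpus is that every product ends up covering the same set of locations. The sketch below shows one way to find the locations that break this property, assuming each `imgs` directory contains one subdirectory per location. That layout and the `mismatched_locations` helper are assumptions for illustration; the sketch only reports mismatches and deletes nothing.

```python
import tempfile
from pathlib import Path


def mismatched_locations(roots):
    """Map each root to the location directories not shared by all roots."""
    ids = {root: {p.name for p in Path(root).iterdir() if p.is_dir()} for root in roots}
    common = set.intersection(*ids.values())
    return {root: sorted(names - common) for root, names in ids.items()}


# Build two small throwaway directory trees to demonstrate the comparison
with tempfile.TemporaryDirectory() as tmp:
    toa = Path(tmp) / "toa" / "imgs"
    sr = Path(tmp) / "sr" / "imgs"
    for name in ("0000000", "0000001", "0000002"):
        (toa / name).mkdir(parents=True)
    for name in ("0000001", "0000002", "0000003"):
        (sr / name).mkdir(parents=True)

    extra = mismatched_locations([toa, sr])
    # Key the report by product name ("toa" or "sr") for readability
    report = {Path(root).parent.name: names for root, names in extra.items()}

print(report)
```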
## Compression
The final step in dataset creation is to convert float32 values to uint8 and create compressed COG files. This can be done by running:
```console
$ bash compress_tm_toa.sh
$ bash compress_etm_toa.sh
$ bash compress_etm_sr.sh
$ bash compress_oli_tirs_toa.sh
$ bash compress_oli_sr.sh
```
You may want to modify `ROOT_DIR` or `NUM_WORKERS`.
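The float32-to-uint8 conversion can be pictured as clipping each band to a fixed range and rescaling that range to 0..255. The sketch below shows this for a flat list of pixel values. It is a simplified illustration, not the project's compression code: the `to_uint8` helper and the example range are hypothetical, and the compress scripts define their own per-band `MIN` and `MAX` values.

```python
def to_uint8(values, vmin, vmax):
    """Clip values to [vmin, vmax] and rescale linearly to integers in 0..255."""
    scale = 255.0 / (vmax - vmin)
    out = []
    for v in values:
        clipped = min(max(v, vmin), vmax)
        out.append(int(round((clipped - vmin) * scale)))
    return out


# Values below vmin or above vmax saturate at 0 and 255 respectively
pixels = [-0.5, 0.0, 0.2, 1.0, 2.0]
quantized = to_uint8(pixels, vmin=0.0, vmax=1.0)
print(quantized)
```

The conversion is lossy: values outside the chosen range are saturated, so the range should be picked per band to cover the bulk of the data.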
## Chipping
For the benchmark datasets, there is one additional step required. You should download NLCD and CDL files from the same years as the benchmark datasets, either manually or using TorchGeo. Then you should run:
```console
$ python3 chip_landsat_benchmark.py ...
```
This will create patches of NLCD and CDL data with the same locations and dimensions as the Landsat images you downloaded. Valid options can be found by passing `--help`.
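Conceptually, chipping maps the geographic bounds of each Landsat image to a pixel window in the larger land-cover raster and crops that window. The toy sketch below does this for a north-up raster stored as nested lists. The helper names and the example geometry are assumptions for illustration; the actual script may handle reprojection and file I/O in ways this sketch does not.

```python
def window_for_bounds(origin_x, origin_y, res, bounds):
    """Return (row_start, row_stop, col_start, col_stop) for a north-up raster.

    (origin_x, origin_y) is the top-left corner and `res` the pixel size.
    """
    minx, miny, maxx, maxy = bounds
    col_start = int(round((minx - origin_x) / res))
    col_stop = int(round((maxx - origin_x) / res))
    # Rows count downward from the top edge, so maxy gives the first row
    row_start = int(round((origin_y - maxy) / res))
    row_stop = int(round((origin_y - miny) / res))
    return row_start, row_stop, col_start, col_stop


def crop(raster, window):
    """Slice a nested-list raster using a (r0, r1, c0, c1) window."""
    r0, r1, c0, c1 = window
    return [row[c0:c1] for row in raster[r0:r1]]


# 6x6 toy land-cover raster: top-left corner at (0, 180), 30 m pixels
raster = [[10 * r + c for c in range(6)] for r in range(6)]
window = window_for_bounds(origin_x=0, origin_y=180, res=30, bounds=(60, 60, 120, 120))
chip = crop(raster, window)
print(window, chip)
```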
## Running Experiments
Using either the newly created datasets or the datasets downloaded from Hugging Face, you can run each experiment using:
```console
$ python3 ../../../train.py config_file=...
```
The config files to be passed can be found in the `../../../conf/` directory. Feel free to tweak any hyperparameters you see in these files. The default values are the optimal hyperparameters we found.
## Plotting
The following scripts can be run to generate the plots in our paper:
```console
$ python3 plot_landsat_bands.py RBV MSS ETM --fig-height=3 # only TM, ETM+, OLI/TIRS
$ python3 plot_landsat_bands.py # all bands
$ python3 plot_landsat_timeline.py
```

@@ -7,8 +7,8 @@ set -euo pipefail
 # User-specific parameters
 ROOT_DIR=data
-SRC_DIR="$ROOT_DIR/ssl4eo-l7-l2"
-DST_DIR="$ROOT_DIR/ssl4eo-l7-l2-v2"
+SRC_DIR="$ROOT_DIR/ssl4eo_l_etm_sr"
+DST_DIR="$ROOT_DIR/ssl4eo_l_etm_sr_v2"
 NUM_WORKERS=40
 # Satellite-specific parameters
@@ -21,7 +21,7 @@ MIN=$R_MIN
 MAX=$R_MAX
 # Generic parameters
-SCRIPT_DIR=$(cd $(dirname "${BASH_SOURCE[0]}") && pwd)
+SCRIPT_DIR=$(cd $(dirname $(dirname "${BASH_SOURCE[0]}")) && pwd)
 time python3 "$SCRIPT_DIR/compress_dataset.py" \
 "$SRC_DIR" \

@@ -7,8 +7,8 @@ set -euo pipefail
 # User-specific parameters
 ROOT_DIR=data
-SRC_DIR="$ROOT_DIR/ssl4eo-l7-l1"
-DST_DIR="$ROOT_DIR/ssl4eo-l7-l1-v2"
+SRC_DIR="$ROOT_DIR/ssl4eo_l_etm_toa"
+DST_DIR="$ROOT_DIR/ssl4eo_l_etm_toa_v2"
 NUM_WORKERS=40
 # Satellite-specific parameters
@@ -25,7 +25,7 @@ MIN=($R_MIN $R_MIN $R_MIN $R_MIN $R_MIN $T_MIN $T_MIN $R_MIN $R_MIN)
 MAX=($R_MAX $R_MAX $R_MAX $R_MAX $R_MAX $T_MAX $T_MAX $R_MAX $R_MAX)
 # Generic parameters
-SCRIPT_DIR=$(cd $(dirname "${BASH_SOURCE[0]}") && pwd)
+SCRIPT_DIR=$(cd $(dirname $(dirname "${BASH_SOURCE[0]}")) && pwd)
 time python3 "$SCRIPT_DIR/compress_dataset.py" \
 "$SRC_DIR" \

@@ -7,8 +7,8 @@ set -euo pipefail
 # User-specific parameters
 ROOT_DIR=data
-SRC_DIR="$ROOT_DIR/ssl4eo-l8-l2"
-DST_DIR="$ROOT_DIR/ssl4eo-l8-l2-v2"
+SRC_DIR="$ROOT_DIR/ssl4eo_l_oli_sr"
+DST_DIR="$ROOT_DIR/ssl4eo_l_oli_sr_v2"
 NUM_WORKERS=40
 # Satellite-specific parameters
@@ -21,7 +21,7 @@ MIN=$R_MIN
 MAX=$R_MAX
 # Generic parameters
-SCRIPT_DIR=$(cd $(dirname "${BASH_SOURCE[0]}") && pwd)
+SCRIPT_DIR=$(cd $(dirname $(dirname "${BASH_SOURCE[0]}")) && pwd)
 time python3 "$SCRIPT_DIR/compress_dataset.py" \
 "$SRC_DIR" \

@@ -7,8 +7,8 @@ set -euo pipefail
 # User-specific parameters
 ROOT_DIR=data
-SRC_DIR="$ROOT_DIR/ssl4eo-l8-l1"
-DST_DIR="$ROOT_DIR/ssl4eo-l8-l1-v2"
+SRC_DIR="$ROOT_DIR/ssl4eo_l_oli_tirs_toa"
+DST_DIR="$ROOT_DIR/ssl4eo_l_oli_tirs_toa_v2"
 NUM_WORKERS=40
 # Satellite-specific parameters
@@ -25,7 +25,7 @@ MIN=($R_MIN $R_MIN $R_MIN $R_MIN $R_MIN $R_MIN $R_MIN $R_MIN $R_MIN $T_MIN $T_MI
 MAX=($R_MAX $R_MAX $R_MAX $R_MAX $R_MAX $R_MAX $R_MAX $R_MAX $R_MAX $T_MAX $T_MAX)
 # Generic parameters
-SCRIPT_DIR=$(cd $(dirname "${BASH_SOURCE[0]}") && pwd)
+SCRIPT_DIR=$(cd $(dirname $(dirname "${BASH_SOURCE[0]}")) && pwd)
 time python3 "$SCRIPT_DIR/compress_dataset.py" \
 "$SRC_DIR" \

@@ -7,8 +7,8 @@ set -euo pipefail
 # User-specific parameters
 ROOT_DIR=data
-SRC_DIR="$ROOT_DIR/ssl4eo-l5-l1"
-DST_DIR="$ROOT_DIR/ssl4eo-l5-l1-v2"
+SRC_DIR="$ROOT_DIR/ssl4eo_l_tm_toa"
+DST_DIR="$ROOT_DIR/ssl4eo_l_tm_toa_v2"
 NUM_WORKERS=40
 # Satellite-specific parameters
@@ -25,7 +25,7 @@ MIN=($R_MIN $R_MIN $R_MIN $R_MIN $R_MIN $T_MIN $R_MIN)
 MAX=($R_MAX $R_MAX $R_MAX $R_MAX $R_MAX $T_MAX $R_MAX)
 # Generic parameters
-SCRIPT_DIR=$(cd $(dirname "${BASH_SOURCE[0]}") && pwd)
+SCRIPT_DIR=$(cd $(dirname $(dirname "${BASH_SOURCE[0]}")) && pwd)
 time python3 "$SCRIPT_DIR/compress_dataset.py" \
 "$SRC_DIR" \

@@ -7,15 +7,14 @@ set -euo pipefail
 # User-specific parameters
 ROOT_DIR=data
-L5_L1="$ROOT_DIR/ssl4eo-l5-l1/imgs"
-L7_L1="$ROOT_DIR/ssl4eo-l7-l1/imgs"
-L7_L2="$ROOT_DIR/ssl4eo-l7-l2/imgs"
-L8_L1="$ROOT_DIR/ssl4eo-l8-l1/imgs"
-L8_L2="$ROOT_DIR/ssl4eo-l8-l2/imgs"
+L5_L1="$ROOT_DIR/ssl4eo_l_tm_toa/imgs"
+L7_L1="$ROOT_DIR/ssl4eo_l_etm_toa/imgs"
+L7_L2="$ROOT_DIR/ssl4eo_l_etm_sr/imgs"
+L8_L1="$ROOT_DIR/ssl4eo_l_oli_tirs_toa/imgs"
+L8_L2="$ROOT_DIR/ssl4eo_l_oli_sr/imgs"
 # Generic parameters
-SCRIPT_DIR=$(cd $(dirname "${BASH_SOURCE[0]}") && pwd)
+SCRIPT_DIR=$(cd $(dirname $(dirname "${BASH_SOURCE[0]}")) && pwd)
 time python3 "$SCRIPT_DIR/delete_mismatch.py" "$L7_L1" "$L7_L2" --delete-different-locations --delete-different-dates
 time python3 "$SCRIPT_DIR/delete_mismatch.py" "$L8_L1" "$L8_L2" --delete-different-locations --delete-different-dates
 time python3 "$SCRIPT_DIR/delete_mismatch.py" "$L5_L1" "$L7_L1" "$L7_L2" "$L8_L1" "$L8_L2" --delete-different-locations

@@ -7,8 +7,8 @@ set -euo pipefail
 # User-specific parameters
 ROOT_DIR=data
-SAVE_PATH="$ROOT_DIR/ssl4eo-l7-l2"
-MATCH_FILE="$ROOT_DIR/ssl4eo-l-30/sampled_locations.csv"
+SAVE_PATH="$ROOT_DIR/ssl4eo_l_etm_sr"
+MATCH_FILE="$ROOT_DIR/ssl4eo_l_30/sampled_locations.csv"
 NUM_WORKERS=40
 START_INDEX=0
 END_INDEX=10
@@ -25,7 +25,7 @@ NEW_RESOLUTIONS=30
 DEFAULT_VALUE=0
 # Generic parameters
-SCRIPT_DIR=$(cd $(dirname "${BASH_SOURCE[0]}") && pwd)
+SCRIPT_DIR=$(cd $(dirname $(dirname "${BASH_SOURCE[0]}")) && pwd)
 CLOUD_PCT=20
 SIZE=264
 DTYPE=float32

@@ -7,8 +7,8 @@ set -euo pipefail
 # User-specific parameters
 ROOT_DIR=data
-SAVE_PATH="$ROOT_DIR/ssl4eo-l7-l1"
-MATCH_FILE="$ROOT_DIR/ssl4eo-l-30/sampled_locations.csv"
+SAVE_PATH="$ROOT_DIR/ssl4eo_l_etm_toa"
+MATCH_FILE="$ROOT_DIR/ssl4eo_l_30/sampled_locations.csv"
 NUM_WORKERS=40
 START_INDEX=0
 END_INDEX=10
@@ -24,7 +24,7 @@ ORIGINAL_RESOLUTIONS=(30 30 30 30 30 60 60 30 15)
 NEW_RESOLUTIONS=30
 # Generic parameters
-SCRIPT_DIR=$(cd $(dirname "${BASH_SOURCE[0]}") && pwd)
+SCRIPT_DIR=$(cd $(dirname $(dirname "${BASH_SOURCE[0]}")) && pwd)
 CLOUD_PCT=20
 SIZE=264
 DTYPE=float32

@@ -7,8 +7,8 @@ set -euo pipefail
 # User-specific parameters
 ROOT_DIR=data
-SAVE_PATH="$ROOT_DIR/ssl4eo-l5-raw"
-MATCH_FILE="$ROOT_DIR/ssl4eo-l-60/sampled_locations.csv"
+SAVE_PATH="$ROOT_DIR/ssl4eo_l_mss_raw"
+MATCH_FILE="$ROOT_DIR/ssl4eo_l_60/sampled_locations.csv"
 NUM_WORKERS=40
 START_INDEX=0
 END_INDEX=10
@@ -24,7 +24,7 @@ ORIGINAL_RESOLUTIONS=(60 60 60 30)
 NEW_RESOLUTIONS=60
 # Generic parameters
-SCRIPT_DIR=$(cd $(dirname "${BASH_SOURCE[0]}") && pwd)
+SCRIPT_DIR=$(cd $(dirname $(dirname "${BASH_SOURCE[0]}")) && pwd)
 CLOUD_PCT=20
 SIZE=264
 DTYPE=float32

@@ -7,8 +7,8 @@ set -euo pipefail
 # User-specific parameters
 ROOT_DIR=data
-SAVE_PATH="$ROOT_DIR/ssl4eo-l8-l2"
-MATCH_FILE="$ROOT_DIR/ssl4eo-l-30/sampled_locations.csv"
+SAVE_PATH="$ROOT_DIR/ssl4eo_l_oli_sr"
+MATCH_FILE="$ROOT_DIR/ssl4eo_l_30/sampled_locations.csv"
 NUM_WORKERS=40
 START_INDEX=0
 END_INDEX=10
@@ -25,7 +25,7 @@ NEW_RESOLUTIONS=30
 DEFAULT_VALUE=0
 # Generic parameters
-SCRIPT_DIR=$(cd $(dirname "${BASH_SOURCE[0]}") && pwd)
+SCRIPT_DIR=$(cd $(dirname $(dirname "${BASH_SOURCE[0]}")) && pwd)
 CLOUD_PCT=20
 SIZE=264
 DTYPE=float32

@@ -7,8 +7,8 @@ set -euo pipefail
 # User-specific parameters
 ROOT_DIR=data
-SAVE_PATH="$ROOT_DIR/ssl4eo-l8-l1"
-MATCH_FILE="$ROOT_DIR/ssl4eo-l-30/sampled_locations.csv"
+SAVE_PATH="$ROOT_DIR/ssl4eo_l_oli_tirs_toa"
+MATCH_FILE="$ROOT_DIR/ssl4eo_l_30/sampled_locations.csv"
 NUM_WORKERS=40
 START_INDEX=0
 END_INDEX=10
@@ -25,7 +25,7 @@ ORIGINAL_RESOLUTIONS=(30 30 30 30 30 30 30 15 30 30 30)
 NEW_RESOLUTIONS=30
 # Generic parameters
-SCRIPT_DIR=$(cd $(dirname "${BASH_SOURCE[0]}") && pwd)
+SCRIPT_DIR=$(cd $(dirname $(dirname "${BASH_SOURCE[0]}")) && pwd)
 CLOUD_PCT=20
 SIZE=264
 DTYPE=float32

@@ -7,8 +7,8 @@ set -euo pipefail
 # User-specific parameters
 ROOT_DIR=data
-SAVE_PATH="$ROOT_DIR/ssl4eo-l5-l1"
-MATCH_FILE="$ROOT_DIR/ssl4eo-l-30/sampled_locations.csv"
+SAVE_PATH="$ROOT_DIR/ssl4eo_l_tm_toa"
+MATCH_FILE="$ROOT_DIR/ssl4eo_l_30/sampled_locations.csv"
 NUM_WORKERS=40
 START_INDEX=0
 END_INDEX=10
@@ -24,7 +24,7 @@ ORIGINAL_RESOLUTIONS=(30 30 30 30 30 30 30)
 NEW_RESOLUTIONS=30
 # Generic parameters
-SCRIPT_DIR=$(cd $(dirname "${BASH_SOURCE[0]}") && pwd)
+SCRIPT_DIR=$(cd $(dirname $(dirname "${BASH_SOURCE[0]}")) && pwd)
 CLOUD_PCT=20
 SIZE=264
 DTYPE=float32

@@ -6,12 +6,12 @@
 set -euo pipefail
 # User-specific parameters
-SAVE_PATH=data/ssl4eo-l-30
+SAVE_PATH=data/ssl4eo_l_30
 START_INDEX=0
 END_INDEX=10
 # Generic parameters
-SCRIPT_DIR=$(cd $(dirname "${BASH_SOURCE[0]}") && pwd)
+SCRIPT_DIR=$(cd $(dirname $(dirname "${BASH_SOURCE[0]}")) && pwd)
 RES=30
 SIZE=264
 NUM_CITIES=10000

@@ -6,12 +6,12 @@
 set -euo pipefail
 # User-specific parameters
-SAVE_PATH=data/ssl4eo-l-60
+SAVE_PATH=data/ssl4eo_l_60
 START_INDEX=0
 END_INDEX=10
 # Generic parameters
-SCRIPT_DIR=$(cd $(dirname "${BASH_SOURCE[0]}") && pwd)
+SCRIPT_DIR=$(cd $(dirname $(dirname "${BASH_SOURCE[0]}")) && pwd)
 RES=60
 SIZE=264
 NUM_CITIES=10000

@@ -6,12 +6,12 @@
 set -euo pipefail
 # User-specific parameters
-SAVE_PATH=data/ssl4eo-l-conus
+SAVE_PATH=data/ssl4eo_l_conus
 START_INDEX=0
-END_INDEX=1000
+END_INDEX=10
 # Generic parameters
-SCRIPT_DIR=$(cd $(dirname "${BASH_SOURCE[0]}") && pwd)
+SCRIPT_DIR=$(cd $(dirname $(dirname "${BASH_SOURCE[0]}")) && pwd)
 SIZE=264
 RES=30