README.md: describe the new ingest pipeline

Peter Williams 2024-07-01 10:45:41 -04:00
Parent d398ded2f4
Commit e7de1819bf
1 changed file with 75 additions and 6 deletions


@@ -1,9 +1,10 @@
# wwt-core-catalogs
The purpose of this repository is to manage the image data comprising WWT's core
data holdings. Three major applications are maintenance of the WWT "legacy"
WTML/XML metadata files, automating the ingestion of the core data in the
Constellations system, and automating the ingestion of new content that's not
yet in Constellations *or* the legacy system.
Python package requirements of note:
@@ -103,9 +104,9 @@ curl -fsSL "https://www.astropix.org/link/39po?format=json" -o astropix/all.json
database: "publisher ID is not adsdadasdasdas", basically.)
## Main Driver
Most operations are driven from the script `./cattool.py`, which has a Git-like
subcommand interface.
@@ -362,6 +363,74 @@ script to gradually import data from the core corpus into the Constellations
system.
## Data Ingest Pipeline
Steps to ingest new data are driven from the script `./pipeline.py`, which also
has a Git-like subcommand interface. It is derived from the [Toasty Pipeline]
framework but adds in elements from the `cxprep` framework as well. Note that
this framework is intended for *new* images that haven't yet been ingested into
the local databases (e.g., `imagesets` and `places`); for ones that have been,
use the `cxprep` framework.
[Toasty Pipeline]: https://toasty.readthedocs.io/en/latest/pipeline.html
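
All of the subcommands described below follow the same Git-like invocation
pattern; because you run them from inside a feed directory, the script is
referenced as `../../pipeline.py`:

```sh
# Synopsis only; the concrete subcommands (refresh, fetch, process-todos,
# upload, backfill) are documented below.
../../pipeline.py <SUBCOMMAND> [ARGS...]
```
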
To prepare to run the pipeline, do the following (a condensed shell sketch
follows the list):
1. Make sure that you've downloaded the AstroPix database, as described in
[Approach: AstroPix](#approach-astropix).
2. Change to a sub-directory in the `feeds` directory, e.g., `feeds/hst`.
3. Ensure that a `corepipe-storage.yaml` file exists there. This has the same
format as the `toasty-store-config.yaml` file created by the [`toasty
pipeline init`] command; you can just copy an existing file from a Toasty
Pipeline setup to `corepipe-storage.yaml` here.
4. Optionally, run the `pipeline backfill` command, described below, to backfill
data from an existing Toasty Pipeline setup.
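
Condensed into shell commands, the preparation might look like the following
sketch; the `feeds/hst` directory and the source path for the YAML file are
illustrative, not requirements:

```sh
# (Assumes the AstroPix database has already been downloaded; see above.)
# Work from a feed subdirectory; feeds/hst is just an example.
cd feeds/hst

# Provide cloud-storage settings. The file has the same format as the
# toasty-store-config.yaml created by `toasty pipeline init`, so copying an
# existing one (the source path here is hypothetical) is enough.
cp ~/toasty-work/toasty-store-config.yaml corepipe-storage.yaml

# Optionally, backfill records from an existing Toasty Pipeline session;
# merged.wtml is a placeholder name (see `pipeline backfill` below).
../../pipeline.py backfill merged.wtml
```
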
Once you're prepared, the pipeline steps are as follows (a complete example
session is sketched after the list):
- The `../../pipeline.py refresh` command downloads information about images that
could be processed.
- The `../../pipeline.py fetch <IDS...>` command selects specific images for
processing, downloading their data.
- The `../../pipeline.py process-todos` command tiles all images that have been
fetched, and inserts their information into the `prep.txt` file associated
with the feed.
- At this point, you can review tiled images and edit their metadata in the
`prep.txt` file. When an image is ready to publish, remove the `wip: yes`
statement from its record in the `prep.txt` file.
- The `../../pipeline.py upload` command processes all images that have been
marked as ready (not `wip`) in the `prep.txt` file. It updates the WTML files
based on any edits to `prep.txt`; uploads the associated data to the cloud;
registers the images with Constellations *in the unpublished state*; and also adds them
to the `imagesets` and `places` databases. After uploading, you should `git
commit` the local database updates and push your changes.
- Finally, for all new scenes that are uploaded to Constellations, you should
review their display in the Constellations UI and mark them as "Published" to
make them publicly visible once they're ready.
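
Putting these steps together, a complete session might look like the following
sketch. The image IDs are made up, and the exact `prep.txt` syntax is whatever
`process-todos` emits for your feed:

```sh
# Discover candidate images and pick a couple to work on (IDs are illustrative).
../../pipeline.py refresh
../../pipeline.py fetch opo2101a opo2102b

# Tile everything that was fetched and record it in prep.txt.
../../pipeline.py process-todos

# Review the tiled images, fix up metadata, and remove the `wip: yes` line
# from each record that is ready to publish.
$EDITOR prep.txt

# Upload the ready images, register them (unpublished) with Constellations,
# and update the local imagesets/places databases.
../../pipeline.py upload

# Commit and push the database updates, then publish the new scenes from the
# Constellations UI once they look right.
git commit -am "Ingest new images" && git push
```
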
[`toasty pipeline init`]: https://toasty.readthedocs.io/en/latest/cli/pipeline-init.html
### `pipeline backfill <WTML-FILE>`
This command is intended to help backfill data into this ingest pipeline from
a preexisting setup for the [Toasty Pipeline].
The argument should be a single merged WTML file that contains information about
*all* of the images currently being worked on as part of a Toasty Pipeline
session. This could be generated with something like [`wwtdatatool wtml merge`].
This command will process that file and use it to generate the `prep.txt` file
that drives the publication process. The unique IDs of the images are inferred
from the thumbnail image URL associated with each imageset.
[`wwtdatatool wtml merge`]: https://wwt-data-formats.readthedocs.io/en/latest/cli/wtml-merge.html
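
As a sketch, generating the merged file and running the backfill might look
like this; the input path is a placeholder for wherever your Toasty Pipeline
session lives, and the exact arguments to the merge command should be checked
against the `wwtdatatool` documentation:

```sh
# Merge the per-image WTML files from a Toasty Pipeline working tree into a
# single file (paths are hypothetical).
wwtdatatool wtml merge ~/toasty-work/processed/*/index_rel.wtml merged.wtml

# Generate this feed's prep.txt records from the merged file.
../../pipeline.py backfill merged.wtml
```
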
Once you have done this, you need to copy the Toasty Pipeline `processed` and/or
`uploaded` directories into the relevant feed directory here. The format of the
files in those directories is the same as in the Toasty Pipeline framework.
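
Copying those directories over is an ordinary file copy; for instance (the
source path is illustrative):

```sh
# Bring the Toasty Pipeline working directories into this feed directory.
cp -r ~/toasty-work/processed ~/toasty-work/uploaded .
```
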
## See also
The [`wwt-hips-list-importer`][hips] repo contains a script for generating the