Coco format
In coco, we use `file_name` and `zip_file` to construct the `file_path` in `ImageDataManifest` mentioned in README.md. If `zip_file` is present, it means that the image is zipped into a zip file for storage & access, and the path within the zip is `file_name`. If `zip_file` is not present, the image path is simply `file_name`.
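As a rough illustration (this is not the library's API), resolving an image entry to a path could look like the sketch below. `resolve_file_path` is a hypothetical helper, and the `<zip>@<path-in-zip>` string convention is the one used later in the image matting section.

```python
def resolve_file_path(image_entry: dict) -> str:
    # If the image is stored inside a zip, combine "zip_file" and "file_name"
    # using the "<zip>@<path-in-zip>" convention; otherwise use "file_name" directly.
    if 'zip_file' in image_entry:
        return f"{image_entry['zip_file']}@{image_entry['file_name']}"
    return image_entry['file_name']

print(resolve_file_path({'file_name': 'train_images/siberian-kitten.jpg', 'zip_file': 'train_images.zip'}))
# train_images.zip@train_images/siberian-kitten.jpg
```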
Image classification (multiclass and multilabel)
Here is one example of the train.json, val.json, or test.json referenced in the `DatasetInfo` above. Note that the `id` for `images`, `annotations` and `categories` should be consecutive integers, starting from 1. Our lib might work with ids starting from 0, but many tools like CVAT and the official COCO API will fail.
{
"images": [{"id": 1, "width": 224.0, "height": 224.0, "file_name": "train_images/siberian-kitten.jpg", "zip_file": "train_images.zip"},
{"id": 2, "width": 224.0, "height": 224.0, "file_name": "train_images/kitten 3.jpg", "zip_file": "train_images.zip"}],
// file_name is the image path, which supports three formats as described in previous section.
"annotations": [
{"id": 1, "category_id": 1, "image_id": 1},
{"id": 2, "category_id": 1, "image_id": 2},
{"id": 3, "category_id": 2, "image_id": 2}
],
"categories": [{"id": 1, "name": "cat", "supercategory": "animal"}, {"id": 2, "name": "dog", "supercategory": "animal"}]
}
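Grouping annotations by image is enough to recover the (multi)label targets, for example with the standard library alone. This is only an illustration; train.json is a placeholder file name, and the `//` comment in the example above is documentation, not valid JSON.

```python
import json
from collections import defaultdict

with open('train.json') as f:  # placeholder path to a coco file like the one above
    coco = json.load(f)

labels_per_image = defaultdict(list)
for ann in coco['annotations']:
    labels_per_image[ann['image_id']].append(ann['category_id'])

# For the example above: {1: [1], 2: [1, 2]} -- image 2 carries both "cat" and "dog" (multilabel).
print(dict(labels_per_image))
```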
Object detection
{
"images": [{"id": 1, "width": 224.0, "height": 224.0, "file_name": "train_images/siberian-kitten.jpg", "zip_file": "train_images.zip"},
{"id": 2, "width": 224.0, "height": 224.0, "file_name": "train_images/kitten 3.jpg", "zip_file": "train_images.zip"}],
"annotations": [
{"id": 1, "category_id": 1, "image_id": 1, "bbox": [10, 10, 100, 100]},
{"id": 2, "category_id": 1, "image_id": 2, "bbox": [100, 100, 200, 200]},
{"id": 3, "category_id": 2, "image_id": 2, "bbox": [20, 20, 200, 200], "iscrowd": 1}
],
"categories": [{"id": 1, "name": "cat"}, {"id": 2, "name": "dog"}]
}
You might notice that the 3rd box has an "iscrowd" field. It specifies whether the box covers a crowd of objects.
BBox Format
The bbox format should be absolute pixel positions following either `ltwh: [left, top, width, height]` or `ltrb: [left, top, right, bottom]`. `ltwh` is the default format. To work with `ltrb`, please specify `bbox_format` to be `ltrb` in the coco json file.
{
"bbox_format": "ltrb",
"images": ...,
"annotations": ...,
"categories": ...
}
Note that
- `ltrb` used to be the default. If your coco annotations were prepared to work with this repo before version 0.1.2, please add `"bbox_format": "ltrb"` to your coco file.
- Regardless of the format bboxes are stored in within the coco file, when annotations are transformed into `ImageDataManifest`, the bbox will be unified into `ltrb: [left, top, right, bottom]`.
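As a minimal sketch of how the two conventions relate (these helpers are illustrative, not part of the library):

```python
def ltwh_to_ltrb(bbox):
    # [left, top, width, height] -> [left, top, right, bottom]
    left, top, width, height = bbox
    return [left, top, left + width, top + height]

def ltrb_to_ltwh(bbox):
    # [left, top, right, bottom] -> [left, top, width, height]
    left, top, right, bottom = bbox
    return [left, top, right - left, bottom - top]

assert ltwh_to_ltrb([10, 10, 100, 100]) == [10, 10, 110, 110]
```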
Image caption
Here is one example of the json file for the image caption task.
{
"images": [{"id": 1, "file_name": "train_images/honda.jpg", "zip_file": "train_images.zip"},
{"id": 2, "file_name": "train_images/kitchen.jpg", "zip_file": "train_images.zip"}],
"annotations": [
{"id": 1, "image_id": 1, "caption": "A black Honda motorcycle parked in front of a garage."},
{"id": 2, "image_id": 1, "caption": "A Honda motorcycle parked in a grass driveway."},
{"id": 3, "image_id": 2, "caption": "A black Honda motorcycle with a dark burgundy seat."},
],
}
Image text matching
Here is one example of the json file for the image text matching task. `match` is a float in [0, 1], where 0 means no match at all and 1 means a perfect match.
{
"images": [{"id": 1, "file_name": "train_images/honda.jpg", "zip_file": "train_images.zip"},
{"id": 2, "file_name": "train_images/kitchen.jpg", "zip_file": "train_images.zip"}],
"annotations": [
{"id": 1, "image_id": 1, "text": "A black Honda motorcycle parked in front of a garage.", "match": 0},
{"id": 2, "image_id": 1, "text": "A Honda motorcycle parked in a grass driveway.", "match": 0},
{"id": 3, "image_id": 2, "text": "A black Honda motorcycle with a dark burgundy seat.", "match": 1},
],
}
Image matting
Here is one example of the json file for the image matting task. The "label" in the "annotations" can be one of the following formats:
- a local path to the label file
- a local path in a non-compressed zip file (e.g. `c:\foo.zip@bar.png`)
- a url to the label file

Only image files are supported for the label files. The ground truth image should be a single-channel image (i.e. `PIL.Image` mode "L", instead of "RGB") with the same width and height as the image file. Refer to the images in tests/image_matting_test_data.zip as an example.
{
"images": [{"id": 1, "file_name": "train_images/image/test_1.jpg", "zip_file": "train_images.zip"},
{"id": 2, "file_name": "train_images/image/test_2.jpg", "zip_file": "train_images.zip"}],
"annotations": [
{"id": 1, "image_id": 1, "label": "image_matting_label/mask/test_1.png", "zip_file": "image_matting_label.zip"},
{"id": 2, "image_id": 2, "label": "image_matting_label/mask/test_2.png", "zip_file": "image_matting_label.zip"},
]
}
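A minimal sketch of loading and checking such a label (illustrative only, not the library API), assuming the `<zip>@<path-in-zip>` convention for labels stored in a zip and a hypothetical `load_matting_label` helper:

```python
import zipfile
from PIL import Image

def load_matting_label(path: str) -> Image.Image:
    # Resolve "<zip>@<path-in-zip>" if present, otherwise treat the path as a plain local file.
    if '@' in path:
        zip_path, inner = path.split('@', 1)
        with zipfile.ZipFile(zip_path) as zf, zf.open(inner) as f:
            label = Image.open(f)
            label.load()  # read the pixel data before the zip member is closed
    else:
        label = Image.open(path)
    assert label.mode == 'L', 'matting label should be a single-channel ("L" mode) image'
    return label
```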
Visual Question Answering
VQA represents the problem where a question is asked about an image and a ground-truth answer is provided.
{
"images": [
{"id": 1, "zip_file": "test1.zip", "file_name": "test/0/image_1.jpg"},
{"id": 2, "zip_file": "test2.zip", "file_name": "test/1/image_2.jpg"}
],
"annotations": [
{"image_id": 1, "id": 1, "question": "what animal is in the image?", "answer": "a cat"},
{"image_id": 2, "id": 2, "question": "What is the title of the book on the shelf?", "answer": "How to make bread"}
]
}
Visual Object Grounding
Visual Object Grounding is a problem where a text query/question about an image is provided, and an answer/caption about the image along with the most relevant grounding(s) are returned.
A grounding is composed of three parts:
- `bbox`: bounding box around the region of interest, same as in the object detection task. Similarly, you can specify `ltrb` or `ltwh` (default) in the coco json. Regardless, the label manifest will store the bbox in [left, top, right, bottom] format, like object detection.
- `text`: description of the region
- `text_span`: two ints (start-inclusive, end-exclusive), indicating the section of text that the region is relevant to in the answer/caption
{
"images": [
{"id": 1, "zip_file": "test1.zip", "file_name": "test/0/image_1.jpg"},
{"id": 2, "zip_file": "test2.zip", "file_name": "test/1/image_2.jpg"}
],
"annotations": [
{
"image_id": 1,
"id": 1,
"question": "whats animal are in the image?",
"answer": "cat and bird",
"groundings": [
{"text": "a cat", "text_span": [0, 2], "bboxes": [[10, 10, 100, 100], [15, 15, 100, 100]]},
{"text": "a bird", "text_span": [3, 4], "bboxes": [[15, 15, 30, 30], [0, 10, 20, 20]]}
]
},
{
"image_id": 2,
"id": 2,
"question": "What is the title and auther of the book on the shelf?",
"answer": "Tile is baking and auther is John",
"groundings": [
{"text": "Title: Baking", "text_span": [0, 2], "bboxes": [[10, 10, 100, 100]]},
{"text": "Author: John", "text_span": [3, 4], "bboxes": [[0, 0, 50, 50], [15, 15, 25, 25]]}
]
}
]
}
Image regression
Here is one example of the json file for the image regression task, where the "target" in the "annotations" field is a real-valued number (e.g. a score, an age, etc.). Note that each image should only have one regression target (i.e. there should be exactly one annotation for each image).
{
"images": [{"id": 1, "width": 224.0, "height": 224.0, "file_name": "train_images/image_1.jpg", "zip_file": "train_images.zip"},
{"id": 2, "width": 224.0, "height": 224.0, "file_name": "train_images/image_2.jpg", "zip_file": "train_images.zip"}],
"annotations": [
{"id": 1, "image_id": 1, "target": 102.0},
{"id": 2, "image_id": 2, "target": 28.5}
]
}
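Since exactly one annotation is expected per image, a quick sanity check on a loaded coco dict could look like this (illustrative only):

```python
from collections import Counter

def check_one_target_per_image(coco: dict) -> None:
    # Every image id must appear in exactly one annotation.
    counts = Counter(ann['image_id'] for ann in coco['annotations'])
    bad = {img['id']: counts.get(img['id'], 0)
           for img in coco['images'] if counts.get(img['id'], 0) != 1}
    assert not bad, f'images without exactly one regression target: {bad}'
```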
Image retrieval
This task is a pure representation of data where images are retrieved by text queries.
{
"images": [
{"id": 1, "zip_file": "test1.zip", "file_name": "test/0/image_1.jpg"},
{"id": 2, "zip_file": "test2.zip", "file_name": "test/1/image_2.jpg"}
],
"annotations": [
{"image_id": 1, "id": 1, "query": "Men eating a banana."},
{"image_id": 2, "id": 2, "query": "An apple on the desk."}
]
}
MultiTask dataset
A multitask dataset is one where a single set of images possesses multiple sets of annotations for different tasks among the single tasks mentioned above.
For example, a set of people images can have different attributes: gender/classification {male, female, other}, height/regression: {0-300cm}, person location/detection: {x, y, w, h}, etc.
Representing this kind of dataset is simple: create one independent coco file for each task:
people_dataset/
train_images/
...
test_images/
...
train_images.zip
test_images.zip
train_coco_gender.json
test_coco_gender.json
train_coco_height.json
test_coco_height.json
train_coco_location.json
test_coco_location.json
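For example, the per-task coco files from the layout above can be loaded independently and keyed by task name (a plain-stdlib sketch, not the library API):

```python
import json

tasks = ['gender', 'height', 'location']  # task names from the layout above
train_cocos = {}
for task in tasks:
    with open(f'people_dataset/train_coco_{task}.json') as f:
        train_cocos[task] = json.load(f)
# All tasks share the same images (train_images.zip); only the annotations differ.
```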
KeyValuePair dataset
It is a generic image-text dataset. For each sample, the input consists of one or more images and a text. The output is represented as a dictionary, where the keys are the fields of interest. Each dataset is associated with a schema that defines the task, the fields of interest and the format of those fields. The schema format follows the JSON Schema style, and is defined below:
Property | Type | Details | Required? |
---|---|---|---|
name | string | schema name | yes |
description | string | detailed description of the schema. e.g. Extract defect location and type from an image of metal screws on an assembly line. | no, but strongly recommended to provide |
fieldSchema | dict[string\|number\|integer, FieldSchema] | schemas of fields | yes |
The schema of each field is defined by `FieldSchema`, recursively:
Property | Type | Details | Required? |
---|---|---|---|
type | FieldValueType | JSON type: string, number, integer, boolean, array, object. | yes |
description | string | describes the field in more detail. | no |
examples | list[string] | examples of field content. | no |
classes | dict[str, ClassSchema] | dictionary that maps each class name to `ClassSchema`. | no |
properties | dict[string, FieldSchema] | defines the `FieldSchema` of each subfield. | yes when type is `object` |
items | FieldSchema | defines the `FieldSchema` for all items in the array. | yes when type is `array` |
includeGrounding | boolean | whether the annotation of this field has bbox groundings associated; if true, bboxes are stored in the `groundings` field of the annotation. bboxes follow BBox Format. Only single-image annotation is supported. | no, default false |
Definition of `ClassSchema`:
Property | Type | Details | Required? |
---|---|---|---|
description | string | describes the class in more detail, e.g., "long, thin, surface-level mark" | no. Default: null |
For example, a visual question answering task schema is:
{
"name": "Visual question answering",
"description": "Answer questions on given images and provide rationales.",
"fieldSchema": {
"answer": {
"type": "string",
"description": "Answer to the question."
},
"rationale": {
"type": "string",
"description": "Rationale of the answer."
}
}
}
The fields of interest are `answer` and `rationale`.
In addition, a defect detection schema can be defined as
{
"name": "Defect detection - screws",
"description": "Extract defect location and type from an image of metal screws on an assembly line",
"fieldSchema": {
"defects": {
"type": "array",
"description": "The defect types with bounding boxes detected in the image",
"items": {
"type": "string",
"description": "The type of defect detected",
"classes": {
"scratch": {"description": "long, thin, surface-level mark"},
"dent": {"description": "appears to be caving in"},
"discoloration": {"description": "color is abnormal"},
"crack": {"description": "deeper mark than a scratch"}
},
"includeGrounding": true
}
}
}
}
We can see it is an object detection task with four classes: scratch, dent, discoloration, crack.
More examples can be found in DATA_PREPARATION.md, and more details in vision-datasets/vision_datasets/key_value_pair/manifest.py.
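As a rough illustration of how such a schema constrains annotations (this is not the library's validator; `validate_class_values` is a hypothetical helper), one could check reported values against the declared classes:

```python
def validate_class_values(field_schema: dict, values: list) -> None:
    # Every reported value must be one of the classes declared for the array items.
    allowed = set(field_schema['items']['classes'])
    unknown = [v for v in values if v not in allowed]
    if unknown:
        raise ValueError(f'unknown classes: {unknown}')

# Mimics the "defects" field of the schema above.
defects_schema = {'items': {'classes': {'scratch': {}, 'dent': {}, 'discoloration': {}, 'crack': {}}}}
validate_class_values(defects_schema, ['scratch', 'dent'])  # passes
```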
Once the schema is defined, we can construct the dataset. In detail, each sample consists of:
- input:
  - images: each image is optionally associated with a metadata dictionary which stores the text attributes of interest for the image. For example, the image is a product catalog image: `{'metadata': {'catalog': true}}`; the capture location of an image: `{'metadata': {'location': 'street'}}`; information about the assembly component captured in an image of a defect detection dataset: `{'metadata': {'name': 'Hex Head Lag Screw', 'type': '3/8-inch x 4-inch'}}`
  - text (optional): a dictionary with keys being field names, e.g. `{'text': {'question': 'a specific question related to the images input'}}`
- output:
  - fields: a dictionary with keys being the fields of interest, and values being dictionaries that store the actual field value in "value" and optionally a list of grounded bboxes in "groundings". "groundings" are for single-image annotation only. Each bbox follows BBox Format. The format of each field should comply with the defined `fieldSchema`.
The dataset format is a simple variation of COCO, where the `image_id` of an annotation entry is replaced with `image_ids` to support multi-image annotation.
In each annotation entry, `fields` is required and `text` is optional. In each image entry, `metadata` is optional. Below is an example of multi-image question answering.
{
"images": [
{"id": 1, "zip_file": "test1.zip", "file_name": "test/0/image_1.jpg", "metadata": {"location": "street"}},
{"id": 2, "zip_file": "test2.zip", "file_name": "test/1/image_2.jpg"}
],
"annotations": [
{
"id": 1, "image_ids": [1, 2],
"text": {"question": "What objects are unique in the first image compared to the second image?"},
"fields": {
"answer": {"value": "car"},
"rationale": {"value": "Both images capture street traffic, a car exists in the first image but not in the second."}
}
},
{
"id": 2, "image_ids": [2, 1],
"text": {"question": "Does the first image have more cars?"},
"fields": {
"answer": {"value": "yes"},
"rationale": {"value": "First image has no car, second image has one."}
}
}
]
}
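To show how the multi-image entries above fit together, here is an illustrative (non-library) sketch that resolves `image_ids` back to image entries and reads the answer field; kvp_qa.json is a placeholder name for the json above:

```python
import json

with open('kvp_qa.json') as f:  # placeholder name for the json above
    coco = json.load(f)

images_by_id = {img['id']: img for img in coco['images']}
for ann in coco['annotations']:
    image_files = [images_by_id[i]['file_name'] for i in ann['image_ids']]
    print(ann['text']['question'], '->', ann['fields']['answer']['value'], image_files)
```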
Another example for object detection:
{
"images": [
{
"id": 1,
"width": 224,
"height": 224,
"file_name": "1.jpg",
"zip_file": "test.zip"
}
],
"annotations": [
{
"id": 1,
"image_ids": [1],
"fields": {
"defects": {
"value": [
{"value": "scratch", "groundings": [[10, 10, 10, 10], [30, 30, 10, 10]]},
{"value": "dent", "groundings": [[80, 80, 20, 20]]}
]
}
}
}
]
}
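For the detection-style example, the grounded boxes can be read off each item's `groundings` list (illustrative only; the boxes follow the default `ltwh` format unless `bbox_format` says otherwise):

```python
import json

with open('kvp_defects.json') as f:  # placeholder name for the json above
    coco = json.load(f)

for ann in coco['annotations']:
    for item in ann['fields']['defects']['value']:
        for left, top, width, height in item['groundings']:
            print(item['value'], (left, top, width, height))
# scratch (10, 10, 10, 10)
# scratch (30, 30, 10, 10)
# dent (80, 80, 20, 20)
```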