Uploading COCO

Convert COCO-format labels for upload to datagym.

COCO format

COCO (Common Objects in Context) is a popular public dataset with its own label format. The format combines all labels of the dataset into one json file per label task. As of 2019, COCO covers the following tasks: Image Captioning, Object Detection (bbox and polygon), Keypoint Detection, Stuff (semantic segmentation of common background objects), and Panoptic (advanced semantic segmentation).
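
To make the layout concrete, here is a minimal sketch of an object detection (instances) file, written as a Python dictionary. The field names follow the public COCO data format; the concrete values (file name, ids, coordinates) are made up for illustration and do not come from the real dataset.

```python
import json

# Made-up toy example of the COCO instances layout (not real dataset content).
coco_example = {
    "info": {"description": "Toy example", "year": 2017},
    "licenses": [],
    "images": [
        {"id": 1, "file_name": "000000000001.jpg", "width": 640, "height": 480},
    ],
    "categories": [
        {"id": 18, "name": "dog", "supercategory": "animal"},
    ],
    "annotations": [
        {
            "id": 100,
            "image_id": 1,       # refers to images[...]["id"]
            "category_id": 18,   # refers to categories[...]["id"]
            # Bounding box as [x, y, width, height] in pixels.
            "bbox": [10.0, 20.0, 200.0, 150.0],
            # Polygon(s) as flat coordinate lists [x1, y1, x2, y2, ...].
            "segmentation": [[10.0, 20.0, 210.0, 20.0, 210.0, 170.0, 10.0, 170.0]],
            "area": 30000.0,
            "iscrowd": 0,
        },
    ],
}

# The structure is plain json, so it can be written to and read from a file.
print(sorted(coco_example))
print(len(json.dumps(coco_example)), "characters when serialized")
```

All labels of one task live in a single file, and annotations point to images and categories through their ids.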

How to use our COCO importer

Prepare instances json file

The instances json file contains five main entries, three of which hold the information relevant for a label upload: images, annotations, and categories. The images list contains all images of the dataset with their metadata. The categories entry contains all categories with their relevant subcategories. The annotations entry is the most important part, as it lists all geometry labels (bbox and polygon) for the images. Use our Coco class with its add_object_detection_data method to prepare the instances json for upload. When calling this method, you can choose to prepare either the bbox annotations or the polygons they contain.

import json

from datagym.importers.coco import Coco

coco = Coco()
# Replace the placeholder with the path to your instances file.
with open("<PATH_TO_instances_YEAR.json>") as json_file:
    instances_data = json.load(json_file)
# Here we chose to prepare the polygons.
coco.add_object_detection_data(instances_data, polygon=True)
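
Before handing the file to the importer, it can help to confirm that the three relevant entries are present and consistent. The following is a minimal sketch in plain Python; check_instances is a hypothetical helper written for this page and is not part of the datagym package.

```python
def check_instances(instances_data):
    """Return a list of human-readable problems found in a loaded instances dict.

    Hypothetical helper for illustration; it only uses field names from the
    public COCO format (images, annotations, categories and their ids).
    """
    problems = []
    for key in ("images", "annotations", "categories"):
        if key not in instances_data:
            problems.append(f"missing top-level entry: {key}")
    if problems:
        return problems

    image_ids = {img["id"] for img in instances_data["images"]}
    category_ids = {cat["id"] for cat in instances_data["categories"]}
    for ann in instances_data["annotations"]:
        if ann["image_id"] not in image_ids:
            problems.append(f"annotation {ann['id']}: unknown image_id {ann['image_id']}")
        if ann["category_id"] not in category_ids:
            problems.append(f"annotation {ann['id']}: unknown category_id {ann['category_id']}")
    return problems


# Toy data (made up) with one valid and one dangling annotation.
toy_instances = {
    "images": [{"id": 1, "file_name": "a.jpg"}],
    "categories": [{"id": 18, "name": "dog", "supercategory": "animal"}],
    "annotations": [
        {"id": 100, "image_id": 1, "category_id": 18, "bbox": [0, 0, 10, 10]},
        {"id": 101, "image_id": 2, "category_id": 18, "bbox": [0, 0, 5, 5]},
    ],
}

for problem in check_instances(toy_instances):
    print(problem)
```

An empty result means every annotation refers to an image and a category that exist in the same file.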

Prepare captions json file

The captions json file has a similar layout to the instances json file. Image captions give a short description of what can be seen in the image. To prepare your captions json file for upload, use the add_captions_data method of the Coco class.

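import json

from datagym.importers.coco import Coco

coco = Coco()
# Replace the placeholder with the path to your captions file.
with open("<PATH_TO_captions_YEAR.json>") as json_file:
    captions_data = json.load(json_file)
# Prepare the captions with add_captions_data. Passing the loaded json
# mirrors add_object_detection_data and is an assumption.
coco.add_captions_data(captions_data)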
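
To see what the captions file contains before uploading it, the captions can be grouped by image. This is a minimal sketch in plain Python; captions_by_image is a hypothetical helper, and the toy data is made up. It relies only on the public COCO captions layout, where each annotation carries an image_id and a caption string.

```python
from collections import defaultdict


def captions_by_image(captions_data):
    """Map each image file name to the list of its captions (hypothetical helper)."""
    file_names = {img["id"]: img["file_name"] for img in captions_data["images"]}
    grouped = defaultdict(list)
    for ann in captions_data["annotations"]:
        grouped[file_names[ann["image_id"]]].append(ann["caption"])
    return dict(grouped)


# Made-up toy data in the captions layout.
toy_captions = {
    "images": [{"id": 1, "file_name": "a.jpg"}, {"id": 2, "file_name": "b.jpg"}],
    "annotations": [
        {"id": 10, "image_id": 1, "caption": "A dog on a couch."},
        {"id": 11, "image_id": 1, "caption": "A brown dog resting."},
        {"id": 12, "image_id": 2, "caption": "Two bikes next to a wall."},
    ],
}

for name, captions in captions_by_image(toy_captions).items():
    print(name, captions)
```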

Prepare your datagym project

Before you use the datagym COCO importer, make sure to upload the images to a datagym dataset, using the same image names as in the COCO json files. In your Python code, you then build an image_ids_dict that maps each external image name to its internal datagym image id.

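# `dataset` is the datagym dataset that holds the uploaded images.
image_ids_dict = dict()
for image in dataset.images:
    image_ids_dict[image.image_name] = image.id
# Show the last entry as an example (requires at least one image in the dataset).
print(f'One example from the image_ids_dict:\n {{"{image.image_name}" : "{image_ids_dict[image.image_name]}"}}')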
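
Since labels can only be matched to images whose names exist in the dataset, it is worth checking for missing names before the upload. The sketch below is plain Python; missing_images is a hypothetical helper, and it assumes that the datagym image name equals the file_name field of the COCO json file, as described above.

```python
def missing_images(coco_data, image_ids_dict):
    """Return the COCO file names that have no entry in image_ids_dict.

    Hypothetical helper; assumes the datagym image name equals the COCO
    "file_name" field.
    """
    return sorted(
        img["file_name"]
        for img in coco_data["images"]
        if img["file_name"] not in image_ids_dict
    )


# Made-up toy data: two images in the COCO file, only one uploaded.
toy_coco = {"images": [{"id": 1, "file_name": "a.jpg"}, {"id": 2, "file_name": "b.jpg"}]}
toy_image_ids = {"a.jpg": "internal-id-1"}

not_uploaded = missing_images(toy_coco, toy_image_ids)
if not_uploaded:
    print("Upload these images first:", not_uploaded)
```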

Additionally, the label configuration must be set up manually so that it contains all relevant entries. In this example, we wanted to upload instances_val2017.json and captions_val2017.json to datagym and set up the following label configuration. Within the label configuration, it is important that the subcategories for each bounding box/polygon are named as follows: "<category_name>_type".

Label configuration for instances_val2017.json and captions_val2017.json
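
The naming rule can be sketched as a small helper that turns a list of category names into the expected "<category_name>_type" keys. expected_type_names is a hypothetical function for illustration only. Which field of the COCO categories entry corresponds to <category_name> is not stated here, so the names passed in below are placeholders.

```python
def expected_type_names(category_names):
    """Build the "<category_name>_type" keys for the given names (hypothetical helper)."""
    return sorted({f"{name}_type" for name in category_names})


# Placeholder names; take the real ones from your own label configuration.
for key in expected_type_names(["animal", "vehicle", "animal"]):
    print(key)
```

Comparing this list against the keys in your label configuration is a quick way to spot a misspelled entry before uploading.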

Upload to datagym

Once the relevant coco labels are prepared for upload and the datagym project is prepared as described above, the labels are ready to be uploaded:

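# `client` is your datagym client and PROJECT_NAME is the name of your project.
upload_dict = coco.get_datagym_label_dict(image_ids_dict)
project = client.get_project_by_name(project_name=PROJECT_NAME)
# Keep the return value so that any reported import errors can be inspected.
errors = client.import_label_data(project_id=project.id, label_data=upload_dict)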