ORBIT-Dataset/orbit_challenge_getting_sta...

{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# ORBIT Challenge - Getting Started\n",
    "\n",
    "This notebook will step you through a simple starter task which you can use to get you started on the [ORBIT Few-Shot Object Recognition Challenge 2023](https://eval.ai/web/challenges/challenge-page/1896). In this starter task, you will download a few-shot learning model (Prototypical Networks, Snell et al., 2017) trained on the ORBIT train set, and use it to generate frame predictions on the ORBIT validation set. The predictions will be saved in a JSON in the format required by the Challenge's evaluation server. You can upload this JSON under the 'Starter Task' phase on the evaluation server to check your implementation.\n",
    "\n",
    "This notebook has been tested using the conda environment specified in [environment.yml](environment.yml)."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First, we need a local copy of the ORBIT dataset. If you already have a copy of the data, you can skip this step!\n",
    "\n",
    "In this script, we will download a local copy of the validation data (already resized to 224x224 frames) as well as extra frame annotations (e.g. object bounding boxes, quality issues) for the train, validation and test data. This will take ~4.3GB of disk space. Note, the validation data comes from 6 validation users and is used here as a starter task. For the main Challenge, you will need to use the test data which comes from a different set of 17 test users. \n",
    "\n",
    "To download the full dataset, you can use [download_pretrained_dataset.py](scripts/download_pretrained_dataset.py). The full dataset takes up 83GB (1080x1080 frames) or 54GB (224x224 frames)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from pathlib import Path\n",
    "\n",
    "DATA_ROOT = \"orbit_benchmark_224\" # note, we are downloading the validation set already resized to 224x224 frames\n",
    "DATA_SPLIT = \"validation\"\n",
    "validation_path = Path(DATA_ROOT, DATA_SPLIT)\n",
    "annotation_path = Path(DATA_ROOT, 'annotations')\n",
    "\n",
    "# download validation split\n",
    "if not validation_path.is_dir():\n",
    "    validation_path.parent.mkdir(parents=True, exist_ok=True)\n",
    "    print(\"Downloading validation.zip...\")\n",
    "    !wget -O validation.zip https://city.figshare.com/ndownloader/files/28368351\n",
    "\n",
    "    print(\"Unzipping validation.zip...\")\n",
    "    !unzip -q validation.zip -d {DATA_ROOT}\n",
    "\n",
    "    if not validation_path.is_dir():\n",
    "        raise ValueError(f\"Path {validation_path} is not a directory.\")\n",
    "    else:\n",
    "        print(f\"Dataset ready at {validation_path}.\")\n",
    "    # You can now delete the zip file.\n",
    "else:\n",
    "    print(f\"Dataset already saved at {validation_path}.\")\n",
    "\n",
    "# download (train, validation and test) annotations\n",
    "if not annotation_path.is_dir():\n",
    "    annotation_path.parent.mkdir(parents=True, exist_ok=True)\n",
    "    print(\"Downloading orbit_extra_annotations.zip...\")\n",
    "    !wget -O orbit_extra_annotations.zip https://github.com/microsoft/ORBIT-Dataset/raw/dev/data/orbit_extra_annotations.zip\n",
    "\n",
    "    print(\"Unzipping orbit_extra_annotations.zip...\")\n",
    "    !unzip -q orbit_extra_annotations.zip -d {annotation_path}\n",
    "\n",
    "    if not annotation_path.is_dir():\n",
    "        raise ValueError(f\"Path {annotation_path} is not a directory.\")\n",
    "    else:\n",
    "        print(f\"Annotations ready at {annotation_path}.\")\n",
    "    # You can now delete the zip file.\n",
    "else:\n",
    "    print(f\"Annotations already saved at {annotation_path}.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, we can create an instance of the dataset. This creates a queue of tasks from the dataset that can be divided between multiple workers."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from data.queues import UserEpisodicDatasetQueue\n",
    "\n",
    "print(\"Creating data queue...\")\n",
    "data_queue = UserEpisodicDatasetQueue(\n",
    "    root=Path(DATA_ROOT, DATA_SPLIT), # path to data\n",
    "    way_method=\"max\", # sample all objects per user\n",
    "    object_cap=15, # cap number of objects per user to 15 (no ORBIT user has >15 objects)\n",
    "    shot_method=[\"max\", \"max\"], # sample [all context videos, all target videos] per object\n",
    "    shots=[5, 2], # only relevant if shot_method contains strings \"specific\" or \"fixed\"\n",
    "    video_types=[\"clean\", \"clutter\"], # sample clips from [clean context videos, clutter target videos]\n",
    "    subsample_factor=30, # sample every 30th clip from a video if clip_method = uniform\n",
    "    clip_methods=[\"uniform\", \"random_200\"], # sample [clips uniformly from each context video, 200 random target clips per target video]; note if test_mode=True, target clips will be flattened into a list of frames\n",
    "    clip_length=1, # sample 1 frame per clip. Can be increased to sample multiple frames per clip.\n",
    "    frame_size=224, # width and height of frame\n",
    "    frame_norm_method='openai_clip', # normalize frames using CLIP's statistics since we're using CLIP's vision encoder (see below).\n",
    "    annotations_to_load=[], # do not load any frame annotations\n",
    "    filter_by_annotations=[[], ['no_object_not_present_issue']], # only includes target frames with the 'object_not_present_issue=False' tag. Note, context frames are not filtered as extra annotations cannot be used for personalisation.\n",
    "    num_tasks=10, # sample 10 tasks per user. Note, this is just for the starter task. The full challenge will require you to sample 50 tasks per user.\n",
    "    test_mode=True, # sample test (rather than train) tasks\n",
    "    with_cluster_labels=False, # use user's personalised object names as labels, rather than broader object categories\n",
    "    with_caps=False, # do not impose any sampling caps\n",
    "    shuffle=False, # do not shuffle task data\n",
    "    num_workers=2 # use 2 workers to load data\n",
    ")\n",
    "\n",
    "print(f\"Created data queue, queue uses {data_queue.num_workers} workers.\")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We now need to set up the model. For the starter task, we will use a few-shot learning model called Prototypical Networks (Snell et al., 2017) using a Euclidean distance. The model uses CLIP's (Transformer) vision encoder as a feature extractor (i.e. 'vit_b_32_clip'). We then meta-train this model on the ORBIT train users for the CLUVE or Clutter Video Evaluation task (trained on 224x224 frame size, using LITE). First, we download the checkpoint file that corresponds to this model. We then create an instance of the model using the pretrained weights."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "checkpoint_name = \"orbit_cluve_protonets_vit_b_32_clip_224_lite.pth\"\n",
    "checkpoint_path = Path(\"orbit_pretrained_checkpoints\", checkpoint_name)\n",
    "\n",
    "if not checkpoint_path.exists():\n",
    "    checkpoint_path.parent.mkdir(parents=True, exist_ok=True)\n",
    "    print(\"Downloading checkpoint file...\")\n",
    "    !wget -O orbit_pretrained_checkpoints/$checkpoint_name https://taixmachinelearning.z5.web.core.windows.net/$checkpoint_name\n",
    "    print(f\"Checkpoint saved to {checkpoint_path}.\")\n",
    "else:\n",
    "    print(f\"Checkpoint already exists at {checkpoint_path}.\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import torch\n",
    "from model.few_shot_recognisers import SingleStepFewShotRecogniser\n",
    "\n",
    "if torch.cuda.is_available():\n",
    "    device = torch.device(\"cuda:0\")\n",
    "    map_location = lambda storage, _: storage.cuda()\n",
    "else:\n",
    "    device = torch.device(\"cpu\")\n",
    "    map_location = lambda storage, _: storage.cpu()\n",
    "\n",
    "model = SingleStepFewShotRecogniser(\n",
    "    feature_extractor_name=\"vit_b_32_clip\", # feature extractor is CLIP's vision encoder (a ViT-B-32 architecture)\n",
    "    adapt_features=False, # do not generate FiLM Layers\n",
    "    classifier=\"proto\", # use a Prototypical Networks classifier head, with a cosine rather than Euclidean distance metric\n",
    "    clip_length=1, # number of frames per clip; frame features are mean-pooled to get the clip feature\n",
    "    batch_size=256, # number of clips within a task to process at a time\n",
    "    learn_extractor=False, # only relevant when training ProtoNets\n",
    "    num_lite_samples=16, # only relevant when training with LITE\n",
    "    logit_scale=1.0 # scalar to scale logits (32.0 for proto_cosine, but typically 1.0 for proto)\n",
    ")\n",
    "\n",
    "model._set_device(device)\n",
    "model._send_to_device()\n",
    "# load in the pretrained checkpoint weights\n",
    "model.load_state_dict(torch.load(checkpoint_path, map_location=map_location), strict=False)\n",
    "# set the model to evaluation mode (ensures batch norm modules are in the correct state)\n",
    "model.set_test_mode(True)\n",
    "print(f\"Instance of SingleStepFewShotRecogniser created on device {device}.\")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We are now going to run our data through our model. We go through each task (10 tasks per user, since we specified `num_tasks = 10` above) and use the task's context clips to create a personalized model for that user's task. We then evaluate the personalized model on each frame in the task's target videos.\n",
    "\n",
    "The results for each task will be saved to a JSON file (this is what should be submitted to the evaluation server) and the aggregate stats will be printed to the console. You should get a frame accuracy of 85.67 +/- 1.50% - see `Acccuracy (averaged per video) (leaderboard metric)`, and a MACs to personalise of 4.78T +/- 1.27T - see `MACs to personalise (averaged per task) (leaderboard metric)`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import time\n",
    "from tqdm.notebook import tqdm\n",
    "from typing import Dict, Tuple\n",
    "from data.utils import attach_frame_history\n",
    "from utils.eval_metrics import TestEvaluator\n",
    "\n",
    "output_dir = Path(\"output\", DATA_SPLIT)\n",
    "output_dir.mkdir(exist_ok=True, parents=True)\n",
    "\n",
    "metrics = ['frame_acc']\n",
    "evaluator = TestEvaluator(metrics, output_dir, with_ops_counter=True)\n",
    "evaluator.set_base_params(model)\n",
    "print(evaluator.check_for_uncounted_modules(model)) # check if the MACs of any modules are not being counted\n",
    "\n",
    "def get_stats_str(stats: Dict[str, Tuple[float, float]], dps: int=2) -> str:\n",
    "    stats_str = \"\\t\".join([f\"{metric}: {stats[metric][0]*100:.{dps}f} ({stats[metric][1]*100:.{dps}f})\" for metric in metrics])\n",
    "    return stats_str\n",
    "\n",
    "num_context_clips_per_task = []\n",
    "num_target_clips_per_task = []\n",
    "num_test_tasks = data_queue.num_users * data_queue.num_tasks\n",
    "with torch.no_grad():\n",
    "    for step, task in enumerate(tqdm(data_queue.get_tasks(), desc=f\"Running evaluation on {data_queue.num_tasks} tasks per test user\", total=num_test_tasks)):\n",
    "        context_clips = task[\"context_clips\"].to(device)        # Torch tensor of shape: (N, clip_length, C, H, W), dtype float32\n",
    "        context_labels = task[\"context_labels\"].to(device)      # Torch tensor of shape: (N), dtype int64\n",
    "        object_list = task[\"object_list\"]                       # List of str of length num_objects\n",
    "        num_context_clips = len(context_clips)                  # num_context_clips will be the same for all tasks of the same user \n",
    "                                                                # since we're sampling frames uniformly from each videos (clip_method = uniform)\n",
    "                                                                # and we're sampling from all the user's videos (shot method = max)\n",
    "\n",
    "        # log task in evaluator\n",
    "        evaluator.set_task_object_list(object_list)\n",
    "        #evaluator.set_task_context_paths(task[\"context_paths\"])\n",
    "\n",
    "        # personalise the pre-trained model to the current task and log the time it took\n",
    "        t1 = time.time()\n",
    "        model.personalise(context_clips, context_labels, ops_counter=evaluator.ops_counter)\n",
    "        evaluator.log_time(time.time() - t1, 'personalise')\n",
    "\n",
    "        # loop through each of the user's target videos, and get predictions from the personalised model for every frame\n",
    "        num_target_clips = 0\n",
    "        for video_frames, video_paths, video_label in zip(task['target_clips'], task[\"target_paths\"], task['target_labels']):\n",
    "            # video_frames is a Torch tensor of shape (frame_count, C, H, W), dtype float32\n",
    "            # video_paths is a Torch tensor of shape (frame_count), dtype object (Path)\n",
    "            # video_label is single int64\n",
    "\n",
    "            # first, for each frame, attach a short history of its previous frames if clip_length > 1\n",
    "            video_frames_with_history = attach_frame_history(video_frames, model.clip_length)      # Torch tensor of shape: (frame_count, clip_length, C, H, W), dtype float32\n",
    "            num_clips = len(video_frames_with_history)\n",
    "\n",
    "            t1 = time.time()\n",
    "            # get predicted logits for each frame and log time per clip\n",
    "            logits = model.predict(video_frames_with_history)                                      # Torch tensor of shape: (frame_count, num_objects), dtype float32\n",
    "            evaluator.log_time((time.time() - t1)/float(num_clips*model.clip_length), 'inference') # log inference time per frame (so average over num_clips*clip_length)\n",
    "            evaluator.append_video(logits, video_label, video_paths)\n",
    "            num_target_clips += num_clips\n",
    "\n",
    "        # reset model for next task\n",
    "        model._reset()\n",
    "\n",
    "        # complete the task (required for correct ops counter numbers)\n",
    "        evaluator.task_complete()\n",
    "        num_context_clips_per_task.append(num_context_clips)\n",
    "        num_target_clips_per_task.append(num_target_clips)\n",
    "\n",
    "        # check if the user has any more tasks; if tasks_per_user == 50, we reset every 50th task.\n",
    "        if (step+1) % data_queue.num_tasks == 0:\n",
    "            evaluator.set_current_user(task[\"task_id\"])\n",
    "            _,_,_,current_video_stats = evaluator.get_mean_stats(current_user=True)\n",
    "            current_macs_mean,_,_,_ = evaluator.get_mean_ops_counter_stats(current_user=True)\n",
    "            tqdm.write(f\"User {task['task_id']} ({evaluator.current_user+1}/{len(data_queue)}) {get_stats_str(current_video_stats)}, avg MACs to personalise/task: {current_macs_mean}, avg # context clips/task: {np.mean(num_context_clips_per_task):.0f}, avg # target clips/task: {np.mean(num_target_clips_per_task):.0f}\")\n",
    "            if (step+1) < num_test_tasks:\n",
    "                num_context_clips_per_task = []\n",
    "                num_target_clips_per_task = []\n",
    "                evaluator.next_user()\n",
    "        else:\n",
    "            evaluator.next_task()\n",
    "\n",
    "# Compute the aggregate statistics averaged over users and averged over videos. We use the video aggregate stats for the competition.\n",
    "stats_per_user, stats_per_obj, stats_per_task, stats_per_video = evaluator.get_mean_stats()\n",
    "mean_macs, std_macs, mean_params, params_breakdown  = evaluator.get_mean_ops_counter_stats()\n",
    "mean_personalise_time, std_personalise_time, mean_inference_time, std_inference_time = evaluator.get_mean_times()\n",
    "print('-'*20)\n",
    "print(f\"Time to personalise (averaged per task): {mean_personalise_time} ({std_personalise_time})\")\n",
    "print(f\"Inference time per frame (averaged per task): {mean_inference_time} ({std_inference_time})\")\n",
    "print(f\"Number of params {mean_params} ({params_breakdown})\")\n",
    "print(f\"Accuracy (averaged per user): {get_stats_str(stats_per_user)}\")\n",
    "print(f\"Accuracy (averaged per object): {get_stats_str(stats_per_obj)}\")\n",
    "print(f\"Accuracy (averaged per task): {get_stats_str(stats_per_task)}\")\n",
    "print(f\"Accuracy (averaged per video) (leaderboard metric): {get_stats_str(stats_per_video)}\")\n",
    "print(f\"MACs to personalise (averaged per task) (leaderboard metric): {mean_macs} ({std_macs})\")\n",
    "evaluator.save()\n",
    "print(f\"Results saved to {evaluator.json_results_path}.\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": ".venv",
   "language": "python",
   "name": ".venv"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.-1"
  },
  "orig_nbformat": 4,
  "vscode": {
   "interpreter": {
    "hash": "35a491206864cd1cd42eca7a8f938db8ccb579057bd59c1d04d1dc1f060cefe6"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}