Commit
c7d01cf0b8
|
@ -1,8 +1,8 @@
|
|||
<img src="scenarios/media/logo_cvbp.png" align="right" alt="" width="300"/>
|
||||
|
||||
```diff
|
||||
+ Update June 24: Added action recognition as new core scenario.
|
||||
+ Object tracking coming soon (in 2-4 weeks).
|
||||
+ Update July: Added support for action recognition and tracking
|
||||
+ in the new release v1.2.
|
||||
```
|
||||
|
||||
# Computer Vision
|
||||
|
@ -55,6 +55,7 @@ The following is a summary of commonly used Computer Vision scenarios that are c
|
|||
| [Keypoints](scenarios/keypoints) | Base | Keypoint detection can be used to detect specific points on an object. A pre-trained model is provided to detect body joints for human pose estimation. |
|
||||
| [Segmentation](scenarios/segmentation) | Base | Image Segmentation assigns a category to each pixel in an image. |
|
||||
| [Action recognition](scenarios/action_recognition) | Base | Action recognition identifies which actions are performed in video/webcam footage (e.g. "running", "opening a bottle") and their respective start/end times. We also provide an i3d implementation of action recognition, which can be found under [contrib](contrib). |
|
||||
| [Tracking](scenarios/tracking) | Base | Tracking allows you to detect and track multiple objects in a video sequence over time. |
|
||||
| [Crowd counting](contrib/crowd_counting) | Contrib | Counting the number of people in low-crowd-density (e.g. less than 10 people) and high-crowd-density (e.g. thousands of people) scenarios.|
|
||||
|
||||
We separate the supported CV scenarios into two locations: (i) **base**: code and notebooks within the "utils_cv" and "scenarios" folders which follow strict coding guidelines, are well tested and maintained; (ii) **contrib**: code and other assets within the "contrib" folder, mainly covering less common CV scenarios using bleeding edge state-of-the-art approaches. Code in "contrib" is not regularly tested or maintained.
|
||||
|
|
|
@ -8,6 +8,7 @@
|
|||
| [Keypoints](keypoints) | Keypoint Detection can be used to detect specific points on an object. A pre-trained model is provided to detect body joints for human pose estimation. |
|
||||
| [Segmentation](segmentation) | Image Segmentation assigns a category to each pixel in an image. |
|
||||
| [Action Recognition](action_recognition) | Action Recognition (also known as activity recognition) consists of classifying various actions from a sequence of frames, such as "reading" or "drinking". |
|
||||
| [Tracking](tracking) | Tracking allows you to detect and track multiple objects in a video sequence over time. |
|
||||
|
||||
|
||||
# Scenarios
|
||||
|
|
File diffs are hidden because one or more lines are too long
|
@ -0,0 +1,384 @@
|
|||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"<i>Copyright (c) Microsoft Corporation. All rights reserved.</i>\n",
|
||||
"\n",
|
||||
"<i>Licensed under the MIT License.</i>"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Evaluating a Multi-Object Tracking Model on MOT Challenge"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"This notebook provides a framework for evaluating [FairMOT](https://github.com/ifzhang/FairMOT) on the [MOT Challenge dataset](https://motchallenge.net/).\n",
|
||||
"\n",
|
||||
"The MOT Challenge datasets are some of the most common benchmarking datasets for measuring multi-object tracking performance on pedestrian data. They provide distinct datasets every few years; their current offerings include MOT15, MOT16/17, and MOT 19/20. These datasets contain various annotated video sequences, each with different tracking difficulties. Additionally, MOT Challenge provides detections for tracking algorithms without detection components.\n",
|
||||
"\n",
|
||||
"The goal of this notebook is to re-produce published results on the MOT challenge using the state-of-the-art FairMOT approach."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Initialization"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 1,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Ensure edits to libraries are loaded and plotting is shown in the notebook.\n",
|
||||
"%reload_ext autoreload\n",
|
||||
"%autoreload 2\n",
|
||||
"%matplotlib inline"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 2,
|
||||
"metadata": {
|
||||
"scrolled": true
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"TorchVision: 0.4.0a0+6b959ee\n",
|
||||
"Torch is using GPU: Tesla K80\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"import os\n",
|
||||
"import os.path as osp\n",
|
||||
"import sys\n",
|
||||
"import time\n",
|
||||
"\n",
|
||||
"from urllib.parse import urljoin\n",
|
||||
"import torch\n",
|
||||
"import torchvision\n",
|
||||
"\n",
|
||||
"sys.path.append(\"../../\")\n",
|
||||
"from utils_cv.common.data import data_path, download, unzip_url\n",
|
||||
"from utils_cv.common.gpu import which_processor, is_windows\n",
|
||||
"from utils_cv.tracking.data import Urls\n",
|
||||
"from utils_cv.tracking.dataset import TrackingDataset\n",
|
||||
"from utils_cv.tracking.model import TrackingLearner\n",
|
||||
"\n",
|
||||
"# Change matplotlib backend so that plots are shown for windows\n",
|
||||
"if is_windows():\n",
|
||||
" plt.switch_backend(\"TkAgg\")\n",
|
||||
"\n",
|
||||
"print(f\"TorchVision: {torchvision.__version__}\")\n",
|
||||
"which_processor()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"The above torchvision command displays your machine's GPUs (if it has any) and the compute that `torch/torchvision` is using."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Next, we will set some model runtime parameters. Here we will specify the default FairMOT model (dla34) and will evaluate against the MOT17 dataset."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 9,
|
||||
"metadata": {
|
||||
"tags": [
|
||||
"parameters"
|
||||
]
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Using torch device: cuda\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"CONF_THRES = 0.4\n",
|
||||
"TRACK_BUFFER = 30\n",
|
||||
"\n",
|
||||
"# Downloaded MOT Challendage data path\n",
|
||||
"MOT_ROOT_PATH = \"../../data/\"\n",
|
||||
"RESULT_ROOT = \"./results\"\n",
|
||||
"EXP_NAME = \"MOT_val_all_dla34\"\n",
|
||||
"\n",
|
||||
"BASELINE_MODEL = \"./models/all_dla34.pth\"\n",
|
||||
"MOTCHALLENGE_BASE_URL = \"https://motchallenge.net/data/\"\n",
|
||||
"\n",
|
||||
"# train on the GPU or on the CPU, if a GPU is not available\n",
|
||||
"device = torch.device(\"cuda\") if torch.cuda.is_available() else torch.device(\"cpu\")\n",
|
||||
"print(f\"Using torch device: {device}\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Model and Dataset Initialization"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Now, we will download the [MOT17](https://motchallenge.net/data/MOT17.zip) dataset and save it to `MOT_SAVED_PATH`. Note: MOT17 is around 5GB and it may take some time to download."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 4,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Training data saved to ../../data/MOT17/train\n",
|
||||
"Test data saved to ../../data/MOT17/test\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"mot_path = urljoin(MOTCHALLENGE_BASE_URL, \"MOT17.zip\")\n",
|
||||
"mot_train_path = osp.join(MOT_ROOT_PATH, \"MOT17\", \"train\")\n",
|
||||
"mot_test_path = osp.join(MOT_ROOT_PATH, \"MOT17\", \"test\")\n",
|
||||
"# seqs_str: various video sequences subfolder names under MOT challenge data\n",
|
||||
"train_seqs = [\n",
|
||||
" \"MOT17-02-SDP\",\n",
|
||||
" \"MOT17-04-SDP\",\n",
|
||||
" \"MOT17-05-SDP\",\n",
|
||||
" \"MOT17-09-SDP\",\n",
|
||||
" \"MOT17-10-SDP\",\n",
|
||||
" \"MOT17-11-SDP\",\n",
|
||||
" \"MOT17-13-SDP\",\n",
|
||||
"]\n",
|
||||
"test_seqs = [\n",
|
||||
" \"MOT17-01-SDP\",\n",
|
||||
" \"MOT17-03-SDP\",\n",
|
||||
" \"MOT17-06-SDP\",\n",
|
||||
" \"MOT17-07-SDP\",\n",
|
||||
" \"MOT17-08-SDP\",\n",
|
||||
" \"MOT17-12-SDP\",\n",
|
||||
" \"MOT17-14-SDP\",\n",
|
||||
"]\n",
|
||||
"\n",
|
||||
"unzip_url(mot_path, dest=MOT_ROOT_PATH, exist_ok=True)\n",
|
||||
"print(f\"Training data saved to {mot_train_path}\")\n",
|
||||
"print(f\"Test data saved to {mot_test_path}\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"The pre-trained, baseline FairMOT model - `all_dla34.pth`- can be downloaded [here](https://drive.google.com/file/d/1udpOPum8fJdoEQm6n0jsIgMMViOMFinu/view). \n",
|
||||
"\n",
|
||||
"Please upload and save `all_dla34.pth` to the `BASELINE_MODEL` path."
|
||||
]
|
||||
},
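{
"cell_type": "markdown",
"metadata": {},
"source": [
"The optional sanity check below is not part of the original workflow; it simply fails early, with a clear message, if the weights file has not been placed at the `BASELINE_MODEL` path defined above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional sanity check (hypothetical helper, not from the original notebook):\n",
"# verify that the baseline weights file exists before constructing the tracker.\n",
"assert osp.exists(BASELINE_MODEL), (\n",
"    f\"Baseline model not found at {BASELINE_MODEL}; \"\n",
"    \"please download all_dla34.pth and place it there.\"\n",
")"
]
},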
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"The code below initializes and loads the model using the TrackingLearner class."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 5,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"tracker = TrackingLearner(None, BASELINE_MODEL)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Evaluate on Training Set"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"MOT17 provides ground truth annotations for only the training set, so we will be using the training set for evaluation.\n",
|
||||
"\n",
|
||||
"To evaluate FairMOT on this dataset, we take advantage of the [py-motmetrics](https://github.com/cheind/py-motmetrics) repository. "
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 6,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"loaded ./models/all_dla34.pth, epoch 10\n",
|
||||
"Saved tracking results to ./results/MOT_val_all_dla34/MOT17-02-SDP.txt\n",
|
||||
"Evaluate seq: MOT17-02-SDP\n",
|
||||
"loaded ./models/all_dla34.pth, epoch 10\n",
|
||||
"Saved tracking results to ./results/MOT_val_all_dla34/MOT17-04-SDP.txt\n",
|
||||
"Evaluate seq: MOT17-04-SDP\n",
|
||||
"loaded ./models/all_dla34.pth, epoch 10\n",
|
||||
"Saved tracking results to ./results/MOT_val_all_dla34/MOT17-05-SDP.txt\n",
|
||||
"Evaluate seq: MOT17-05-SDP\n",
|
||||
"loaded ./models/all_dla34.pth, epoch 10\n",
|
||||
"Saved tracking results to ./results/MOT_val_all_dla34/MOT17-09-SDP.txt\n",
|
||||
"Evaluate seq: MOT17-09-SDP\n",
|
||||
"loaded ./models/all_dla34.pth, epoch 10\n",
|
||||
"Saved tracking results to ./results/MOT_val_all_dla34/MOT17-10-SDP.txt\n",
|
||||
"Evaluate seq: MOT17-10-SDP\n",
|
||||
"loaded ./models/all_dla34.pth, epoch 10\n",
|
||||
"Saved tracking results to ./results/MOT_val_all_dla34/MOT17-11-SDP.txt\n",
|
||||
"Evaluate seq: MOT17-11-SDP\n",
|
||||
"loaded ./models/all_dla34.pth, epoch 10\n",
|
||||
"Saved tracking results to ./results/MOT_val_all_dla34/MOT17-13-SDP.txt\n",
|
||||
"Evaluate seq: MOT17-13-SDP\n",
|
||||
" IDF1 IDP IDR Rcll Prcn GT MT PT ML FP FN IDs FM MOTA MOTP IDt IDa IDm\n",
|
||||
"MOT17-02-SDP 63.9% 77.4% 54.4% 68.5% 97.4% 62 22 31 9 344 5855 183 656 65.7% 0.193 98 33 13\n",
|
||||
"MOT17-04-SDP 83.7% 86.1% 81.3% 86.3% 91.4% 83 51 20 12 3868 6531 32 201 78.1% 0.171 5 20 2\n",
|
||||
"MOT17-05-SDP 75.9% 82.9% 70.1% 81.2% 96.0% 133 63 59 11 236 1303 79 207 76.6% 0.199 83 26 40\n",
|
||||
"MOT17-09-SDP 65.4% 71.6% 60.2% 81.1% 96.5% 26 19 7 0 158 1006 52 105 77.2% 0.165 37 12 7\n",
|
||||
"MOT17-10-SDP 65.0% 72.1% 59.2% 78.8% 96.0% 57 32 25 0 418 2721 149 404 74.4% 0.213 89 43 14\n",
|
||||
"MOT17-11-SDP 85.6% 87.9% 83.4% 90.4% 95.2% 75 52 19 4 426 910 38 134 85.4% 0.157 24 19 13\n",
|
||||
"MOT17-13-SDP 77.0% 82.0% 72.5% 83.9% 94.8% 110 74 29 7 534 1878 86 373 78.5% 0.205 72 26 35\n",
|
||||
"OVERALL 76.8% 82.3% 71.9% 82.0% 93.9% 546 313 190 43 5984 20204 619 2080 76.1% 0.182 408 179 124\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"strsummary = tracker.eval_mot(\n",
|
||||
" conf_thres=CONF_THRES,\n",
|
||||
" track_buffer=TRACK_BUFFER, \n",
|
||||
" data_root=mot_train_path,\n",
|
||||
" seqs=train_seqs,\n",
|
||||
" result_root=RESULT_ROOT,\n",
|
||||
" exp_name=EXP_NAME,\n",
|
||||
")\n",
|
||||
"print(strsummary)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Evaluate on Test Set"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"For evaluating a model on the testing dataset, the MOT Challenge provides the [MOT evaluation server](https://motchallenge.net/instructions/). Here, a user can upload and submit a txt file of prediction results; the service will return metrics. After uploading our results to the [MOT17 evaluation server](https://motchallenge.net/results/MOT17/?det=Private), we can see a MOTA of 68.5 using the 'all_dla34.pth' baseline model.\n",
|
||||
"\n",
|
||||
"<img src=\"media/mot_results.PNG\" style=\"width: 737.5px;height: 365px\"/>\n",
|
||||
"\n",
|
||||
"The reported evaluation results from [FairMOT paper](https://arxiv.org/abs/2004.01888) with test set are as follows:\n",
|
||||
"\n",
|
||||
"| Dataset | MOTA | IDF1 | IDS | MT | ML | FPS |\n",
|
||||
"|------|------|------|------|------|------|------|\n",
|
||||
"| MOT16 | 68.7| 70.4| 953| 39.5%| 19.0%| 25.9|\n",
|
||||
"| MOT17 | 67.5| 69.8| 2868| 37.7%| 20.8%| 25.9|"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 7,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Saved tracking results to ./results/MOT_val_all_dla34/MOT17-01-SDP.txt\n",
|
||||
"Saved tracking results to ./results/MOT_val_all_dla34/MOT17-03-SDP.txt\n",
|
||||
"Saved tracking results to ./results/MOT_val_all_dla34/MOT17-06-SDP.txt\n",
|
||||
"Saved tracking results to ./results/MOT_val_all_dla34/MOT17-07-SDP.txt\n",
|
||||
"Saved tracking results to ./results/MOT_val_all_dla34/MOT17-08-SDP.txt\n",
|
||||
"Saved tracking results to ./results/MOT_val_all_dla34/MOT17-12-SDP.txt\n",
|
||||
"Saved tracking results to ./results/MOT_val_all_dla34/MOT17-14-SDP.txt\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"tracker.eval_mot(\n",
|
||||
" conf_thres=CONF_THRES,\n",
|
||||
" track_buffer=TRACK_BUFFER, \n",
|
||||
" data_root=mot_test_path,\n",
|
||||
" seqs=test_seqs,\n",
|
||||
" result_root=RESULT_ROOT,\n",
|
||||
" exp_name=EXP_NAME,\n",
|
||||
" run_eval=False,\n",
|
||||
")"
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.7.6"
|
||||
},
|
||||
"toc": {
|
||||
"base_numbering": 1,
|
||||
"nav_menu": {},
|
||||
"number_sections": true,
|
||||
"sideBar": true,
|
||||
"skip_h1_title": false,
|
||||
"title_cell": "Table of Contents",
|
||||
"title_sidebar": "Contents",
|
||||
"toc_cell": false,
|
||||
"toc_position": {
|
||||
"height": "calc(100% - 180px)",
|
||||
"left": "10px",
|
||||
"top": "150px",
|
||||
"width": "356.258px"
|
||||
},
|
||||
"toc_section_display": true,
|
||||
"toc_window_display": false
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 2
|
||||
}
|
|
@ -1,81 +1,113 @@
|
|||
# Multi-Object Tracking
|
||||
|
||||
```diff
|
||||
+ June 2020: This work is ongoing.
|
||||
```
|
||||
|
||||
## Frequently asked questions
|
||||
|
||||
This document tries to answer frequent questions related to multi-object tracking. For generic Machine Learning questions, such as "How many training examples do I need?" or "How to monitor GPU usage during training?" see also the image classification [FAQ](https://github.com/microsoft/ComputerVision/blob/master/classification/FAQ.md).
|
||||
|
||||
* General
|
||||
* [Why FairMOT repository for the tracking algorithm?](#why-FAIRMOT)
|
||||
* [What are additional complexities that can enhance the current MOT algorithm](#What-are-additional-complexities-that-can-enhance-the-current-MOT-algorithm)
|
||||
* [What is the difference between online and offline (batch) tracking algorithms?](#What-is-the-difference-between-online-and-offline-tracking-algorithms)
|
||||
This document answers common questions and covers common topics related to multi-object tracking. For more general Machine Learning questions, such as "How many training examples do I need?" or "How to monitor GPU usage during training?", see also the image classification [FAQ](https://github.com/microsoft/ComputerVision/blob/master/classification/FAQ.md).
|
||||
|
||||
* Data
|
||||
* [How to annotate a video for evaluation?](#how-to-annotate-a-video-for-evaluation)
|
||||
* [What is the MOT Challenge format used by the evaluation package?](#What-is-the-MOT-Challenge-format-used-by-the-evaluation-package)
|
||||
|
||||
* Technology State-of-the-Art (SoTA)
|
||||
* [What is the architecture of the FairMOT tracking algorithm?](#What-is-the-architecture-of-the-FairMOT-tracking-algorithm)
|
||||
* [What are SoTA object detectors used in tracking-by-detection trackers?](#What-are-SoTA-object-detectors-used-in-tracking-by-detection-trackers)
|
||||
* [What are SoTA feature extraction techniques used in tracking-by-detection trackers?](#What-are-SoTA-feature-extraction-techniques-used-in-tracking-by-detection-trackers)
|
||||
* [What are SoTA affinity and association techniques used in tracking-by-detection trackers?](#What-are-SoTA-affinity-and-association-techniques-used-in-tracking-by-detection-trackers)
|
||||
* [What are the main evaluation metrics for tracking performance?](#What-are-the-main-evaluation-metrics)
|
||||
* [How to annotate images?](#how-to-annotate-images)
|
||||
|
||||
* Training and Inference
|
||||
* [How to improve training accuracy?](#how-to-improve-training-accuracy)
|
||||
* [What are the main training parameters in FairMOT](#what-are-the-main-training-parameters-in-FairMOT)
|
||||
* [What are the main inference parameters in FairMOT?](#What-are-the-main-inference-parameters-in-FairMOT])
|
||||
* [What are the training losses for MOT using FairMOT?](#What-are-the-training-losses-for-MOT-using-FairMOT? )
|
||||
* [What are the training losses in FairMOT?](#what-are-the-training-losses-in-fairmot)
|
||||
* [What are the main inference parameters in FairMOT?](#what-are-the-main-inference-parameters-in-fairmot)
|
||||
|
||||
* MOT Challenge
|
||||
* [What is the MOT Challenge?](#What-is-the-MOT-Challenge)
|
||||
* Evaluation
|
||||
* [What is the MOT Challenge?](#what-is-the-mot-challenge)
|
||||
* [What are the commonly used evaluation metrics?](#what-are-the-commonly-used-evaluation-metrics)
|
||||
|
||||
* State-of-the-Art(SoTA) Technology
|
||||
* [Popular MOT datasets](#popular-mot-datasets)
|
||||
* [What is the architecture of the FairMOT tracking algorithm?](#what-is-the-architecture-of-the-fairmot-tracking-algorithm)
|
||||
* [What object detectors are used in tracking-by-detection trackers?](#what-object-detectors-are-used-in-tracking-by-detection-trackers)
|
||||
* [What feature extraction techniques are used in tracking-by-detection trackers?](#what-feature-extraction-techniques-are-used-in-tracking-by-detection-trackers)
|
||||
* [What affinity and association techniques are used in tracking-by-detection trackers?](#what-affinity-and-association-techniques-are-used-in-tracking-by-detection-trackers)
|
||||
* [What is the difference between online and offline (batch) tracking algorithms?](#what-is-the-difference-between-online-and-offline-tracking-algorithms)
|
||||
|
||||
* Popular Publications and Datasets
|
||||
* [Popular Datasets](#popular-datasets)
|
||||
* [Popular Publications](#popular-publications)
|
||||
|
||||
|
||||
|
||||
|
||||
## General
|
||||
|
||||
### Why FairMOT?
|
||||
FairMOT is an [open-source](https://github.com/ifzhang/FairMOT) online one-shot tracking algorithm that has shown [competitive performance in recent MOT benchmarking challenges](https://motchallenge.net/method/MOT=3015&chl=5) at fast inference speeds.
|
||||
|
||||
|
||||
### What are additional complexities that can enhance the current MOT algorithm?
|
||||
Examples include multi-camera processing, and compensating for the effect of camera movement on the association features using epipolar geometry.
|
||||
|
||||
|
||||
### What is the difference between online and offline tracking algorithms?
|
||||
These algorithms differ at the data association step. In online tracking, the detections in a new frame are associated with tracks generated from previous frames, so existing tracks are extended or new tracks are created. In offline (batch) tracking, all observations in a batch of frames are considered globally (see figure below), i.e. they are linked together into tracks by obtaining a globally optimal solution. Offline tracking copes better with issues such as long-term occlusion, or similar targets that are spatially close. However, offline tracking is slow and hence not suitable for online tasks such as autonomous driving. Recently, research has focused on online tracking algorithms, which have reached the performance of offline tracking while still maintaining high inference speed.
|
||||
|
||||
<p align="center">
|
||||
<img src="./media/fig_onlineBatch.jpg" width="400" align="center"/>
|
||||
</p>
|
||||
|
||||
## Data
|
||||
|
||||
### How to annotate a video for evaluation?
|
||||
We can use an annotation tool, such as VOTT, to annotate a video for ground-truth. For example, for the evaluation video, we can draw bounding boxes around the 2 cans, and tag them as `can_1` and `can_2`:
|
||||
<p align="center">
|
||||
<img src="./media/carcans_vott_ui.jpg" width="600" align="center"/>
|
||||
</p>
|
||||
### How to annotate images?
|
||||
|
||||
Before annotating, make sure to set the extraction rate to match that of the video. After annotation, you can export the annotation results into csv form. You will end up with the extracted frames as well as a csv file containing the bounding box and id info: ``` [image] [xmin] [y_min] [x_max] [y_max] [label]```
|
||||
For training, we use the exact same annotation format as for object detection (see this [FAQ](https://github.com/microsoft/computervision-recipes/blob/master/scenarios/detection/FAQ.md#how-to-annotate-images)). This also means that we train on individual frames, without taking the temporal location of these frames into account.
|
||||
|
||||
### What is the MOT Challenge format used by the evaluation package?
|
||||
The evaluation package, from the [py-motmetrics](https://github.com/cheind/py-motmetrics) repository, requires the ground-truth data to be in [MOT challenge](https://motchallenge.net/) format, i.e.:
|
||||
For evaluation, we follow the [py-motmetrics](https://github.com/cheind/py-motmetrics) repository, which requires the ground-truth data to be in [MOT Challenge](https://motchallenge.net/) format. For ground-truth annotation, the last 3 columns can be set to -1 by default:
|
||||
```
|
||||
[frame number] [id number] [bbox left] [bbox top] [bbox width] [bbox height][confidence score][class][visibility]
|
||||
```
|
||||
The last 3 columns can be set to -1 by default, for the purpose of ground-truth annotation.
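As a concrete illustration (not code from this repository), the following minimal Python sketch converts a VOTT-style csv export of the form `[image] [xmin] [y_min] [x_max] [y_max] [label]` into such a MOT Challenge ground-truth file. The function name, the assumption of a header row, and the assumption that the frame index can be parsed from the image file name are all illustrative choices.

```python
import csv
from collections import defaultdict

def vott_csv_to_mot_gt(vott_csv_path: str, gt_txt_path: str) -> None:
    """Convert a VOTT-style csv export ([image, xmin, y_min, x_max, y_max, label])
    into MOT Challenge ground-truth rows; the last 3 columns are set to -1."""
    label_to_id = defaultdict(lambda: len(label_to_id) + 1)  # e.g. can_1 -> 1, can_2 -> 2
    rows = []
    with open(vott_csv_path, newline="") as f:
        reader = csv.reader(f)
        next(reader)  # assumes the export starts with a header row
        for image, xmin, ymin, xmax, ymax, label in reader:
            # Assumes the frame index is encoded in the image file name, e.g. frame_000012.jpg
            frame = int("".join(ch for ch in image if ch.isdigit()))
            left, top = float(xmin), float(ymin)
            width, height = float(xmax) - left, float(ymax) - top
            rows.append([frame, label_to_id[label], left, top, width, height, -1, -1, -1])
    rows.sort(key=lambda r: (r[0], r[1]))
    with open(gt_txt_path, "w", newline="") as f:
        csv.writer(f).writerows(rows)
```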
|
||||
|
||||
|
||||
## Technology State-of-the-Art (SoTA)
|
||||
See below for an example where we use [VOTT](https://github.com/microsoft/VoTT) to annotate the two cans in the image as `can_1` and `can_2`, where `can_1` refers to the white/yellow can and `can_2` refers to the red can. Before annotating, it is important to correctly set the extraction rate to match that of the video. After annotation, you can export the annotation results into several formats, such as PASCAL VOC or .csv. For the .csv format, VOTT returns the extracted frames, as well as a csv file containing the bounding box and id info:
|
||||
```
|
||||
[image] [xmin] [y_min] [x_max] [y_max] [label]
|
||||
```
|
||||
|
||||
<p align="center">
|
||||
<img src="./media/carcans_vott_ui.jpg" width="800" align="center"/>
|
||||
</p>
|
||||
|
||||
Under the hood (not exposed to the user), the FairMOT repository uses the following annotation format for training, as described in the [Towards-Realtime-MOT](https://github.com/Zhongdao/Towards-Realtime-MOT) repository, where each line describes a bounding box:
|
||||
```
|
||||
[class] [identity] [x_center] [y_center] [width] [height]
|
||||
```
|
||||
|
||||
The `class` field is set to 0 for all objects, as only single-class multi-object tracking (e.g. cans) is currently supported by the FairMOT repo. The `identity` field is an integer from 0 to `num_identities - 1` which maps the different object identities (e.g. coke can, coffee can, etc.) to integers. The values of `[x_center] [y_center] [width] [height]` are normalized by the width/height of the image and range from 0 to 1.
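To make the normalization concrete, here is a small illustrative Python helper (the function name and example numbers are made up, not taken from the repository) that turns a pixel-space box into one line of this training format:

```python
def to_fairmot_training_row(identity: int, left: float, top: float,
                            box_w: float, box_h: float,
                            img_w: int, img_h: int) -> str:
    """Build one training-annotation line: [class] [identity] [x_center] [y_center] [width] [height].
    class is always 0 (single-class tracking); coordinates are normalized to [0, 1]."""
    x_center = (left + box_w / 2) / img_w
    y_center = (top + box_h / 2) / img_h
    return f"0 {identity} {x_center:.6f} {y_center:.6f} {box_w / img_w:.6f} {box_h / img_h:.6f}"

# Example: a 100x200 px box at (50, 80) in a 1920x1080 frame, identity 3
print(to_fairmot_training_row(3, 50, 80, 100, 200, 1920, 1080))
```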
|
||||
|
||||
|
||||
|
||||
## Training and inference
|
||||
|
||||
### What are the training losses in FairMOT?
|
||||
|
||||
Losses generated by FairMOT include detection-specific losses (e.g. hm_loss, wh_loss, off_loss) and id-specific losses (id_loss). The overall loss (loss) is a weighted average of the detection-specific and id-specific losses, see the [FairMOT paper](https://arxiv.org/pdf/2004.01888v2.pdf).
|
||||
|
||||
### What are the main inference parameters in FairMOT?
|
||||
|
||||
- input_w and input_h: image resolution of the dataset video frames
|
||||
- conf_thres, nms_thres, min_box_area: thresholds used to filter out detections that do not meet the required confidence level, NMS overlap level, and minimum box size;
|
||||
- track_buffer: if a lost track is not matched for some number of frames as determined by this threshold, it is deleted, i.e. the id is not reused.
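As an illustration of how such thresholds act (a generic sketch, not FairMOT's actual filtering code; the detection tuple layout is an assumption), confidence and box-area filtering before association might look like:

```python
def filter_detections(detections, conf_thres=0.4, min_box_area=100.0):
    """Keep detections shaped as (x1, y1, x2, y2, score) whose score and area pass the thresholds."""
    kept = []
    for x1, y1, x2, y2, score in detections:
        area = max(x2 - x1, 0.0) * max(y2 - y1, 0.0)
        if score >= conf_thres and area >= min_box_area:
            kept.append((x1, y1, x2, y2, score))
    return kept

# Example: only the first detection survives the default thresholds
print(filter_detections([(10, 10, 60, 120, 0.9), (5, 5, 12, 12, 0.95), (0, 0, 50, 50, 0.2)]))
```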
|
||||
|
||||
|
||||
|
||||
|
||||
## Evaluation
|
||||
|
||||
### What is the MOT Challenge?
|
||||
The [MOT Challenge](https://motchallenge.net/) website hosts the most common benchmarking datasets for pedestrian MOT. Different datasets exist: MOT15, MOT16/17, and MOT19/20. These datasets contain many video sequences with different tracking difficulty levels and annotated ground-truth. Detections are also provided for optional use by the participating tracking algorithms.
|
||||
|
||||
|
||||
### What are the commonly used evaluation metrics?
|
||||
As multi-object-tracking is a complex CV task, there exist many different metrics to evaluate tracking performance. Based on how they are computed, metrics can be event-based [CLEARMOT metrics](https://link.springer.com/content/pdf/10.1155/2008/246309.pdf) or [id-based metrics](https://arxiv.org/pdf/1609.01775.pdf). The main metrics used to gauge performance in the [MOT benchmarking challenge](https://motchallenge.net/results/MOT16/) include MOTA, IDF1, and ID-switch.
|
||||
|
||||
* MOTA (Multiple Object Tracking Accuracy) gauges overall accuracy performance using an event-based computation of how often mismatch occurs between the tracking results and ground-truth. MOTA contains the counts of FP (false-positive), FN (false negative), and id-switches (IDSW) normalized over the total number of ground-truth (GT) tracks.
|
||||
|
||||
<p align="center">
|
||||
<img src="./media/eqn_mota.jpg" width="300" align="center"/>
|
||||
</p>
|
||||
|
||||
* IDF1 measures overall performance with id-based computation of how long the tracker correctly identifies the target. It is the harmonic mean of identification precision (IDP) and recall (IDR).
|
||||
|
||||
<p align="center">
|
||||
<img src="./media/eqn_idf1.jpg" width="450" align="center"/>
|
||||
</p>
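For reference, the standard definitions that the two equation images above illustrate (from the CLEAR MOT and identity-metrics papers linked earlier) can be written as:

```latex
\mathrm{MOTA} = 1 - \frac{\sum_t \left(\mathrm{FN}_t + \mathrm{FP}_t + \mathrm{IDSW}_t\right)}{\sum_t \mathrm{GT}_t}
\qquad
\mathrm{IDF1} = \frac{2\,\mathrm{IDTP}}{2\,\mathrm{IDTP} + \mathrm{IDFP} + \mathrm{IDFN}}
             = \frac{2\,\mathrm{IDP}\cdot\mathrm{IDR}}{\mathrm{IDP} + \mathrm{IDR}}
```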
|
||||
|
||||
* ID-switch measures when the tracker incorrectly changes the ID of a trajectory. This is illustrated in the following figure: in the left box, person A and person B overlap and are not detected and tracked in frames 4-5. This results in an id-switch in frame 6, where person A is attributed the ID_2, which was previously tagged as person B. In another example in the right box, the tracker loses track of person A (initially identified as ID_1) after frame 3, and eventually identifies that person with a new ID (ID_2) in frame n, showing another instance of id-switch.
|
||||
|
||||
<p align="center">
|
||||
<img src="./media/fig_tracksEval.jpg" width="600" align="center"/>
|
||||
</p>
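These metrics do not have to be computed by hand. The following is a minimal sketch of computing MOTA, IDF1 and ID-switches for a single frame with [py-motmetrics](https://github.com/cheind/py-motmetrics); the toy box coordinates and IDs are made up for illustration:

```python
import motmetrics as mm

acc = mm.MOTAccumulator(auto_id=True)

# One frame: two ground-truth objects, two hypotheses (boxes are [x, y, width, height])
gt_ids, hyp_ids = [1, 2], [101, 102]
gt_boxes = [[10, 20, 50, 80], [200, 40, 60, 90]]
hyp_boxes = [[12, 22, 50, 80], [205, 38, 60, 90]]

# IoU-based distance matrix; pairs with IoU below 0.5 are treated as non-matches
dists = mm.distances.iou_matrix(gt_boxes, hyp_boxes, max_iou=0.5)
acc.update(gt_ids, hyp_ids, dists)

mh = mm.metrics.create()
summary = mh.compute(acc, metrics=["mota", "idf1", "num_switches"], name="toy_sequence")
print(summary)
```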
|
||||
|
||||
|
||||
|
||||
|
||||
## State-of-the-Art
|
||||
|
||||
### What is the architecture of the FairMOT tracking algorithm?
|
||||
It consists of a single encoder-decoder neural network which extracts high resolution feature maps of the image frame. As a one-shot tracker, these feed into two parallel heads for predicting bounding boxes and re-id features respectively, see [source](https://arxiv.org/pdf/2004.01888v2.pdf):
|
||||
It consists of a single encoder-decoder neural network that extracts high-resolution feature maps of the image frame. As a one-shot tracker, it feeds these feature maps into two parallel heads that predict bounding boxes and re-id features respectively; see the [source](https://arxiv.org/pdf/2004.01888v2.pdf):
|
||||
<p align="center">
|
||||
<img src="./media/figure_fairMOTarc.jpg" width="800" align="center"/>
|
||||
</p>
|
||||
|
@ -87,16 +119,16 @@ Source: [Zhang, 2020](https://arxiv.org/pdf/2004.01888v2.pdf)
|
|||
</center>
|
||||
|
||||
|
||||
### What are SoTA object detectors used in tracking-by-detection trackers?
|
||||
### What object detectors are used in tracking-by-detection trackers?
|
||||
The most popular object detectors used by SoTA tracking algorithms include: [Faster R-CNN](https://arxiv.org/pdf/1506.01497.pdf), [SSD](https://arxiv.org/pdf/1512.02325.pdf) and [YOLOv3](https://arxiv.org/pdf/1804.02767.pdf). Please see our [object detection FAQ page](../detection/faq.md) for more details.
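As a minimal, self-contained illustration (using torchvision's pre-trained Faster R-CNN rather than any detector from this repository, with an illustrative file name), obtaining per-frame detections that a tracking-by-detection pipeline could consume might look like:

```python
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

# Pre-trained Faster R-CNN on COCO; any frame-level detector could be swapped in here.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
model.eval()

frame = Image.open("frame_000001.jpg").convert("RGB")  # illustrative file name
with torch.no_grad():
    output = model([to_tensor(frame)])[0]  # dict with "boxes", "labels", "scores"

# Keep confident person detections (COCO label 1) as tracker input
keep = (output["scores"] > 0.5) & (output["labels"] == 1)
detections = output["boxes"][keep]
print(detections)
```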
|
||||
|
||||
|
||||
### What are SoTA feature extraction techniques used in tracking-by-detection trackers?
|
||||
While older algorithms used local features such as optical flow or regional features (e.g. color histograms, gradient-based features or covariance matrix), newer algorithms have a deep-learning based feature representation. The most common deep-learning approaches use classical CNN to extract visual features, typically trained on re-id datasets, such as the [MARS dataset](http://www.liangzheng.com.cn/Project/project_mars.html). The following figure is an example of a CNN used for MOT by the [DeepSORT tracker](https://arxiv.org/pdf/1703.07402.pdf):
|
||||
### What feature extraction techniques are used in tracking-by-detection trackers?
|
||||
While older algorithms used local features, such as optical flow or regional features (e.g. color histograms, gradient-based features or covariance matrix), newer algorithms have deep-learning based feature representations. The most common deep-learning approaches, typically trained on re-id datasets, use classical CNNs to extract visual features. One such dataset is the [MARS dataset](http://www.liangzheng.com.cn/Project/project_mars.html). The following figure is an example of a CNN used for MOT by the [DeepSORT tracker](https://arxiv.org/pdf/1703.07402.pdf):
|
||||
<p align="center">
|
||||
<img src="./media/figure_DeepSortCNN.jpg" width="600" align="center"/>
|
||||
</p>
|
||||
Newer deep-learning approaches include Siamese CNN networks, LSTM networks, or CNN with correlation filters. In Siamese CNN networks, a pair of CNN networks is used to measure similarity between two objects, and the CNNs are trained with loss functions that learn features that best differentiates them.
|
||||
Newer deep-learning approaches include Siamese CNN networks, LSTM networks, or CNN with correlation filters. In Siamese CNN networks, a pair of identical CNN networks are used to measure similarity between two objects, and the CNNs are trained with loss functions that learn features that best differentiate them.
|
||||
<p align="center">
|
||||
<img src="./media/figure_SiameseNetwork.jpg" width="400" align="center"/>
|
||||
</p>
|
||||
|
@ -106,7 +138,7 @@ Newer deep-learning approaches include Siamese CNN networks, LSTM networks, or C
|
|||
|
||||
</center>
|
||||
|
||||
In LSTM network, extracted features from different detections in different time frames are used as inputs to a LSTM network, which predicts the bounding box for the next frame based on the input history.
|
||||
In an LSTM network, extracted features from different detections in different time frames are used as inputs. The network predicts the bounding box for the next frame based on the input history.
|
||||
<p align="center">
|
||||
<img src="./media/figure_LSTM.jpg" width="550" align="center"/>
|
||||
</p>
|
||||
|
@ -122,52 +154,55 @@ Correlation filters can also be convolved with feature maps from CNN network to
|
|||
</p>
|
||||
|
||||
|
||||
### What are SoTA affinity and association techniques used in tracking-by-detection trackers?
|
||||
Simple approaches use similarity/affinity scores calculated from distance measures over features extracted by the CNN to optimally match object detections/tracklets with established object tracks across successive frames. To do this matching, the Hungarian (Kuhn-Munkres) algorithm is often used for online data association, while K-partite graph global optimization techniques are used for offline data association.
|
||||
### What affinity and association techniques are used in tracking-by-detection trackers?
|
||||
Simple approaches use similarity/affinity scores calculated from distance measures over features extracted by the CNN to optimally match object detections/tracklets with established object tracks across successive frames. To do this matching, the Hungarian (Kuhn-Munkres) algorithm is often used for online data association, while K-partite graph global optimization techniques are used for offline data association.
|
||||
|
||||
In more complex deep-learning approaches, the affinity computation is often merged with feature extraction. For instance, [Siamese CNNs](https://arxiv.org/pdf/1907.12740.pdf) and [Siamese LSTMs](http://openaccess.thecvf.com/content_cvpr_2018_workshops/papers/w21/Wan_An_Online_and_CVPR_2018_paper.pdf) directly output the affinity score.
|
||||
|
||||
|
||||
### What are the main evaluation metrics?
|
||||
As multi-object-tracking is a complex CV task, there exists many different metrics to evaluate the tracking performance. Based on how they are computed, metrics can be event-based [CLEARMOT metrics](https://link.springer.com/content/pdf/10.1155/2008/246309.pdf) or [id-based metrics](https://arxiv.org/pdf/1609.01775.pdf). The main metrics used to gauge performance in the [MOT benchmarking challenge](https://motchallenge.net/results/MOT16/) include MOTA, IDF1, and ID-switch.
|
||||
* MOTA (Multiple Object Tracking Accuracy): it gauges overall accuracy performance, with event-based computation of how often mismatch occurs between the tracking results and ground-truth. MOTA contains the counts of FP (false-positive), FN(false negative) and id-switches (IDSW), normalized over the total number of ground-truth (GT) tracks.
|
||||
<p align="center">
|
||||
<img src="./media/eqn_mota.jpg" width="200" align="center"/>
|
||||
</p>
|
||||
|
||||
* IDF1: gauges overall performance, with id-based computation of how long the tracker correctly identifies the target. It is the harmonic mean of identification precision (IDP) and recall (IDR):
|
||||
<p align="center">
|
||||
<img src="./media/eqn_idf1.jpg" width="450" align="center"/>
|
||||
</p>
|
||||
|
||||
* ID-switch: when the tracker incorrectly changes the ID of the trajectory. This is illustrated in the following figure: in the left box, person A and person B overlap and are not detected and tracked in frames 4-5. This results in an id-switch in frame 6, where person A is attributed the ID_2, which previously tagged person B. In another example in the right box, the tracker loses track of person A (initially identified as ID_1) after frame 3, and eventually identifies that person with a new ID (ID_2) in frame n, showing another instance of id-switch.
|
||||
### What is the difference between online and offline tracking algorithms?
|
||||
Online and offline algorithms differ at their data association step. In online tracking, the detections in a new frame are associated with tracks generated from previous frames. Thus, existing tracks are extended or new tracks are created. In offline (batch) tracking, all observations in a batch of frames can be considered globally (see figure below), i.e. they are linked together into tracks by obtaining a globally optimal solution. Offline tracking copes better with issues such as long-term occlusion, or similar targets that are spatially close. However, offline tracking tends to be slower and hence not suitable for tasks which require real-time processing, such as autonomous driving.
|
||||
|
||||
<p align="center">
|
||||
<img src="./media/fig_tracksEval.jpg" width="600" align="center"/>
|
||||
<img src="./media/fig_onlineBatch.jpg" width="400" align="center"/>
|
||||
</p>
|
||||
|
||||
|
||||
|
||||
|
||||
## Training and inference
|
||||
## Popular publications and datasets
|
||||
|
||||
### Popular Datasets
|
||||
|
||||
<center>
|
||||
|
||||
| Name | Year | Duration | # tracks/ids | Scene | Object type |
|
||||
| ----- | ----- | -------- | -------------- | ----- | ---------- |
|
||||
| [MOT15](https://arxiv.org/pdf/1504.01942.pdf)| 2015 | 16 min | 1221 | Outdoor | Pedestrians |
|
||||
| [MOT16/17](https://arxiv.org/pdf/1603.00831.pdf)| 2016 | 9 min | 1276 | Outdoor & indoor | Pedestrians & vehicles |
|
||||
| [CVPR19/MOT20](https://arxiv.org/pdf/1906.04567.pdf)| 2019 | 26 min | 3833 | Crowded scenes | Pedestrians & vehicles |
|
||||
| [PathTrack](http://openaccess.thecvf.com/content_ICCV_2017/papers/Manen_PathTrack_Fast_Trajectory_ICCV_2017_paper.pdf)| 2017 | 172 min | 16287 | YouTube people scenes | Persons |
|
||||
| [Visdrone](https://arxiv.org/pdf/1804.07437.pdf)| 2019 | - | - | Outdoor view from drone camera | Pedestrians & vehicles |
|
||||
| [KITTI](http://www.jimmyren.com/papers/rrc_kitti.pdf)| 2012 | 32 min | - | Traffic scenes from car camera | Pedestrians & vehicles |
|
||||
| [UA-DETRAC](https://arxiv.org/pdf/1511.04136.pdf) | 2015 | 10h | 8200 | Traffic scenes | Vehicles |
|
||||
| [CamNeT](https://vcg.ece.ucr.edu/sites/g/files/rcwecm2661/files/2019-02/egpaper_final.pdf) | 2015 | 30 min | 30 | Outdoor & indoor | Persons |
|
||||
|
||||
</center>
|
||||
|
||||
|
||||
### What are the main training parameters in FairMOT?
|
||||
The main training parameters include batch size, learning rate, and number of epochs. Additionally, FairMOT uses PyTorch's Adam algorithm as the default optimizer.
|
||||
### Popular publications
|
||||
|
||||
<center>
|
||||
|
||||
### How to improve training accuracy?
|
||||
One can improve the training procedure by modifying the learning rate and number of epochs.
|
||||
| Name | Year | MOT16 IDF1 | MOT16 MOTA | Inference Speed(fps) | Online/ Batch | Detector | Feature extraction/ motion model | Affinity & Association Approach |
|
||||
| ---- | ---- | ---------- | ---------- | -------------------- | ------------- | -------- | -------------------------------- | -------------------- |
|
||||
|[A Simple Baseline for Multi-object Tracking -FairMOT](https://arxiv.org/pdf/2004.01888.pdf)|2020|70.4|68.7|25.8|Online|One-shot tracker with detector head|One-shot tracker with re-id head & multi-layer feature aggregation, IOU, Kalman Filter| JV algorithm on IOU, embedding distance,
|
||||
|[How to Train Your Deep Multi-Object Tracker -DeepMOT-Tracktor](https://arxiv.org/pdf/1906.06618v2.pdf)|2020|53.4|54.8|1.6|Online|Single object tracker: Faster-RCNN (Tracktor), GO-TURN, SiamRPN|Tracktor, CNN re-id module|Deep Hungarian Net using Bi-RNN|
|
||||
|[Tracking without bells and whistles -Tracktor](https://arxiv.org/pdf/1903.05625.pdf)|2019|54.9|56.2|1.6|Online|Modified Faster-RCNN| Temporal bbox regression with bbox camera motion compensation, re-id embedding from Siamese CNN| Greedy heuristic to merge tracklets using re-id embedding distance|
|
||||
|[Towards Real-Time Multi-Object Tracking -JDE](https://arxiv.org/pdf/1909.12605v1.pdf)|2019|55.8|64.4|18.5|Online|One-shot tracker - Faster R-CNN with FPN|One-shot - Faster R-CNN with FPN, Kalman Filter|Hungarian Algorithm|
|
||||
|[Exploit the connectivity: Multi-object tracking with TrackletNet -TNT](https://arxiv.org/pdf/1811.07258.pdf)|2019|56.1|49.2|0.7|Batch|MOT challenge detections|CNN with bbox camera motion compensation, embedding feature similarity|CNN-based similarity measures between tracklet pairs; tracklet-based graph-cut optimization|
|
||||
|[Extending IOU based Multi-Object Tracking by Visual Information -VIOU](http://elvera.nue.tu-berlin.de/typo3/files/1547Bochinski2018.pdf)|2018|56.1(VisDrone)|40.2(VisDrone)|20(VisDrone)|Batch|Mask R-CNN, CompACT|IOU|KCF to merge tracklets using greedy IOU heuristics|
|
||||
|[Simple Online and Realtime Tracking with a Deep Association Metric -DeepSORT](https://arxiv.org/pdf/1703.07402v1.pdf)|2017|62.2| 61.4|17.4|Online|Modified Faster R-CNN|CNN re-id module, IOU, Kalman Filter|Hungarian Algorithm, cascaded approach using Mahalanobis distance (motion), embedding distance |
|
||||
|[Multiple people tracking by lifted multicut and person re-identification -LMP](http://openaccess.thecvf.com/content_cvpr_2017/papers/Tang_Multiple_People_Tracking_CVPR_2017_paper.pdf)|2017|51.3|48.8|0.5|Batch|[Public detections](https://arxiv.org/pdf/1610.06136.pdf)|StackeNetPose CNN re-id module|Spatio-temporal relations, deep-matching, re-id confidence; detection-based graph lifted-multicut optimization|
|
||||
|
||||
### What are the training losses for MOT using FairMOT?
|
||||
Losses generated by the FairMOT include detection-specific losses (e.g. hm_loss, wh_loss, off_loss) and id-specific losses (id_loss). The overall loss (loss) is a weighted average of the detection-specific and id-specific losses, see the [FairMOT paper](https://arxiv.org/pdf/2004.01888v2.pdf).
|
||||
|
||||
### What are the main inference parameters in FairMOT?
|
||||
- input_w and input_h: image resolution of the dataset video frames;
|
||||
- conf_thres, nms_thres, min_box_area: these thresholds used to filter out detections that do not meet the confidence level, nms level and size as per the user requirement;
|
||||
- track_buffer: if a lost track is not matched for some number of frames as determined by this threshold, it is deleted, i.e. the id is not reused.
|
||||
|
||||
## MOT Challenge
|
||||
|
||||
### What is the MOT Challenge?
|
||||
It hosts the most common benchmarking datasets for pedestrian MOT. Different datasets exist: MOT15, MOT16/17, MOT 19/20. These datasets contain many video sequences, with different tracking difficulty levels, with annotated ground-truth. Detections are also provided for optional use by the participating tracking algorithms.
|
||||
</center>
|
||||
|
|
|
@ -1,17 +1,51 @@
|
|||
# Multi-Object Tracking
|
||||
|
||||
```diff
|
||||
+ June 2020: All notebooks/code in this directory are work-in-progress and might not fully execute.
|
||||
This directory provides examples and best practices for building and running inference with multi-object tracking systems. Our goal is to enable users to bring their own datasets and to train a high-accuracy tracking model with ease. While there are many open-source trackers available, we have integrated the [FairMOT](https://github.com/ifzhang/FairMOT) tracker into this repository. The FairMOT algorithm has shown competitive tracking performance in recent MOT benchmarking challenges, while also having respectable inference speeds.
|
||||
|
||||
|
||||
## Setup
|
||||
|
||||
The tracking examples in this folder only run on Linux compute targets due to constraints introduced by the [FairMOT](https://github.com/ifzhang/FairMOT) repository.
|
||||
|
||||
The following libraries need to be installed in the `cv` conda environment before being able to run the provided notebooks:
|
||||
```
|
||||
activate cv
|
||||
conda install -c conda-forge opencv yacs lap progress
|
||||
pip install cython_bbox motmetrics
|
||||
```
|
||||
|
||||
This directory provides examples and best practices for building multi-object tracking systems. Our goal is to enable users to bring their own datasets and train a high-accuracy tracking model easily. While there are many open-source trackers available, we have implemented the [FairMOT tracker](https://github.com/ifzhang/FairMOT) specifically, as its algorithm has shown competitive tracking performance in recent MOT benchmarking challenges, at fast inference speed.
|
||||
In addition, FairMOT's DCNv2 library needs to be compiled using this step:
|
||||
```
|
||||
cd utils_cv/tracking/references/fairmot/models/networks/DCNv2
|
||||
sh make.sh
|
||||
```
|
||||
|
||||
|
||||
|
||||
## Why FairMOT?
|
||||
FairMOT is an [open-source](https://github.com/ifzhang/FairMOT), one-shot online tracking algorithm that has shown [competitive performance in recent MOT benchmarking challenges](https://motchallenge.net/method/MOT=3015&chl=5) at fast inferencing speeds.
|
||||
|
||||
Typical tracking algorithms address the detection and feature extraction processes in distinct successive steps. Recent research - [(Voigtlaender et al, 2019)](http://openaccess.thecvf.com/content_CVPR_2019/papers/Voigtlaender_MOTS_Multi-Object_Tracking_and_Segmentation_CVPR_2019_paper.pdf), [(Wang et al, 2019)](https://arxiv.org/pdf/1909.12605.pdf), [(Zhang et al, 2020)](https://arxiv.org/pdf/2004.01888.pdf) - has moved toward combining the detection and feature embedding processes such that they are learned in a shared model (single network), particularly when both steps involve deep learning models. This framework is called single-shot or one-shot, and has become popular in recent, high-performing models, such as FairMOT [(Zhang et al, 2020)](https://arxiv.org/pdf/2004.01888.pdf), JDE [(Wang et al, 2019)](https://arxiv.org/pdf/1909.12605.pdf) and TrackRCNN [(Voigtlaender et al, 2019)](http://openaccess.thecvf.com/content_CVPR_2019/papers/Voigtlaender_MOTS_Multi-Object_Tracking_and_Segmentation_CVPR_2019_paper.pdf). Such single-shot models are more efficient than typical tracking-by-detection models and have shown faster inference speeds due to the shared computation of the single network representation of the detection and feature embedding. On the [MOT16 Challenge dataset](https://motchallenge.net/results/MOT16/), FairMOT and JDE achieve 25.8 frames per second (fps) and 18.5 fps respectively, while DeepSORT_2, a tracking-by-detection tracker, achieves 17.4 fps.
|
||||
|
||||
As seen in the table below, the FairMOT model has improved tracking performance when compared to standard MOT trackers (please see [below](#what-are-the-commonly-used-evaluation-metrics) for more details on performance metrics). The JDE model, on which FairMOT builds, has a much worse ID-switch number [(Zhang et al, 2020)](https://arxiv.org/pdf/2004.01888.pdf). The JDE model uses a typical anchor-based object detector network for feature embedding with a downsampled feature map. This leads to a misalignment between the anchors and the object center, therefore causing re-id issues. FairMOT solves these issues by estimating the object center instead of the anchors, using a higher resolution feature map for object detection and feature embedding, and by aggregating high-level and low-level features to handle scale variations across different sizes of objects.
|
||||
|
||||
<center>
|
||||
|
||||
| Tracker | MOTA | IDF1 | ID-Switch | fps |
|
||||
| -------- | ---- | ---- | --------- | --- |
|
||||
|DeepSORT_2| 61.4 | 62.2 | 781 | 17.4 |
|
||||
|JDE| 64.4 | 55.8 | 1544 | 18.5 |
|
||||
|FairMOT| 68.7 | 70.4 | 953 | 25.8 |
|
||||
|
||||
</center>
|
||||
|
||||
|
||||
## Technology
|
||||
Multi-object-tracking (MOT) is one of the hot research topics in Computer Vision, due to its wide applications in autonomous driving, traffic surveillance, etc. It builds on object detection technology, in order to detect and track all objects in a dynamic scene over time. Inferring target trajectories correctly across successive image frames remains challenging: occlusion happens when objects overlap; the number of and appearance of objects can change. Compared to object detection algorithms, which aim to output rectangular bounding boxes around the objects, MOT algorithms additionally associated an ID number to each box to identify that specific object across the image frames.
|
||||
Due to its applications in autonomous driving, traffic surveillance, etc., multi-object-tracking (MOT) is a popular and growing area of research within Computer Vision. MOT builds on object detection technology to detect and track objects in a dynamic scene over time. Inferring target trajectories correctly across successive image frames remains challenging. For example, occlusion can cause the number and appearance of objects to change, resulting in complications for MOT algorithms. Compared to object detection algorithms, which aim to output rectangular bounding boxes around the objects, MOT algorithms additionally associate an ID number with each box to identify that specific object across the image frames.
|
||||
|
||||
As seen in the figure below ([Ciaparrone, 2019](https://arxiv.org/pdf/1907.12740.pdf)), a typical multi-object-tracking algorithm performs part or all of the following steps:
|
||||
* Detection: Given the input raw image frames (step 1), the detector identifies object(s) on each image frame as bounding box(es) (step 2).
|
||||
* Feature extraction/motion prediction: For every detected object, visual appearance and motion features are extracted (step 3). Sometimes, a motion predictor (e.g. Kalman Filter) is also added to predict the next position of each tracked target.
|
||||
* Detection: Given the input raw image frames (step 1), the detector identifies object(s) in each image frame as bounding box(es) (step 2).
|
||||
* Feature extraction/motion prediction: For every detected object, visual appearance and motion features are extracted (step 3). A motion predictor (e.g. Kalman Filter) is occasionally also added to predict the next position of each tracked target.
|
||||
* Affinity: The feature and motion predictions are used to calculate similarity/distance scores between pairs of detections and/or tracklets, or the probabilities of detections belonging to a given target or tracklet (step 4).
|
||||
* Association: Based on these scores/probabilities, a specific numerical ID is assigned to each detected object as it is tracked across successive image frames (step 5).
|
||||
|
||||
|
@ -20,74 +54,20 @@ As seen in the figure below ([Ciaparrone, 2019](https://arxiv.org/pdf/1907.12740
|
|||
</p>
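The following is a compact, framework-agnostic sketch of steps 2-5 above (detections as input, affinity computation, and association). It is illustrative only: it uses plain IoU as the affinity measure and Hungarian matching via `scipy.optimize.linear_sum_assignment`, rather than any tracker implemented in this repository.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """IoU of two boxes given as [x1, y1, x2, y2]."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(x2 - x1, 0) * max(y2 - y1, 0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def associate(tracks, detections, iou_thres=0.3):
    """Assign the current frame's detections to existing tracks (Hungarian matching on IoU).
    tracks: dict of id -> last box; detections: list of boxes. Returns updated tracks."""
    matched_dets = set()
    if tracks and detections:
        track_ids = list(tracks)
        cost = np.array([[1.0 - iou(tracks[tid], det) for det in detections] for tid in track_ids])
        rows, cols = linear_sum_assignment(cost)  # affinity/association step: minimize total (1 - IoU)
        for r, c in zip(rows, cols):
            if cost[r, c] <= 1.0 - iou_thres:     # accept only sufficiently overlapping pairs
                tracks[track_ids[r]] = detections[c]
                matched_dets.add(c)
    # unmatched detections start new tracks (new IDs)
    next_id = max(tracks, default=0) + 1
    for i, det in enumerate(detections):
        if i not in matched_dets:
            tracks[next_id] = det
            next_id += 1
    return tracks

# Toy example: two frames of detections; both objects keep their IDs across frames
tracks = {}
tracks = associate(tracks, [[0, 0, 10, 10], [50, 50, 60, 60]])
tracks = associate(tracks, [[1, 1, 11, 11], [49, 51, 59, 61]])
print(tracks)
```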
|
||||
|
||||
|
||||
## State-of-the-art (SoTA)
|
||||
|
||||
### Tracking-by-detection (two-step) vs one-shot tracker
|
||||
|
||||
Typical tracking algorithms address the detection and feature extraction processes in distinct successive steps. Recent research -[(Voigtlaender et al, 2019)](http://openaccess.thecvf.com/content_CVPR_2019/papers/Voigtlaender_MOTS_Multi-Object_Tracking_and_Segmentation_CVPR_2019_paper.pdf), [(Wang et al, 2019)](https://arxiv.org/pdf/1909.12605.pdf), [(Zhang et al, 2020)](https://arxiv.org/pdf/1909.12605.pdf)- has moved onto combining the detection and feature embedding processes such that they are learned in a shared model (single network), particularly when both steps involving deep learning models. This framework is called single-shot or one-shot, and recent models include FairMOT [(Zhang et al, 2020)](https://arxiv.org/pdf/1909.12605.pdf), JDE [(Wang et al, 2019)](https://arxiv.org/pdf/1909.12605.pdf) and TrackRCNN [(Voigtlaender et al, 2019)](http://openaccess.thecvf.com/content_CVPR_2019/papers/Voigtlaender_MOTS_Multi-Object_Tracking_and_Segmentation_CVPR_2019_paper.pdf). Such single-shot models are more efficient than typical tracking-by-detection models and have shown faster inference speeds due to the shared computation of the single network representation of the detection and feature embedding: on the [MOT16 Challenge dataset](https://motchallenge.net/results/MOT16/), FAIRMOT and JDE achieve 30 fps and 18.5 fps respectively, while DeepSORT_2, a tracking-by-detection tracker with lower performance achieves 17.4 fps.
|
||||
|
||||
As seen in the table below, the FairMOT model has a much improved tracking performance - MOTA, IDF1 (please see the [FAQ](FAQ.md) for more details on performance metrics)-, while the JDE model has a much worse ID-switch number [(Zhang et al, 2020)](https://arxiv.org/pdf/1909.12605.pdf). This is because the FairMOT model uses a typical anchor-based object detector network for feature embedding with a downsampled feature map, leading to a mis-alignment between the anchors and object center, hence re-iding issues. FairMOT solves these issues by: (i) estimating the object center instead of the anchors and using a higher resolution feature map for object detection and feature embedding, (ii) aggregating high-level and low-level features to handle scale variations across different sizes of objects.
|
||||
|
||||
<center>
|
||||
|
||||
| Tracker | MOTA | IDF1 | ID-Switch | fps |
|
||||
| -------- | ---- | ---- | --------- | --- |
|
||||
|DeepSORT_2| 61.4 | 62.2 | 781 | 17.4 |
|
||||
|JDE| 64.4 | 64.4 | 55.8 |1544 | 18.5 |
|
||||
|FairMOT| 68.7 | 70.4 | 953 | 25.8 |
|
||||
|
||||
</center>
|
||||
|
||||
|
||||
### Popular datasets
|
||||
|
||||
<center>
|
||||
|
||||
| Name | Year | Duration | # tracks/ids | Scene | Object type |
|
||||
| ----- | ----- | -------- | -------------- | ----- | ---------- |
|
||||
| [MOT15](https://arxiv.org/pdf/1504.01942.pdf)| 2015 | 16 min | 1221 | Outdoor | Pedestrians |
|
||||
| [MOT16/17](https://arxiv.org/pdf/1603.00831.pdf)| 2016 | 9 min | 1276 | Outdoor & indoor | Pedestrians & vehicles |
|
||||
| [CVPR19/MOT20](https://arxiv.org/pdf/1906.04567.pdf)| 2019 | 26 min | 3833 | Crowded scenes | Pedestrians & vehicles |
|
||||
| [PathTrack](http://openaccess.thecvf.com/content_ICCV_2017/papers/Manen_PathTrack_Fast_Trajectory_ICCV_2017_paper.pdf)| 2017 | 172 min | 16287 | Youtube people scenes | Persons |
|
||||
| [Visdrone](https://arxiv.org/pdf/1804.07437.pdf)| 2019 | - | - | Outdoor view from drone camera | Pedestrians & vehicles |
|
||||
| [KITTI](http://www.jimmyren.com/papers/rrc_kitti.pdf)| 2012 | 32 min | - | Traffic scenes from car camera | Pedestrians & vehicles |
|
||||
| [UA-DETRAC](https://arxiv.org/pdf/1511.04136.pdf) | 2015 | 10h | 8200 | Traffic scenes | Vehicles |
|
||||
| [CamNeT](https://vcg.ece.ucr.edu/sites/g/files/rcwecm2661/files/2019-02/egpaper_final.pdf) | 2015 | 30 min | 30 | Outdoor & indoor | Persons |
|
||||
|
||||
</center>
|
||||
|
||||
### Popular publications
| Name | Year | MOT16 IDF1 | MOT16 MOTA | Inference Speed (fps) | Online/ Batch | Detector | Feature extraction/ motion model | Affinity & Association Approach |
| ---- | ---- | ---------- | ---------- | -------------------- | ------------- | -------- | -------------------------------- | -------------------- |
| [A Simple Baseline for Multi-object Tracking -FairMOT](https://arxiv.org/pdf/2004.01888.pdf) | 2020 | 70.4 | 68.7 | 25.8 | Online | One-shot tracker with detector head | One-shot tracker with re-id head & multi-layer feature aggregation, IOU, Kalman Filter | JV algorithm on IOU, embedding distance |
| [How to Train Your Deep Multi-Object Tracker -DeepMOT-Tracktor](https://arxiv.org/pdf/1906.06618v2.pdf) | 2020 | 53.4 | 54.8 | 1.6 | Online | Single object tracker: Faster-RCNN (Tracktor), GO-TURN, SiamRPN | Tracktor, CNN re-id module | Deep Hungarian Net using Bi-RNN |
| [Tracking without bells and whistles -Tracktor](https://arxiv.org/pdf/1903.05625.pdf) | 2019 | 54.9 | 56.2 | 1.6 | Online | Modified Faster-RCNN | Temporal bbox regression with bbox camera motion compensation, re-id embedding from Siamese CNN | Greedy heuristic to merge tracklets using re-id embedding distance |
| [Towards Real-Time Multi-Object Tracking -JDE](https://arxiv.org/pdf/1909.12605v1.pdf) | 2019 | 55.8 | 64.4 | 18.5 | Online | One-shot tracker - Faster R-CNN with FPN | One-shot - Faster R-CNN with FPN, Kalman Filter | Hungarian Algorithm |
| [Exploit the connectivity: Multi-object tracking with TrackletNet -TNT](https://arxiv.org/pdf/1811.07258.pdf) | 2019 | 56.1 | 49.2 | 0.7 | Batch | MOT challenge detections | CNN with bbox camera motion compensation, embedding feature similarity | CNN-based similarity measures between tracklet pairs; tracklet-based graph-cut optimization |
| [Extending IOU based Multi-Object Tracking by Visual Information -VIOU](http://elvera.nue.tu-berlin.de/typo3/files/1547Bochinski2018.pdf) | 2018 | 56.1 (VisDrone) | 40.2 (VisDrone) | 20 (VisDrone) | Batch | Mask R-CNN, CompACT | IOU | KCF to merge tracklets using greedy IOU heuristics |
| [Simple Online and Realtime Tracking with a Deep Association Metric -DeepSORT](https://arxiv.org/pdf/1703.07402v1.pdf) | 2017 | 62.2 | 61.4 | 17.4 | Online | Modified Faster R-CNN | CNN re-id module, IOU, Kalman Filter | Hungarian Algorithm, cascaded approach using Mahalanobis distance (motion), embedding distance |
| [Multiple people tracking by lifted multicut and person re-identification -LMP](http://openaccess.thecvf.com/content_cvpr_2017/papers/Tang_Multiple_People_Tracking_CVPR_2017_paper.pdf) | 2017 | 51.3 | 48.8 | 0.5 | Batch | [Public detections](https://arxiv.org/pdf/1610.06136.pdf) | StackNetPose CNN re-id module | Spatio-temporal relations, deep-matching, re-id confidence; detection-based graph lifted-multicut optimization |
## Notebooks
We provide several notebooks which show how multi-object tracking algorithms can be designed and evaluated (a minimal usage sketch follows the table below):

| Notebook name | Description |
| --- | --- |
| [00_webcam.ipynb](./00_webcam.ipynb) | Quick-start notebook which demonstrates how to build an object tracking system using a single video or webcam as input. |
| [01_training_introduction.ipynb](./01_training_introduction.ipynb) | Notebook that explains the basic concepts around model training, inferencing, and evaluation using typical tracking performance metrics. |
| [02_mot_challenge.ipynb](./02_mot_challenge.ipynb) | Notebook that runs model inference on the commonly used MOT Challenge dataset. |
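The snippet below is a minimal, hedged sketch of how the tracking API used in these notebooks can be wired together. It assumes the module layout `utils_cv.tracking.model` and `utils_cv.tracking.plot` as well as the `TrackingLearner.predict` and `write_video` signatures shown further down this page; the video file names are placeholders.

```python
# A minimal sketch (file names are placeholders; module paths are assumptions).
from utils_cv.tracking.model import TrackingLearner
from utils_cv.tracking.plot import write_video

# Initialize the FairMOT-based tracker. With model_path=None the learner
# expects the pre-trained baseline weights (all_dla34.pth) under <root_dir>/models.
tracker = TrackingLearner(dataset=None, model_path=None, arch="dla_34")

# Run inference on a video (or a directory of images). The result is a
# dictionary mapping each frame id to a list of TrackingBbox objects.
results = tracker.predict("./my_video.mp4", conf_thres=0.6, track_buffer=30)

# Overlay the predicted tracks on the input video and save the annotated video.
write_video(results, "./my_video.mp4", "./my_video_tracked.mp4")
```

For evaluation on MOT challenge sequences, `TrackingLearner.eval_mot` wraps the same prediction call, saves the results in MOT format, and reports the py-motmetrics summary.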
## Frequently asked questions

Answers to frequently asked questions, such as "How does the technology work?" or "What data formats are required?", can be found in the [FAQ](FAQ.md) located in this folder. For generic questions, such as "How many training examples do I need?" or "How to monitor GPU usage during training?", see the [FAQ.md](../classification/FAQ.md) in the classification folder.
## Contribution guidelines

See the [contribution guidelines](../../CONTRIBUTING.md) in the root folder.
Binary file not shown.
After | Width: | Height: | Size: 134 KiB
@ -8,9 +8,9 @@ from urllib.parse import urljoin
|
|||
class Urls:
|
||||
base = "https://cvbp.blob.core.windows.net/public/datasets/tracking/"
|
||||
|
||||
cans_path = urljoin(base, "cans.zip")
|
||||
fridge_objects_path = urljoin(base, "odFridgeObjects_FairMOT-Format.zip")
|
||||
carcans_annotations_path = urljoin(base, "carcans_vott-csv-export.zip")
|
||||
carcans_video_path = urljoin(base, "car_cans_8s.mp4")
|
||||
|
||||
@classmethod
|
||||
def all(cls) -> List[str]:
|
||||
|
|
|
@ -2,44 +2,71 @@
|
|||
# Licensed under the MIT License.
|
||||
|
||||
from collections import OrderedDict
|
||||
import numpy as np
|
||||
from functools import partial
|
||||
import os
|
||||
import os.path as osp
|
||||
from pathlib import Path
|
||||
import random
|
||||
import tempfile
|
||||
from typing import Dict, List
|
||||
|
||||
import numpy as np
|
||||
from PIL import Image
|
||||
from torch.utils.data import DataLoader
|
||||
from torchvision.transforms import transforms as T
|
||||
|
||||
from .bbox import TrackingBbox
|
||||
from .references.fairmot.datasets.dataset.jde import JointDataset
|
||||
from .opts import opts
|
||||
from .references.fairmot.datasets.dataset.jde import JointDataset
|
||||
from ..common.gpu import db_num_workers
|
||||
from ..detection.dataset import parse_pascal_voc_anno
|
||||
from ..detection.plot import plot_detections, plot_grid
|
||||
|
||||
|
||||
class TrackingDataset:
|
||||
"""A multi-object tracking dataset."""
|
||||
|
||||
def __init__(
|
||||
self, data_root: str, name: str = "default", batch_size: int = 12,
|
||||
self,
|
||||
root: str,
|
||||
name: str = "default",
|
||||
batch_size: int = 12,
|
||||
im_dir: str = "images",
|
||||
anno_dir: str = "annotations",
|
||||
) -> None:
|
||||
"""
|
||||
Args:
|
||||
data_root: root data directory containing image and annotation subdirectories
|
||||
name: user-friendly name for the dataset
|
||||
batch_size: batch size
|
||||
anno_dir: the name of the annotation subfolder under the root directory
|
||||
im_dir: the name of the image subfolder under the root directory.
|
||||
"""
|
||||
transforms = T.Compose([T.ToTensor()])
|
||||
self.root = root
|
||||
self.name = name
|
||||
self.batch_size = batch_size
|
||||
self.im_dir = Path(im_dir)
|
||||
self.anno_dir = Path(anno_dir)
|
||||
|
||||
# set these to None so that we can use the 'plot_detections' function
|
||||
self.keypoints = None
|
||||
self.mask_paths = None
|
||||
|
||||
# Init FairMOT opt object with all parameter settings
|
||||
opt = opts()
|
||||
|
||||
train_list_path = osp.join(data_root, "{}.train".format(name))
|
||||
with open(train_list_path, "a") as f:
|
||||
for im_name in sorted(os.listdir(osp.join(data_root, "images"))):
|
||||
f.write(osp.join("images", im_name) + "\n")
|
||||
# Read annotations
|
||||
self._read_annos()
|
||||
|
||||
# Save annotation in FairMOT format
|
||||
self._write_fairMOT_format()
|
||||
|
||||
# Create FairMOT dataset object
|
||||
transforms = T.Compose([T.ToTensor()])
|
||||
self.train_data = JointDataset(
|
||||
opt.opt,
|
||||
data_root,
|
||||
{name: train_list_path},
|
||||
opt,
|
||||
self.root,
|
||||
{name: self.fairmot_imlist_path},
|
||||
(opt.input_w, opt.input_h),
|
||||
augment=True,
|
||||
transforms=transforms,
|
||||
|
@ -57,31 +84,155 @@ class TrackingDataset:
|
|||
pin_memory=True,
|
||||
drop_last=True,
|
||||
)
|
||||
|
||||
|
||||
def _read_annos(self) -> None:
|
||||
""" Parses all Pascal VOC formatted annotation files to extract all
|
||||
possible labels. """
|
||||
# All annotation files are assumed to be in the anno_dir directory,
|
||||
# and images in the im_dir directory
|
||||
self.im_filenames = sorted(os.listdir(self.root / self.im_dir))
|
||||
im_paths = [
|
||||
os.path.join(self.root / self.im_dir, s) for s in self.im_filenames
|
||||
]
|
||||
anno_filenames = [
|
||||
os.path.splitext(s)[0] + ".xml" for s in self.im_filenames
|
||||
]
|
||||
|
||||
# Read all annotations
|
||||
self.im_paths = []
|
||||
self.anno_paths = []
|
||||
self.anno_bboxes = []
|
||||
for anno_idx, anno_filename in enumerate(anno_filenames):
|
||||
anno_path = self.root / self.anno_dir / str(anno_filename)
|
||||
|
||||
# Parse annotation file
|
||||
anno_bboxes, _, _ = parse_pascal_voc_anno(anno_path)
|
||||
|
||||
# Store annotation info
|
||||
self.im_paths.append(im_paths[anno_idx])
|
||||
self.anno_paths.append(anno_path)
|
||||
self.anno_bboxes.append(anno_bboxes)
|
||||
assert len(self.im_paths) == len(self.anno_paths)
|
||||
|
||||
# Get list of all labels
|
||||
labels = []
|
||||
for anno_bboxes in self.anno_bboxes:
|
||||
for anno_bbox in anno_bboxes:
|
||||
if anno_bbox.label_name is not None:
|
||||
labels.append(anno_bbox.label_name)
|
||||
self.labels = list(set(labels))
|
||||
|
||||
# For each bounding box label name, also store its integer representation
|
||||
for anno_bboxes in self.anno_bboxes:
|
||||
for anno_bbox in anno_bboxes:
|
||||
if anno_bbox.label_name is None:
|
||||
# background rectangle is assigned id 0 by design
|
||||
anno_bbox.label_idx = 0
|
||||
else:
|
||||
label = self.labels.index(anno_bbox.label_name) + 1
|
||||
anno_bbox.label_idx = label
|
||||
|
||||
# Get image sizes. Note that Image.open() only loads the image header,
|
||||
# not the full images and is hence fast.
|
||||
self.im_sizes = np.array([Image.open(p).size for p in self.im_paths])
|
||||
|
||||
def _write_fairMOT_format(self) -> None:
|
||||
""" Write bounding box information in the format FairMOT expects for training."""
|
||||
fairmot_annos_dir = os.path.join(self.root, "labels_with_ids")
|
||||
os.makedirs(fairmot_annos_dir, exist_ok=True)
|
||||
|
||||
# Create an annotation .txt file in FairMOT format for each image
|
||||
for filename, bboxes, im_size in zip(
|
||||
self.im_filenames, self.anno_bboxes, self.im_sizes
|
||||
):
|
||||
im_width = float(im_size[0])
|
||||
im_height = float(im_size[1])
|
||||
fairmot_anno_path = os.path.join(
|
||||
fairmot_annos_dir, filename[:-4] + ".txt"
|
||||
)
|
||||
|
||||
with open(fairmot_anno_path, "w") as f:
|
||||
for bbox in bboxes:
|
||||
tid_curr = bbox.label_idx - 1
|
||||
x = round(bbox.left + bbox.width() / 2.0)
|
||||
y = round(bbox.top + bbox.height() / 2.0)
|
||||
w = bbox.width()
|
||||
h = bbox.height()
|
||||
|
||||
label_str = "0 {:d} {:.6f} {:.6f} {:.6f} {:.6f}\n".format(
|
||||
tid_curr,
|
||||
x / im_width,
|
||||
y / im_height,
|
||||
w / im_width,
|
||||
h / im_height,
|
||||
)
|
||||
f.write(label_str)
|
||||
|
||||
# write all image filenames into a <name>.train file required by FairMOT
|
||||
self.fairmot_imlist_path = osp.join(
|
||||
self.root, "{}.train".format(self.name)
|
||||
)
|
||||
with open(self.fairmot_imlist_path, "w") as f:
|
||||
for im_filename in sorted(self.im_filenames):
|
||||
f.write(osp.join(self.im_dir, im_filename) + "\n")
|
||||
|
||||
def show_ims(self, rows: int = 1, cols: int = 3, seed: int = None) -> None:
|
||||
""" Show a set of images.
|
||||
|
||||
Args:
|
||||
rows: the number of rows of images to display
|
||||
cols: the number of columns to display; NOTE: use 3 for the best looking grid
|
||||
seed: random seed for selecting images
|
||||
|
||||
Returns None but displays a grid of annotated images.
|
||||
"""
|
||||
if seed:
|
||||
random.seed(seed or self.seed)
|
||||
|
||||
def helper(im_paths):
|
||||
idx = random.randrange(len(im_paths))
|
||||
detection = {
|
||||
"idx": idx,
|
||||
"im_path": im_paths[idx],
|
||||
"det_bboxes": [],
|
||||
}
|
||||
return detection, self, None, None
|
||||
|
||||
plot_grid(
|
||||
plot_detections,
|
||||
partial(helper, self.im_paths),
|
||||
rows=rows,
|
||||
cols=cols,
|
||||
)
|
||||
|
||||
|
||||
def boxes_to_mot(results: Dict[int, List[TrackingBbox]]) -> None:
|
||||
"""
|
||||
Convert the predicted tracks to MOT challenge format ["frame", "id", "left", "top", "width", "height"].
|
||||
|
||||
|
||||
Args:
|
||||
results: dictionary mapping frame id to a list of predicted TrackingBboxes
|
||||
txt_path: path to which results are saved in csv file
|
||||
|
||||
"""
|
||||
|
||||
"""
|
||||
# convert results to dataframe in MOT challenge format
|
||||
preds = OrderedDict(sorted(results.items()))
|
||||
bboxes = [
|
||||
[
|
||||
bb.frame_id,
|
||||
bb.frame_id + 1,
|
||||
bb.track_id,
|
||||
bb.top,
|
||||
bb.left,
|
||||
bb.bottom - bb.top,
|
||||
bb.top,
|
||||
bb.right - bb.left,
|
||||
1, -1, -1, -1,
|
||||
bb.bottom - bb.top,
|
||||
1,
|
||||
-1,
|
||||
-1,
|
||||
-1,
|
||||
]
|
||||
for _, v in preds.items()
|
||||
for bb in v
|
||||
]
|
||||
bboxes_formatted = np.array(bboxes)
|
||||
|
||||
return bboxes_formatted
|
||||
|
||||
return bboxes_formatted
|
||||
|
|
|
@ -1,27 +1,26 @@
|
|||
# Copyright (c) Microsoft Corporation. All rights reserved.
|
||||
# Licensed under the MIT License.
|
||||
|
||||
import argparse
|
||||
from collections import OrderedDict, defaultdict
|
||||
from collections import defaultdict
|
||||
from copy import deepcopy
|
||||
import glob
|
||||
import requests
|
||||
import os
|
||||
import os.path as osp
|
||||
import tempfile
|
||||
from typing import Dict, List, Tuple
|
||||
from typing import Dict, List, Optional, Tuple
|
||||
|
||||
import cv2
|
||||
import matplotlib.pyplot as plt
|
||||
import motmetrics as mm
|
||||
import numpy as np
|
||||
|
||||
import torch
|
||||
import torch.cuda as cuda
|
||||
import torch.nn as nn
|
||||
from torch.utils.data import DataLoader
|
||||
|
||||
import cv2
|
||||
import numpy as np
|
||||
import pandas as pd
|
||||
import matplotlib.pyplot as plt
|
||||
import motmetrics as mm
|
||||
from .bbox import TrackingBbox
|
||||
from ..common.gpu import torch_device
|
||||
from .dataset import TrackingDataset, boxes_to_mot
|
||||
from .opts import opts
|
||||
|
||||
from .references.fairmot.datasets.dataset.jde import LoadImages, LoadVideo
|
||||
from .references.fairmot.models.model import (
|
||||
|
@ -33,54 +32,6 @@ from .references.fairmot.tracker.multitracker import JDETracker
|
|||
from .references.fairmot.tracking_utils.evaluation import Evaluator
|
||||
from .references.fairmot.trains.train_factory import train_factory
|
||||
|
||||
from .bbox import TrackingBbox
|
||||
from .dataset import TrackingDataset, boxes_to_mot
|
||||
from .opts import opts
|
||||
from .plot import draw_boxes, assign_colors
|
||||
from ..common.gpu import torch_device
|
||||
|
||||
BASELINE_URL = (
|
||||
"https://drive.google.com/open?id=1udpOPum8fJdoEQm6n0jsIgMMViOMFinu"
|
||||
)
|
||||
|
||||
|
||||
def _download_baseline(url, destination) -> None:
|
||||
"""
|
||||
Download the baseline model .pth file to the destination.
|
||||
|
||||
Args:
|
||||
url: a Google Drive url of the form "https://drive.google.com/open?id={id}"
|
||||
destination: path to save the model to
|
||||
|
||||
Implementation based on https://stackoverflow.com/questions/38511444/python-download-files-from-google-drive-using-url
|
||||
"""
|
||||
|
||||
def get_confirm_token(response):
|
||||
for key, value in response.cookies.items():
|
||||
if key.startswith("download_warning"):
|
||||
return value
|
||||
|
||||
return None
|
||||
|
||||
def save_response_content(response, destination):
|
||||
CHUNK_SIZE = 32768
|
||||
|
||||
with open(destination, "wb") as f:
|
||||
for chunk in response.iter_content(CHUNK_SIZE):
|
||||
if chunk: # filter out keep-alive new chunks
|
||||
f.write(chunk)
|
||||
|
||||
session = requests.Session()
|
||||
id = url.split("id=")[-1]
|
||||
response = session.get(url, params={"id": id}, stream=True)
|
||||
token = get_confirm_token(response)
|
||||
if token:
|
||||
response = session.get(
|
||||
url, params={"id": id, "confirm": token}, stream=True
|
||||
)
|
||||
|
||||
save_response_content(response, destination)
|
||||
|
||||
|
||||
def _get_gpu_str():
|
||||
if cuda.is_available():
|
||||
|
@ -90,47 +41,80 @@ def _get_gpu_str():
|
|||
return "-1" # cpu
|
||||
|
||||
|
||||
def write_video(
|
||||
results: Dict[int, List[TrackingBbox]], input_video: str, output_video: str
|
||||
) -> None:
|
||||
"""
|
||||
Plot the predicted tracks on the input video. Write the output to {output_path}.
|
||||
|
||||
Args:
|
||||
results: dictionary mapping frame id to a list of predicted TrackingBboxes
|
||||
input_video: path to the input video
|
||||
output_video: path to write out the output video
|
||||
"""
|
||||
results = OrderedDict(sorted(results.items()))
|
||||
# read video and initialize new tracking video
|
||||
def _get_frame(input_video: str, frame_id: int):
|
||||
video = cv2.VideoCapture()
|
||||
video.open(input_video)
|
||||
video.set(cv2.CAP_PROP_POS_FRAMES, frame_id)
|
||||
_, im = video.read()
|
||||
im = cv2.cvtColor(im, cv2.COLOR_BGR2RGB)
|
||||
return im
|
||||
|
||||
image_width = int(video.get(cv2.CAP_PROP_FRAME_WIDTH))
|
||||
image_height = int(video.get(cv2.CAP_PROP_FRAME_HEIGHT))
|
||||
fourcc = cv2.VideoWriter_fourcc(*"MP4V")
|
||||
frame_rate = int(video.get(cv2.CAP_PROP_FPS))
|
||||
writer = cv2.VideoWriter(
|
||||
output_video, fourcc, frame_rate, (image_width, image_height)
|
||||
|
||||
def savetxt_results(
|
||||
results: Dict[int, List[TrackingBbox]],
|
||||
exp_name: str,
|
||||
root_path: str,
|
||||
result_filename: str,
|
||||
) -> str:
|
||||
"""Save tracking results to txt in provided path.
|
||||
|
||||
Args:
|
||||
results: prediction results from predict() function, i.e. Dict[int, List[TrackingBbox]]
|
||||
exp_name: subfolder for each experiment
|
||||
root_path: root path for results saved
|
||||
result_filename: saved prediction results txt file; end with '.txt'
|
||||
Returns:
|
||||
result_path: saved prediction results txt file path
|
||||
"""
|
||||
|
||||
# Convert prediction results to mot format
|
||||
bboxes_mot = boxes_to_mot(results)
|
||||
|
||||
# Save results
|
||||
result_path = osp.join(root_path, exp_name, result_filename)
|
||||
np.savetxt(result_path, bboxes_mot, delimiter=",", fmt="%s")
|
||||
|
||||
return result_path
|
||||
|
||||
|
||||
def evaluate_mot(gt_root_path: str, exp_name: str, result_path: str) -> object:
|
||||
""" eval code that calls on 'motmetrics' package in referenced FairMOT script, to produce MOT metrics on inference, given ground-truth.
|
||||
Args:
|
||||
gt_root_path: path of dataset containing GT annotations in MOTchallenge format (xywh)
|
||||
exp_name: subfolder for each experiment
|
||||
result_path: saved prediction results txt file path
|
||||
Returns:
|
||||
mot_accumulator: MOTAccumulator object from pymotmetrics package
|
||||
"""
|
||||
# Implementation inspired from code found here: https://github.com/ifzhang/FairMOT/blob/master/src/track.py
|
||||
evaluator = Evaluator(gt_root_path, exp_name, "mot")
|
||||
|
||||
# Run evaluation using pymotmetrics package
|
||||
mot_accumulator = evaluator.eval_file(result_path)
|
||||
|
||||
return mot_accumulator
|
||||
|
||||
|
||||
def mot_summary(accumulators: list, exp_names: list) -> str:
|
||||
"""Given a list of MOTAccumulators, get total summary by method in 'motmetrics', containing metrics scores
|
||||
|
||||
Args:
|
||||
accumulators: list of MOTAccumulators
|
||||
exp_names: list of experiment names (str) corresponds to MOTAccumulators
|
||||
Returns:
|
||||
strsummary: str output by method in 'motmetrics', containing metrics scores
|
||||
"""
|
||||
metrics = mm.metrics.motchallenge_metrics
|
||||
mh = mm.metrics.create()
|
||||
|
||||
summary = Evaluator.get_summary(accumulators, exp_names, metrics)
|
||||
strsummary = mm.io.render_summary(
|
||||
summary,
|
||||
formatters=mh.formatters,
|
||||
namemap=mm.io.motchallenge_metric_names,
|
||||
)
|
||||
|
||||
# assign bbox color per id
|
||||
unique_ids = list(
|
||||
set([bb.track_id for frame in results.values() for bb in frame])
|
||||
)
|
||||
color_map = assign_colors(unique_ids)
|
||||
|
||||
# create images and add to video writer, adapted from https://github.com/ZQPei/deep_sort_pytorch
|
||||
frame_idx = 0
|
||||
while video.grab():
|
||||
_, cur_image = video.retrieve()
|
||||
cur_tracks = results[frame_idx]
|
||||
if len(cur_tracks) > 0:
|
||||
cur_image = draw_boxes(cur_image, cur_tracks, color_map)
|
||||
writer.write(cur_image)
|
||||
frame_idx += 1
|
||||
|
||||
print(f"Output saved to {output_video}.")
|
||||
return strsummary
|
||||
|
||||
|
||||
class TrackingLearner(object):
|
||||
|
@ -138,10 +122,10 @@ class TrackingLearner(object):
|
|||
|
||||
def __init__(
|
||||
self,
|
||||
dataset: TrackingDataset,
|
||||
model_path: str,
|
||||
dataset: Optional[TrackingDataset] = None,
|
||||
model_path: Optional[str] = None,
|
||||
arch: str = "dla_34",
|
||||
head_conv: int = None,
|
||||
head_conv: int = -1,
|
||||
) -> None:
|
||||
"""
|
||||
Initialize learner object.
|
||||
|
@ -149,8 +133,8 @@ class TrackingLearner(object):
|
|||
Defaults to the FairMOT model.
|
||||
|
||||
Args:
|
||||
dataset: the dataset
|
||||
model_path: path to save model
|
||||
dataset: optional dataset (required for training)
|
||||
model_path: optional path to pretrained model (defaults to all_dla34.pth)
|
||||
arch: the model architecture
|
||||
Supported architectures: resdcn_34, resdcn_50, resfpndcn_34, dla_34, hrnet_32
|
||||
head_conv: conv layer channels for output head. None maps to the default setting.
|
||||
|
@ -158,25 +142,27 @@ class TrackingLearner(object):
|
|||
"""
|
||||
self.opt = opts()
|
||||
self.opt.arch = arch
|
||||
self.opt.head_conv = head_conv if head_conv else -1
|
||||
self.opt.gpus = _get_gpu_str()
|
||||
self.opt.set_head_conv(head_conv)
|
||||
self.opt.set_gpus(_get_gpu_str())
|
||||
self.opt.device = torch_device()
|
||||
|
||||
self.dataset = dataset
|
||||
self.model = self.init_model()
|
||||
self.model_path = model_path
|
||||
self.model = None
|
||||
self._init_model(model_path)
|
||||
|
||||
def init_model(self) -> nn.Module:
|
||||
def _init_model(self, model_path) -> None:
|
||||
"""
|
||||
Download and initialize the baseline FairMOT model.
|
||||
"""
|
||||
model_dir = osp.join(self.opt.root_dir, "models")
|
||||
baseline_path = osp.join(model_dir, "all_dla34.pth")
|
||||
# os.makedirs(model_dir, exist_ok=True)
|
||||
# _download_baseline(BASELINE_URL, baseline_path)
|
||||
self.opt.load_model = baseline_path
|
||||
Initialize the model.
|
||||
|
||||
return create_model(self.opt.arch, self.opt.heads, self.opt.head_conv)
|
||||
Args:
|
||||
model_path: optional path to pretrained model (defaults to all_dla34.pth)
|
||||
"""
|
||||
if not model_path:
|
||||
model_path = osp.join(self.opt.root_dir, "models", "all_dla34.pth")
|
||||
assert osp.isfile(
|
||||
model_path
|
||||
), f"Model weights not found at {model_path}"
|
||||
|
||||
self.opt.load_model = model_path
|
||||
|
||||
def fit(
|
||||
self, lr: float = 1e-4, lr_step: str = "20,27", num_epochs: int = 30
|
||||
|
@ -191,87 +177,91 @@ class TrackingLearner(object):
|
|||
|
||||
Raise:
|
||||
Exception if dataset is undefined
|
||||
|
||||
|
||||
Implementation inspired from code found here: https://github.com/ifzhang/FairMOT/blob/master/src/train.py
|
||||
"""
|
||||
if not self.dataset:
|
||||
raise Exception("No dataset provided")
|
||||
if type(lr_step) is not list:
|
||||
lr_step = [lr_step]
|
||||
lr_step = [int(x) for x in lr_step]
|
||||
|
||||
opt_fit = deepcopy(self.opt) # copy opt to avoid bug
|
||||
opt_fit.lr = lr
|
||||
opt_fit.lr_step = lr_step
|
||||
opt_fit.num_epochs = num_epochs
|
||||
# update parameters
|
||||
self.opt.lr = lr
|
||||
self.opt.lr_step = lr_step
|
||||
self.opt.num_epochs = num_epochs
|
||||
opt = deepcopy(self.opt) #to avoid fairMOT over-writing opt
|
||||
|
||||
# update dataset options
|
||||
opt_fit.update_dataset_info_and_set_heads(self.dataset.train_data)
|
||||
opt.update_dataset_info_and_set_heads(self.dataset.train_data)
|
||||
|
||||
# initialize dataloader
|
||||
train_loader = self.dataset.train_dl
|
||||
|
||||
self.optimizer = torch.optim.Adam(self.model.parameters(), opt_fit.lr)
|
||||
self.model = create_model(
|
||||
opt.arch, opt.heads, opt.head_conv
|
||||
)
|
||||
self.model = load_model(self.model, opt.load_model)
|
||||
self.optimizer = torch.optim.Adam(self.model.parameters(), opt.lr)
|
||||
start_epoch = 0
|
||||
print(f"Loading {opt_fit.load_model}")
|
||||
self.model = load_model(self.model, opt_fit.load_model)
|
||||
|
||||
Trainer = train_factory[opt_fit.task]
|
||||
trainer = Trainer(opt_fit.opt, self.model, self.optimizer)
|
||||
trainer.set_device(opt_fit.gpus, opt_fit.chunk_sizes, opt_fit.device)
|
||||
|
||||
Trainer = train_factory[opt.task]
|
||||
trainer = Trainer(opt, self.model, self.optimizer)
|
||||
trainer.set_device(opt.gpus, opt.chunk_sizes, opt.device)
|
||||
|
||||
# initialize loss vars
|
||||
self.losses_dict = defaultdict(list)
|
||||
|
||||
|
||||
# training loop
|
||||
for epoch in range(
|
||||
start_epoch + 1, start_epoch + opt_fit.num_epochs + 1
|
||||
start_epoch + 1, start_epoch + opt.num_epochs + 1
|
||||
):
|
||||
print(
|
||||
"=" * 5,
|
||||
f" Epoch: {epoch}/{start_epoch + opt_fit.num_epochs} ",
|
||||
f" Epoch: {epoch}/{start_epoch + opt.num_epochs} ",
|
||||
"=" * 5,
|
||||
)
|
||||
self.epoch = epoch
|
||||
log_dict_train, _ = trainer.train(epoch, train_loader)
|
||||
for k, v in log_dict_train.items():
|
||||
print(f"{k}: {v}")
|
||||
if epoch in opt_fit.lr_step:
|
||||
lr = opt_fit.lr * (0.1 ** (opt_fit.lr_step.index(epoch) + 1))
|
||||
for param_group in optimizer.param_groups:
|
||||
if k == "time":
|
||||
print(f"{k}:{v} min")
|
||||
else:
|
||||
print(f"{k}: {v}")
|
||||
if epoch in opt.lr_step:
|
||||
lr = opt.lr * (0.1 ** (opt.lr_step.index(epoch) + 1))
|
||||
for param_group in self.optimizer.param_groups:
|
||||
param_group["lr"] = lr
|
||||
|
||||
# store losses in each epoch
|
||||
|
||||
# store losses in each epoch
|
||||
for k, v in log_dict_train.items():
|
||||
if k in ['loss', 'hm_loss', 'wh_loss', 'off_loss', 'id_loss']:
|
||||
if k in ["loss", "hm_loss", "wh_loss", "off_loss", "id_loss"]:
|
||||
self.losses_dict[k].append(v)
|
||||
|
||||
# save after training because at inference-time FairMOT src reads model weights from disk
|
||||
self.save(self.model_path)
|
||||
def plot_training_losses(self, figsize: Tuple[int, int] = (10, 5)) -> None:
|
||||
"""
|
||||
Plot training loss.
|
||||
|
||||
def plot_training_losses(self, figsize: Tuple[int, int] = (10, 5))->None:
|
||||
'''
|
||||
Plots training loss from calling `fit`
|
||||
|
||||
Args:
|
||||
figsize (optional): width and height wanted for figure of training-loss plot
|
||||
|
||||
'''
|
||||
|
||||
"""
|
||||
fig = plt.figure(figsize=figsize)
|
||||
ax1 = fig.add_subplot(1, 1, 1)
|
||||
|
||||
ax1.set_xlim([0, len(self.losses_dict['loss']) - 1])
|
||||
ax1.set_xticks(range(0, len(self.losses_dict['loss'])))
|
||||
|
||||
ax1.set_xlim([0, len(self.losses_dict["loss"]) - 1])
|
||||
ax1.set_xticks(range(0, len(self.losses_dict["loss"])))
|
||||
ax1.set_xlabel("epochs")
|
||||
ax1.set_ylabel("losses")
|
||||
|
||||
ax1.plot(self.losses_dict['loss'], c="r", label='loss')
|
||||
ax1.plot(self.losses_dict['hm_loss'], c="y", label='hm_loss')
|
||||
ax1.plot(self.losses_dict['wh_loss'], c="g", label='wh_loss')
|
||||
ax1.plot(self.losses_dict['off_loss'], c="b", label='off_loss')
|
||||
ax1.plot(self.losses_dict['id_loss'], c="m", label='id_loss')
|
||||
|
||||
plt.legend(loc='upper right')
|
||||
ax1.plot(self.losses_dict["loss"], c="r", label="loss")
|
||||
ax1.plot(self.losses_dict["hm_loss"], c="y", label="hm_loss")
|
||||
ax1.plot(self.losses_dict["wh_loss"], c="g", label="wh_loss")
|
||||
ax1.plot(self.losses_dict["off_loss"], c="b", label="off_loss")
|
||||
ax1.plot(self.losses_dict["id_loss"], c="m", label="id_loss")
|
||||
|
||||
plt.legend(loc="upper right")
|
||||
fig.suptitle("Training losses over epochs")
|
||||
|
||||
|
||||
|
||||
def save(self, path) -> None:
|
||||
"""
|
||||
Save the model to a specified path.
|
||||
|
@ -282,95 +272,134 @@ class TrackingLearner(object):
|
|||
save_model(path, self.epoch, self.model, self.optimizer)
|
||||
print(f"Model saved to {path}")
|
||||
|
||||
def evaluate(self,
|
||||
results: Dict[int, List[TrackingBbox]],
|
||||
gt_root_path: str) -> str:
|
||||
|
||||
""" eval code that calls on 'motmetrics' package in referenced FairMOT script, to produce MOT metrics on inference, given ground-truth.
|
||||
def evaluate(
|
||||
self, results: Dict[int, List[TrackingBbox]], gt_root_path: str
|
||||
) -> str:
|
||||
|
||||
"""
|
||||
Evaluate performance wrt MOTA, MOTP, track quality measures, global ID measures, and more,
|
||||
as computed by py-motmetrics on a single experiment. By default, use 'single_vid' as exp_name.
|
||||
|
||||
Args:
|
||||
results: prediction results from predict() function, i.e. Dict[int, List[TrackingBbox]]
|
||||
results: prediction results from predict() function, i.e. Dict[int, List[TrackingBbox]]
|
||||
gt_root_path: path of dataset containing GT annotations in MOTchallenge format (xywh)
|
||||
Returns:
|
||||
strsummary: str output by method in 'motmetrics' package, containing metrics scores
|
||||
strsummary: str output by method in 'motmetrics' package, containing metrics scores
|
||||
"""
|
||||
|
||||
#Implementation inspired from code found here: https://github.com/ifzhang/FairMOT/blob/master/src/track.py
|
||||
evaluator = Evaluator(gt_root_path, "single_vid", "mot")
|
||||
|
||||
with tempfile.TemporaryDirectory() as tmpdir1:
|
||||
os.makedirs(osp.join(tmpdir1,'results'))
|
||||
result_filename = osp.join(tmpdir1,'results', 'results.txt')
|
||||
|
||||
# Save results im MOT format for evaluation
|
||||
bboxes_mot = boxes_to_mot(results)
|
||||
np.savetxt(result_filename, bboxes_mot, delimiter=",", fmt="%s")
|
||||
|
||||
# Run evaluation using pymotmetrics package
|
||||
accs=[evaluator.eval_file(result_filename)]
|
||||
|
||||
# get summary
|
||||
metrics = mm.metrics.motchallenge_metrics
|
||||
mh = mm.metrics.create()
|
||||
|
||||
summary = Evaluator.get_summary(accs, ("single_vid",), metrics)
|
||||
strsummary = mm.io.render_summary(
|
||||
summary,
|
||||
formatters=mh.formatters,
|
||||
namemap=mm.io.motchallenge_metric_names
|
||||
# Implementation inspired from code found here: https://github.com/ifzhang/FairMOT/blob/master/src/track.py
|
||||
result_path = savetxt_results(
|
||||
results, "single_vid", gt_root_path, "results.txt"
|
||||
)
|
||||
print(strsummary)
|
||||
|
||||
# Save tracking results in tmp
|
||||
mot_accumulator = evaluate_mot(gt_root_path, "single_vid", result_path)
|
||||
strsummary = mot_summary([mot_accumulator], ("single_vid",))
|
||||
return strsummary
|
||||
|
||||
|
||||
def eval_mot(
|
||||
self,
|
||||
conf_thres: float,
|
||||
track_buffer: int,
|
||||
data_root: str,
|
||||
seqs: list,
|
||||
result_root: str,
|
||||
exp_name: str,
|
||||
run_eval: bool = True,
|
||||
) -> str:
|
||||
"""
|
||||
Call the prediction function, save the tracking results to a txt file, and provide the evaluation results in motmetrics format.
|
||||
Args:
|
||||
conf_thres: confidence thresh for tracking
|
||||
track_buffer: tracking buffer
|
||||
data_root: data root path
|
||||
seqs: list of video sequences subfolder names under MOT challenge data
|
||||
result_root: tracking result path
|
||||
exp_name: experiment name
|
||||
run_eval: if we evaluate on provided data
|
||||
Returns:
|
||||
strsummary: str output by method in 'motmetrics' package, containing metrics scores
|
||||
"""
|
||||
accumulators = []
|
||||
eval_path = osp.join(result_root, exp_name)
|
||||
if not osp.exists(eval_path):
|
||||
os.makedirs(eval_path)
|
||||
|
||||
# Loop over all video sequences
|
||||
for seq in seqs:
|
||||
result_filename = "{}.txt".format(seq)
|
||||
im_path = osp.join(data_root, seq, "img1")
|
||||
result_path = osp.join(result_root, exp_name, result_filename)
|
||||
with open(osp.join(data_root, seq, "seqinfo.ini")) as seqinfo_file:
|
||||
meta_info = seqinfo_file.read()
|
||||
|
||||
# frame_rate is set from seqinfo.ini by frameRate
|
||||
frame_rate = int(
|
||||
meta_info[
|
||||
meta_info.find("frameRate")
|
||||
+ 10 : meta_info.find("\nseqLength")
|
||||
]
|
||||
)
|
||||
|
||||
# Run model inference
|
||||
if not osp.exists(result_path):
|
||||
eval_results = self.predict(
|
||||
im_or_video_path=im_path,
|
||||
conf_thres=conf_thres,
|
||||
track_buffer=track_buffer,
|
||||
frame_rate=frame_rate,
|
||||
)
|
||||
result_path = savetxt_results(
|
||||
eval_results, exp_name, result_root, result_filename
|
||||
)
|
||||
print(f"Saved tracking results to {result_path}")
|
||||
else:
|
||||
print(f"Loaded tracking results from {result_path}")
|
||||
|
||||
# Run evaluation
|
||||
if run_eval:
|
||||
print(f"Evaluate seq: {seq}")
|
||||
mot_accumulator = evaluate_mot(data_root, seq, result_path)
|
||||
accumulators.append(mot_accumulator)
|
||||
|
||||
if run_eval:
|
||||
strsummary = mot_summary(accumulators, seqs)
|
||||
return strsummary
|
||||
else:
|
||||
return None
|
||||
|
||||
def predict(
|
||||
self,
|
||||
im_or_video_path: str,
|
||||
conf_thres: float = 0.6,
|
||||
det_thres: float = 0.3,
|
||||
nms_thres: float = 0.4,
|
||||
track_buffer: int = 30,
|
||||
min_box_area: float = 200,
|
||||
im_size: Tuple[float, float] = (None, None),
|
||||
frame_rate: int = 30,
|
||||
) -> Dict[int, List[TrackingBbox]]:
|
||||
"""
|
||||
Performs inferencing on an image or video path.
|
||||
Run inference on an image or video path.
|
||||
|
||||
Args:
|
||||
im_or_video_path: path to image(s) or video. Supports jpg, jpeg, png, tif formats for images.
|
||||
Supports mp4, avi formats for video.
|
||||
Supports mp4, avi formats for video.
|
||||
conf_thres: confidence thresh for tracking
|
||||
det_thres: confidence thresh for detection
|
||||
nms_thres: iou thresh for nms
|
||||
track_buffer: tracking buffer
|
||||
min_box_area: filter out tiny boxes
|
||||
im_size: (input height, input_weight)
|
||||
frame_rate: frame rate
|
||||
|
||||
Returns a list of TrackingBboxes
|
||||
|
||||
Implementation inspired from code found here: https://github.com/ifzhang/FairMOT/blob/master/src/track.py
|
||||
"""
|
||||
opt_pred = deepcopy(self.opt) # copy opt to avoid bug
|
||||
opt_pred.conf_thres = conf_thres
|
||||
opt_pred.det_thres = det_thres
|
||||
opt_pred.nms_thres = nms_thres
|
||||
opt_pred.track_buffer = track_buffer
|
||||
opt_pred.min_box_area = min_box_area
|
||||
|
||||
input_h, input_w = im_size
|
||||
input_height = input_h if input_h else -1
|
||||
input_width = input_w if input_w else -1
|
||||
opt_pred.update_dataset_res(input_height, input_width)
|
||||
self.opt.conf_thres = conf_thres
|
||||
self.opt.track_buffer = track_buffer
|
||||
self.opt.min_box_area = min_box_area
|
||||
opt = deepcopy(self.opt) #to avoid fairMOT over-writing opt
|
||||
|
||||
# initialize tracker
|
||||
opt_pred.load_model = self.model_path
|
||||
tracker = JDETracker(opt_pred.opt, frame_rate=frame_rate)
|
||||
tracker = JDETracker(opt, frame_rate=frame_rate, model=self.model)
|
||||
|
||||
# initialize dataloader
|
||||
dataloader = self._get_dataloader(
|
||||
im_or_video_path, opt_pred.input_h, opt_pred.input_w
|
||||
)
|
||||
dataloader = self._get_dataloader(im_or_video_path)
|
||||
|
||||
frame_id = 0
|
||||
out = {}
|
||||
|
@ -384,9 +413,9 @@ class TrackingLearner(object):
|
|||
tlbr = t.tlbr
|
||||
tid = t.track_id
|
||||
vertical = tlwh[2] / tlwh[3] > 1.6
|
||||
if tlwh[2] * tlwh[3] > opt_pred.min_box_area and not vertical:
|
||||
if tlwh[2] * tlwh[3] > opt.min_box_area and not vertical:
|
||||
bb = TrackingBbox(
|
||||
tlbr[1], tlbr[0], tlbr[3], tlbr[2], frame_id, tid
|
||||
tlbr[0], tlbr[1], tlbr[2], tlbr[3], frame_id, tid
|
||||
)
|
||||
online_bboxes.append(bb)
|
||||
out[frame_id] = online_bboxes
|
||||
|
@ -394,11 +423,9 @@ class TrackingLearner(object):
|
|||
|
||||
return out
|
||||
|
||||
def _get_dataloader(
|
||||
self, im_or_video_path: str, input_h, input_w
|
||||
) -> DataLoader:
|
||||
def _get_dataloader(self, im_or_video_path: str) -> DataLoader:
|
||||
"""
|
||||
Creates a dataloader from images or video in the given path.
|
||||
Create a dataloader from images or video in the given path.
|
||||
|
||||
Args:
|
||||
im_or_video_path: path to a root directory of images, or single video or image file.
|
||||
|
@ -429,18 +456,18 @@ class TrackingLearner(object):
|
|||
)
|
||||
> 0
|
||||
):
|
||||
return LoadImages(im_or_video_path, img_size=(input_w, input_h))
|
||||
return LoadImages(im_or_video_path)
|
||||
# if path is to a single video file
|
||||
elif (
|
||||
osp.isfile(im_or_video_path)
|
||||
and osp.splitext(im_or_video_path)[1] in video_format
|
||||
):
|
||||
return LoadVideo(im_or_video_path, img_size=(input_w, input_h))
|
||||
return LoadVideo(im_or_video_path)
|
||||
# if path is to a single image file
|
||||
elif (
|
||||
osp.isfile(im_or_video_path)
|
||||
and osp.splitext(im_or_video_path)[1] in im_format
|
||||
):
|
||||
return LoadImages(im_or_video_path, img_size=(input_w, input_h))
|
||||
return LoadImages(im_or_video_path)
|
||||
else:
|
||||
raise Exception("Image or video format not supported")
|
||||
|
|
|
@ -8,21 +8,21 @@ import os.path as osp
|
|||
|
||||
class opts(object):
|
||||
"""
|
||||
Defines options for experiment settings, system settings, logging, model params,
|
||||
Defines options for experiment settings, system settings, logging, model params,
|
||||
input config, training config, testing config, and tracking params.
|
||||
"""
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
load_model: str = "",
|
||||
gpus: str = "0, 1",
|
||||
gpus=[0, 1],
|
||||
save_all: bool = False,
|
||||
arch: str = "dla_34",
|
||||
head_conv: int = -1,
|
||||
input_h: int = -1,
|
||||
input_w: int = -1,
|
||||
lr: float = 1e-4,
|
||||
lr_step: str = "20,27",
|
||||
lr_step=[20, 27],
|
||||
num_epochs: int = 30,
|
||||
num_iters: int = -1,
|
||||
val_intervals: int = 5,
|
||||
|
@ -34,13 +34,62 @@ class opts(object):
|
|||
reid_dim: int = 512,
|
||||
root_dir: str = os.getcwd(),
|
||||
) -> None:
|
||||
self._init_opt()
|
||||
# Set defaults for parameters which are less important
|
||||
self.task = "mot"
|
||||
self.dataset = "jde"
|
||||
self.resume = False
|
||||
self.exp_id = "default"
|
||||
self.test = False
|
||||
self.num_workers = 8
|
||||
self.not_cuda_benchmark = False
|
||||
self.seed = 317
|
||||
self.print_iter = 0
|
||||
self.hide_data_time = False
|
||||
self.metric = "loss"
|
||||
self.vis_thresh = 0.5
|
||||
self.pad = 31
|
||||
self.num_stacks = 1
|
||||
self.down_ratio = 4
|
||||
self.input_res = -1
|
||||
self.num_iters = -1
|
||||
self.trainval = False
|
||||
self.K = 128
|
||||
self.not_prefetch_test = True
|
||||
self.keep_res = False
|
||||
self.fix_res = not self.keep_res
|
||||
self.test_mot16 = False
|
||||
self.val_mot15 = False
|
||||
self.test_mot15 = False
|
||||
self.val_mot16 = False
|
||||
self.test_mot16 = False
|
||||
self.val_mot17 = False
|
||||
self.val_mot20 = False
|
||||
self.test_mot20 = False
|
||||
self.input_video = ""
|
||||
self.output_format = "video"
|
||||
self.output_root = ""
|
||||
self.data_cfg = ""
|
||||
self.data_dir = ""
|
||||
self.mse_loss = False
|
||||
self.hm_gauss = 8
|
||||
self.reg_loss = "l1"
|
||||
self.hm_weight = 1
|
||||
self.off_weight = 1
|
||||
self.wh_weight = 0.1
|
||||
self.id_loss = "ce"
|
||||
self.id_weight = 1
|
||||
self.norm_wh = False
|
||||
self.dense_wh = False
|
||||
self.cat_spec_wh = False
|
||||
self.not_reg_offset = False
|
||||
self.reg_offset = not self.not_reg_offset
|
||||
|
||||
# Set/overwrite defaults for parameters which are more important
|
||||
self.load_model = load_model
|
||||
self.gpus = gpus
|
||||
self.save_all = save_all
|
||||
self.arch = arch
|
||||
self.head_conv = head_conv
|
||||
self.set_head_conv(head_conv)
|
||||
self.input_h = input_h
|
||||
self.input_w = input_w
|
||||
self.lr = lr
|
||||
|
@ -53,79 +102,33 @@ class opts(object):
|
|||
self.track_buffer = track_buffer
|
||||
self.min_box_area = min_box_area
|
||||
self.reid_dim = reid_dim
|
||||
self.root_dir = root_dir
|
||||
|
||||
# init
|
||||
self._init_root_dir(root_dir)
|
||||
self._init_batch_sizes(batch_size=12, master_batch_size=-1)
|
||||
self._init_dataset_info()
|
||||
|
||||
def _init_opt(self) -> None:
|
||||
""" Default values for params that aren't exposed by TrackingLearner """
|
||||
self._opt = argparse.Namespace()
|
||||
|
||||
self._opt.task = "mot"
|
||||
self._opt.dataset = "jde"
|
||||
self._opt.resume = False
|
||||
self._opt.exp_id = "default"
|
||||
self._opt.test = False
|
||||
self._opt.num_workers = 8
|
||||
self._opt.not_cuda_benchmark = False
|
||||
self._opt.seed = 317
|
||||
self._opt.print_iter = 0
|
||||
self._opt.hide_data_time = False
|
||||
self._opt.metric = "loss"
|
||||
self._opt.vis_thresh = 0.5
|
||||
self._opt.pad = 31
|
||||
self._opt.num_stacks = 1
|
||||
self._opt.down_ratio = 4
|
||||
self._opt.input_res = -1
|
||||
self._opt.num_iters = -1
|
||||
self._opt.trainval = False
|
||||
self._opt.K = 128
|
||||
self._opt.not_prefetch_test = True
|
||||
self._opt.keep_res = False
|
||||
self._opt.fix_res = not self._opt.keep_res
|
||||
self._opt.test_mot16 = False
|
||||
self._opt.val_mot15 = False
|
||||
self._opt.test_mot15 = False
|
||||
self._opt.val_mot16 = False
|
||||
self._opt.test_mot16 = False
|
||||
self._opt.val_mot17 = False
|
||||
self._opt.val_mot20 = False
|
||||
self._opt.test_mot20 = False
|
||||
self._opt.input_video = ""
|
||||
self._opt.output_format = "video"
|
||||
self._opt.output_root = ""
|
||||
self._opt.data_cfg = ""
|
||||
self._opt.data_dir = ""
|
||||
self._opt.mse_loss = False
|
||||
self._opt.hm_gauss = 8
|
||||
self._opt.reg_loss = "l1"
|
||||
self._opt.hm_weight = 1
|
||||
self._opt.off_weight = 1
|
||||
self._opt.wh_weight = 0.1
|
||||
self._opt.id_loss = "ce"
|
||||
self._opt.id_weight = 1
|
||||
self._opt.norm_wh = False
|
||||
self._opt.dense_wh = False
|
||||
self._opt.cat_spec_wh = False
|
||||
self._opt.not_reg_offset = False
|
||||
self._opt.reg_offset = not self._opt.not_reg_offset
|
||||
def _init_root_dir(self, value):
|
||||
self.root_dir = value
|
||||
self.exp_dir = osp.join(self.root_dir, "exp", self.task)
|
||||
self.save_dir = osp.join(self.exp_dir, self.exp_id)
|
||||
self.debug_dir = osp.join(self.save_dir, "debug")
|
||||
|
||||
def _init_batch_sizes(self, batch_size, master_batch_size) -> None:
|
||||
self._opt.batch_size = batch_size
|
||||
self.batch_size = batch_size
|
||||
|
||||
self._opt.master_batch_size = (
|
||||
self.master_batch_size = (
|
||||
master_batch_size
|
||||
if master_batch_size != -1
|
||||
else self._opt.batch_size // len(self._opt.gpus)
|
||||
else self.batch_size // len(self.gpus)
|
||||
)
|
||||
rest_batch_size = self._opt.batch_size - self._opt.master_batch_size
|
||||
self._opt.chunk_sizes = [self._opt.master_batch_size]
|
||||
rest_batch_size = self.batch_size - self.master_batch_size
|
||||
self.chunk_sizes = [self.master_batch_size]
|
||||
for i in range(len(self.gpus) - 1):
|
||||
chunk = rest_batch_size // (len(self._opt.gpus) - 1)
|
||||
if i < rest_batch_size % (len(self._opt.gpus) - 1):
|
||||
chunk = rest_batch_size // (len(self.gpus) - 1)
|
||||
if i < rest_batch_size % (len(self.gpus) - 1):
|
||||
chunk += 1
|
||||
self._opt.chunk_sizes.append(chunk)
|
||||
self.chunk_sizes.append(chunk)
|
||||
|
||||
def _init_dataset_info(self) -> None:
|
||||
default_dataset_info = {
|
||||
|
@ -144,250 +147,53 @@ class opts(object):
|
|||
for k, v in entries.items():
|
||||
self.__setattr__(k, v)
|
||||
|
||||
dataset = Struct(default_dataset_info[self._opt.task])
|
||||
self._opt.dataset = dataset.dataset
|
||||
dataset = Struct(default_dataset_info[self.task])
|
||||
self.dataset = dataset.dataset
|
||||
self.update_dataset_info_and_set_heads(dataset)
|
||||
|
||||
def update_dataset_res(self, input_h, input_w) -> None:
|
||||
self._opt.input_h = input_h
|
||||
self._opt.input_w = input_w
|
||||
self._opt.output_h = self._opt.input_h // self._opt.down_ratio
|
||||
self._opt.output_w = self._opt.input_w // self._opt.down_ratio
|
||||
self._opt.input_res = max(self._opt.input_h, self._opt.input_w)
|
||||
self._opt.output_res = max(self._opt.output_h, self._opt.output_w)
|
||||
self.input_h = input_h
|
||||
self.input_w = input_w
|
||||
self.output_h = self.input_h // self.down_ratio
|
||||
self.output_w = self.input_w // self.down_ratio
|
||||
self.input_res = max(self.input_h, self.input_w)
|
||||
self.output_res = max(self.output_h, self.output_w)
|
||||
|
||||
def update_dataset_info_and_set_heads(self, dataset) -> None:
|
||||
input_h, input_w = dataset.default_resolution
|
||||
self._opt.mean, self._opt.std = dataset.mean, dataset.std
|
||||
self._opt.num_classes = dataset.num_classes
|
||||
self.mean, self.std = dataset.mean, dataset.std
|
||||
self.num_classes = dataset.num_classes
|
||||
|
||||
# input_h(w): opt.input_h overrides opt.input_res overrides dataset default
|
||||
input_h = self._opt.input_res if self._opt.input_res > 0 else input_h
|
||||
input_w = self._opt.input_res if self._opt.input_res > 0 else input_w
|
||||
self.input_h = self._opt.input_h if self._opt.input_h > 0 else input_h
|
||||
self.input_w = self._opt.input_w if self._opt.input_w > 0 else input_w
|
||||
self._opt.output_h = self._opt.input_h // self._opt.down_ratio
|
||||
self._opt.output_w = self._opt.input_w // self._opt.down_ratio
|
||||
self._opt.input_res = max(self._opt.input_h, self._opt.input_w)
|
||||
self._opt.output_res = max(self._opt.output_h, self._opt.output_w)
|
||||
# input_h(w): input_h overrides input_res overrides dataset default
|
||||
input_h = self.input_res if self.input_res > 0 else input_h
|
||||
input_w = self.input_res if self.input_res > 0 else input_w
|
||||
self.input_h = self.input_h if self.input_h > 0 else input_h
|
||||
self.input_w = self.input_w if self.input_w > 0 else input_w
|
||||
self.output_h = self.input_h // self.down_ratio
|
||||
self.output_w = self.input_w // self.down_ratio
|
||||
self.input_res = max(self.input_h, self.input_w)
|
||||
self.output_res = max(self.output_h, self.output_w)
|
||||
|
||||
if self._opt.task == "mot":
|
||||
self._opt.heads = {
|
||||
"hm": self._opt.num_classes,
|
||||
"wh": 2
|
||||
if not self._opt.cat_spec_wh
|
||||
else 2 * self._opt.num_classes,
|
||||
"id": self._opt.reid_dim,
|
||||
if self.task == "mot":
|
||||
self.heads = {
|
||||
"hm": self.num_classes,
|
||||
"wh": 2 if not self.cat_spec_wh else 2 * self.num_classes,
|
||||
"id": self.reid_dim,
|
||||
}
|
||||
if self._opt.reg_offset:
|
||||
self._opt.heads.update({"reg": 2})
|
||||
self._opt.nID = dataset.nID
|
||||
self._opt.img_size = (self._opt.input_w, self._opt.input_h)
|
||||
if self.reg_offset:
|
||||
self.heads.update({"reg": 2})
|
||||
self.nID = dataset.nID
|
||||
self.img_size = (self.input_w, self.input_h)
|
||||
else:
|
||||
assert 0, "task not defined"
|
||||
|
||||
### getters and setters ###
|
||||
@property
|
||||
def load_model(self):
|
||||
return self._load_model
|
||||
|
||||
@load_model.setter
|
||||
def load_model(self, value):
|
||||
self._load_model = value
|
||||
self._opt.load_model = self._load_model
|
||||
|
||||
@property
|
||||
def gpus(self):
|
||||
return self._gpus
|
||||
|
||||
@gpus.setter
|
||||
def gpus(self, value):
|
||||
self._gpus_str = value
|
||||
def set_gpus(self, value):
|
||||
gpus_list = [int(gpu) for gpu in value.split(",")]
|
||||
self._gpus = (
|
||||
self.gpus = (
|
||||
[i for i in range(len(gpus_list))] if gpus_list[0] >= 0 else [-1]
|
||||
)
|
||||
self._opt.gpus_str = self._gpus_str
|
||||
self._opt.gpus = self._gpus
|
||||
self.gpus_str = value
|
||||
|
||||
@property
|
||||
def save_all(self):
|
||||
return self._save_all
|
||||
|
||||
@save_all.setter
|
||||
def save_all(self, value):
|
||||
self._save_all = value
|
||||
self._opt.save_all = self._save_all
|
||||
|
||||
@property
|
||||
def arch(self):
|
||||
return self._arch
|
||||
|
||||
@arch.setter
|
||||
def arch(self, value):
|
||||
self._arch = value
|
||||
self._opt.arch = self._arch
|
||||
|
||||
@property
|
||||
def head_conv(self):
|
||||
return self._head_conv
|
||||
|
||||
@head_conv.setter
|
||||
def head_conv(self, value):
|
||||
self._head_conv = value if value != -1 else 256
|
||||
self._opt.head_conv = self._head_conv
|
||||
|
||||
@property
|
||||
def input_h(self):
|
||||
return self._input_h
|
||||
|
||||
@input_h.setter
|
||||
def input_h(self, value):
|
||||
self._input_h = value
|
||||
self._opt.input_h = self._input_h
|
||||
|
||||
@property
|
||||
def input_w(self):
|
||||
return self._input_w
|
||||
|
||||
@input_w.setter
|
||||
def input_w(self, value):
|
||||
self._input_w = value
|
||||
self._opt.input_w = self._input_w
|
||||
|
||||
@property
|
||||
def lr(self):
|
||||
return self._lr
|
||||
|
||||
@lr.setter
|
||||
def lr(self, value):
|
||||
self._lr = value
|
||||
self._opt.lr = self._lr
|
||||
|
||||
@property
|
||||
def lr_step(self):
|
||||
return self._lr_step
|
||||
|
||||
@lr_step.setter
|
||||
def lr_step(self, value):
|
||||
self._lr_step = [int(i) for i in value.split(",")]
|
||||
self._opt.lr_step = self._lr_step
|
||||
|
||||
@property
|
||||
def num_epochs(self):
|
||||
return self._num_epochs
|
||||
|
||||
@num_epochs.setter
|
||||
def num_epochs(self, value):
|
||||
self._num_epochs = value
|
||||
self._opt.num_epochs = self._num_epochs
|
||||
|
||||
@property
|
||||
def val_intervals(self):
|
||||
return self._val_intervals
|
||||
|
||||
@val_intervals.setter
|
||||
def val_intervals(self, value):
|
||||
self._val_intervals = value
|
||||
self._opt.val_intervals = self._val_intervals
|
||||
|
||||
@property
|
||||
def conf_thres(self):
|
||||
return self._conf_thres
|
||||
|
||||
@conf_thres.setter
|
||||
def conf_thres(self, value):
|
||||
self._conf_thres = value
|
||||
self._opt.conf_thres = self._conf_thres
|
||||
|
||||
@property
|
||||
def det_thres(self):
|
||||
return self._det_thres
|
||||
|
||||
@det_thres.setter
|
||||
def det_thres(self, value):
|
||||
self._det_thres = value
|
||||
self._opt.det_thres = self._det_thres
|
||||
|
||||
@property
|
||||
def nms_thres(self):
|
||||
return self._nms_thres
|
||||
|
||||
@nms_thres.setter
|
||||
def nms_thres(self, value):
|
||||
self._nms_thres = value
|
||||
self._opt.nms_thres = self._nms_thres
|
||||
|
||||
@property
|
||||
def track_buffer(self):
|
||||
return self._track_buffer
|
||||
|
||||
@track_buffer.setter
|
||||
def track_buffer(self, value):
|
||||
self._track_buffer = value
|
||||
self._opt.track_buffer = self._track_buffer
|
||||
|
||||
@property
|
||||
def min_box_area(self):
|
||||
return self._min_box_area
|
||||
|
||||
@min_box_area.setter
|
||||
def min_box_area(self, value):
|
||||
self._min_box_area = value
|
||||
self._opt.min_box_area = self._min_box_area
|
||||
|
||||
@property
|
||||
def reid_dim(self):
|
||||
return self._reid_dim
|
||||
|
||||
@reid_dim.setter
|
||||
def reid_dim(self, value):
|
||||
self._reid_dim = value
|
||||
self._opt.reid_dim = self._reid_dim
|
||||
|
||||
@property
|
||||
def root_dir(self):
|
||||
return self._root_dir
|
||||
|
||||
@root_dir.setter
|
||||
def root_dir(self, value):
|
||||
self._root_dir = value
|
||||
self._opt.root_dir = self._root_dir
|
||||
|
||||
self._opt.exp_dir = osp.join(self._root_dir, "exp", self._opt.task)
|
||||
self._opt.save_dir = osp.join(self._opt.exp_dir, self._opt.exp_id)
|
||||
self._opt.debug_dir = osp.join(self._opt.save_dir, "debug")
|
||||
|
||||
@property
|
||||
def device(self):
|
||||
return self._device
|
||||
|
||||
@device.setter
|
||||
def device(self, value):
|
||||
self._device = value
|
||||
self._opt.device = self._device
|
||||
|
||||
### getters only ####
|
||||
@property
|
||||
def opt(self):
|
||||
return self._opt
|
||||
|
||||
@property
|
||||
def resume(self):
|
||||
return self._resume
|
||||
|
||||
@property
|
||||
def task(self):
|
||||
return self._opt.task
|
||||
|
||||
@property
|
||||
def save_dir(self):
|
||||
return self._opt.save_dir
|
||||
|
||||
@property
|
||||
def chunk_sizes(self):
|
||||
return self._opt.chunk_sizes
|
||||
|
||||
@property
|
||||
def heads(self):
|
||||
return self._opt.heads
|
||||
def set_head_conv(self, value):
|
||||
h = value if value != -1 else 256
|
||||
self.head_conv = h
|
||||
|
|
|
@ -1,12 +1,140 @@
|
|||
# Copyright (c) Microsoft Corporation. All rights reserved.
|
||||
# Licensed under the MIT License.
|
||||
|
||||
import os.path as osp
|
||||
from collections import OrderedDict
|
||||
from typing import Dict, List, Tuple
|
||||
|
||||
import cv2
|
||||
import decord
|
||||
import io
|
||||
import IPython.display
|
||||
import numpy as np
|
||||
from PIL import Image
|
||||
from time import sleep
|
||||
|
||||
from .bbox import TrackingBbox
|
||||
from .model import _get_frame
|
||||
|
||||
|
||||
def plot_single_frame(
|
||||
input_video: str,
|
||||
frame_id: int,
|
||||
results: Dict[int, List[TrackingBbox]] = None
|
||||
) -> None:
|
||||
"""
|
||||
Plot the bounding boxes and track ids on a chosen frame and display it as an image in the front end.
|
||||
|
||||
Args:
|
||||
input_video: path to the input video
|
||||
frame_id: frame_id for frame to show tracking result
|
||||
results: dictionary mapping frame id to a list of predicted TrackingBboxes
|
||||
"""
|
||||
|
||||
# Extract frame
|
||||
im = _get_frame(input_video, frame_id)
|
||||
|
||||
# Overlay results
|
||||
if results:
|
||||
results = OrderedDict(sorted(results.items()))
|
||||
|
||||
# Assign bbox color per id
|
||||
unique_ids = list(
|
||||
set([bb.track_id for frame in results.values() for bb in frame])
|
||||
)
|
||||
color_map = assign_colors(unique_ids)
|
||||
|
||||
# Extract tracking results for wanted frame, and draw bboxes+tracking id, display frame
|
||||
cur_tracks = results[frame_id]
|
||||
|
||||
if len(cur_tracks) > 0:
|
||||
im = draw_boxes(im, cur_tracks, color_map)
|
||||
|
||||
# Display image
|
||||
im = Image.fromarray(im)
|
||||
IPython.display.display(im)
|
||||
|
||||
|
||||
def play_video(
|
||||
results: Dict[int, List[TrackingBbox]], input_video: str
|
||||
) -> None:
|
||||
"""
|
||||
Plot the predicted tracks on the input video and display them in the front end as a sequence of images strung together into a video.
|
||||
|
||||
Args:
|
||||
results: dictionary mapping frame id to a list of predicted TrackingBboxes
|
||||
input_video: path to the input video
|
||||
"""
|
||||
|
||||
results = OrderedDict(sorted(results.items()))
|
||||
|
||||
# assign bbox color per id
|
||||
unique_ids = list(
|
||||
set([bb.track_id for frame in results.values() for bb in frame])
|
||||
)
|
||||
color_map = assign_colors(unique_ids)
|
||||
|
||||
# read video and initialize new tracking video
|
||||
video_reader = decord.VideoReader(input_video)
|
||||
|
||||
# set up ipython jupyter display
|
||||
d_video = IPython.display.display("", display_id=1)
|
||||
|
||||
# Read each frame, add bbox+track id, display frame
|
||||
for frame_idx in range(len(results) - 1):
|
||||
cur_tracks = results[frame_idx]
|
||||
im = video_reader.next().asnumpy()
|
||||
|
||||
if len(cur_tracks) > 0:
|
||||
cur_image = draw_boxes(im, cur_tracks, color_map)
|
||||
|
||||
f = io.BytesIO()
|
||||
im = Image.fromarray(im)
|
||||
im.save(f, "jpeg")
|
||||
d_video.update(IPython.display.Image(data=f.getvalue()))
|
||||
sleep(0.000001)
|
||||
|
||||
|
||||
def write_video(
|
||||
results: Dict[int, List[TrackingBbox]], input_video: str, output_video: str
|
||||
) -> None:
|
||||
"""
|
||||
Plot the predicted tracks on the input video and write the output video to {output_video}.
|
||||
|
||||
Args:
|
||||
results: dictionary mapping frame id to a list of predicted TrackingBboxes
|
||||
input_video: path to the input video
|
||||
output_video: path to write out the output video
|
||||
"""
|
||||
results = OrderedDict(sorted(results.items()))
|
||||
# read video and initialize new tracking video
|
||||
video = cv2.VideoCapture()
|
||||
video.open(input_video)
|
||||
|
||||
im_width = int(video.get(cv2.CAP_PROP_FRAME_WIDTH))
|
||||
im_height = int(video.get(cv2.CAP_PROP_FRAME_HEIGHT))
|
||||
fourcc = cv2.VideoWriter_fourcc(*"MP4V")
|
||||
frame_rate = int(video.get(cv2.CAP_PROP_FPS))
|
||||
writer = cv2.VideoWriter(
|
||||
output_video, fourcc, frame_rate, (im_width, im_height)
|
||||
)
|
||||
|
||||
# assign bbox color per id
|
||||
unique_ids = list(
|
||||
set([bb.track_id for frame in results.values() for bb in frame])
|
||||
)
|
||||
color_map = assign_colors(unique_ids)
|
||||
|
||||
# create images and add to video writer, adapted from https://github.com/ZQPei/deep_sort_pytorch
|
||||
frame_idx = 0
|
||||
while video.grab():
|
||||
_, im = video.retrieve()
|
||||
cur_tracks = results[frame_idx]
|
||||
if len(cur_tracks) > 0:
|
||||
im = draw_boxes(im, cur_tracks, color_map)
|
||||
writer.write(im)
|
||||
frame_idx += 1
|
||||
|
||||
print(f"Output saved to {output_video}.")
|
||||
|
||||
|
||||
def draw_boxes(

@@ -14,7 +142,7 @@ def draw_boxes(
    cur_tracks: List[TrackingBbox],
    color_map: Dict[int, Tuple[int, int, int]],
) -> np.ndarray:
    """
    Overlay bbox and id labels onto the frame

    Args:

@@ -25,7 +153,6 @@ def draw_boxes(
    cur_ids = [bb.track_id for bb in cur_tracks]
    tracks = dict(zip(cur_ids, cur_tracks))

    for label, bb in tracks.items():
        left = round(bb.left)
        top = round(bb.top)

@@ -53,11 +180,11 @@ def draw_boxes(

def assign_colors(id_list: List[int],) -> Dict[int, Tuple[int, int, int]]:
    """
    Produce corresponding unique color palettes for unique ids

    Args:
        id_list: list of track ids
    """
    palette = (2 ** 11 - 1, 2 ** 15 - 1, 2 ** 20 - 1)

@@ -66,7 +193,7 @@ def assign_colors(id_list: List[int],) -> Dict[int, Tuple[int, int, int]]:
    # adapted from https://github.com/ZQPei/deep_sort_pytorch
    for i in id_list2:
        color = [int((p * ((i + 1) ** 5 - i + 1)) % 255) for p in palette]
        color = [int((p * ((i + 1) ** 4 - i + 1)) % 255) for p in palette]
        color_list.append(tuple(color))

    color_map = dict(zip(id_list, color_list))
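As a side note, the id-to-color mapping in `assign_colors` is a deterministic hash of the track id into an RGB tuple. A standalone sketch of the same idea, using one of the two formula variants shown in the hunk above (the `** 4` form); this is an illustration, not the committed implementation:

```python
# Standalone sketch of the deterministic id -> (R, G, B) hashing used in
# assign_colors above ("** 4" variant of the formula).
from typing import Tuple

palette = (2 ** 11 - 1, 2 ** 15 - 1, 2 ** 20 - 1)

def id_to_color(i: int) -> Tuple[int, int, int]:
    # same track id always yields the same color across frames
    return tuple(int((p * ((i + 1) ** 4 - i + 1)) % 255) for p in palette)

print([id_to_color(i) for i in range(3)])  # three visually distinct colors
```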
@@ -83,7 +83,7 @@ class LoadImages:  # for inference

class LoadVideo:  # for inference
    def __init__(self, path, img_size=(1088, 608)):
        self.cap = cv2.VideoCapture(path)
        self.frame_rate = int(round(self.cap.get(cv2.CAP_PROP_FPS)))
        self.vw = int(self.cap.get(cv2.CAP_PROP_FRAME_WIDTH))

@@ -94,8 +94,8 @@ class LoadVideo:  # for inference
        self.height = img_size[1]
        self.count = 0

        self.w, self.h = self.width, self.height  # EDITED
        print('Lenth of the video: {:d} frames'.format(self.vn))
        # self.w, self.h = 1920, 1080 EDITED
        # print('Lenth of the video: {:d} frames'.format(self.vn)) EDITED

    def get_size(self, vw, vh, dw, dh):
        wa, ha = float(dw) / vw, float(dh) / vh

@@ -113,7 +113,7 @@ class LoadVideo:  # for inference
        # Read image
        res, img0 = self.cap.read()  # BGR
        assert img0 is not None, 'Failed to load frame {:d}'.format(self.count)
        img0 = cv2.resize(img0, (self.w, self.h))
        img0 = cv2.resize(img0, (self.vw, self.vh))  # EDITED

        # Padded resize
        img, _, _, _ = letterbox(img0, height=self.height, width=self.width)

@@ -399,13 +399,13 @@ class JointDataset(LoadImagesAndLabels):  # for training
        self.augment = augment
        self.transforms = transforms

        print('=' * 80)
        print('dataset summary')
        print(self.tid_num)
        print('total # identities:', self.nID)
        print('start index')
        print(self.tid_start_index)
        print('=' * 80)
        # print('=' * 80)
        # print('dataset summary')
        # print(self.tid_num)
        # print('total # identities:', self.nID)
        # print('start index')
        # print(self.tid_start_index)
        # print('=' * 80)

    def __getitem__(self, files_index):
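The `LoadVideo` edits above replace the hard-coded 1920x1080 resize with the clip's own dimensions, so videos of arbitrary resolution are no longer distorted before the letterbox step. A minimal sketch of the idea, assuming OpenCV is installed and using a placeholder path:

```python
import cv2

# Sketch of the sizing change above: take the frame size from the capture
# itself rather than assuming a fixed 1920x1080 input ("clip.mp4" is a
# placeholder path).
cap = cv2.VideoCapture("clip.mp4")
vw = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
vh = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))

ok, frame = cap.read()
if ok:
    # no-op at native resolution, unlike a forced resize to (1920, 1080)
    frame = cv2.resize(frame, (vw, vh))
cap.release()
```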
@@ -68,7 +68,6 @@ class STrack(BaseTrack):
        self.kalman_filter = kalman_filter
        self.track_id = self.next_id()
        self.mean, self.covariance = self.kalman_filter.initiate(self.tlwh_to_xyah(self._tlwh))

        self.tracklet_len = 0
        self.state = TrackState.Tracked
        #self.is_activated = True

@@ -165,15 +164,21 @@ class STrack(BaseTrack):

class JDETracker(object):
    def __init__(self, opt, frame_rate=30):
    def __init__(self, opt, frame_rate=30, model=None):  # EDITED
        self.opt = opt
        if opt.gpus[0] >= 0:
            opt.device = torch.device('cuda')
        else:
            opt.device = torch.device('cpu')
        print('Creating model...')
        self.model = create_model(opt.arch, opt.heads, opt.head_conv)
        self.model = load_model(self.model, opt.load_model)
        '''
        EDITED: reuse the given model if one is passed in; only create and
        load a model from opt.load_model when model is None
        '''
        if model is not None:
            self.model = model
        else:
            # print('Creating model...')
            self.model = create_model(opt.arch, opt.heads, opt.head_conv)
            self.model = load_model(self.model, opt.load_model)
        self.model = self.model.to(opt.device)
        self.model.eval()

@@ -190,6 +195,7 @@ class JDETracker(object):
        self.std = np.array(opt.std, dtype=np.float32).reshape(1, 1, 3)

        self.kalman_filter = KalmanFilter()
        BaseTrack._count = 0  # EDITED

    def post_process(self, dets, meta):
        dets = dets.detach().cpu().numpy()
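Taken together, the `JDETracker` edits above let a notebook inject an already-loaded network and restart track ids for each new tracker instance (via `BaseTrack._count = 0`). A hypothetical sketch of how that might be exercised; `opt` (a configured FairMOT options object) and `my_model` (a network loaded elsewhere) are assumptions, not part of this commit:

```python
# Hypothetical use of the edited constructor; `opt` and `my_model` are
# assumed to exist and are not defined in this diff.
tracker_a = JDETracker(opt, frame_rate=30, model=my_model)  # reuse loaded weights

# A second tracker starts numbering its tracks from 1 again, because the
# constructor resets BaseTrack._count to 0.
tracker_b = JDETracker(opt, frame_rate=30, model=my_model)
```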