Merge pull request #593 from microsoft/staging

Sync master <-> staging
This commit is contained in:
PatrickBue 2020-07-10 20:53:12 +00:00 committed by GitHub
Parents 10a6de6666 c108a0c5a0
Commit c7d01cf0b8
14 changed files with 1730 additions and 1005 deletions

View file

@ -1,8 +1,8 @@
<img src="scenarios/media/logo_cvbp.png" align="right" alt="" width="300"/>
```diff
+ Update June 24: Added action recognition as a new core scenario.
+ Object tracking coming soon (in 2-4 weeks).
+ Update July: Added support for action recognition and tracking
+ in the new release v1.2.
```
# Computer Vision
@ -55,6 +55,7 @@ The following is a summary of commonly used Computer Vision scenarios that are c
| [Keypoints](scenarios/keypoints) | Base | Keypoint detection can be used to detect specific points on an object. A pre-trained model is provided to detect body joints for human pose estimation. |
| [Segmentation](scenarios/segmentation) | Base | Image Segmentation assigns a category to each pixel in an image. |
| [Action recognition](scenarios/action_recognition) | Base | Action recognition identifies which actions are performed in video/webcam footage (e.g. "running", "opening a bottle") and their respective start/end times. We also provide an I3D implementation of action recognition, which can be found under [contrib](contrib). |
| [Tracking](scenarios/tracking) | Base | Tracking detects and tracks multiple objects in a video sequence over time. |
| [Crowd counting](contrib/crowd_counting) | Contrib | Counting the number of people in low-crowd-density (e.g. less than 10 people) and high-crowd-density (e.g. thousands of people) scenarios.|
We separate the supported CV scenarios into two locations: (i) **base**: code and notebooks within the "utils_cv" and "scenarios" folders which follow strict coding guidelines, are well tested and maintained; (ii) **contrib**: code and other assets within the "contrib" folder, mainly covering less common CV scenarios using bleeding edge state-of-the-art approaches. Code in "contrib" is not regularly tested or maintained.

View file

@ -8,6 +8,7 @@
| [Keypoints](keypoints) | Keypoint Detection can be used to detect specific points on an object. A pre-trained model is provided to detect body joints for human pose estimation. |
| [Segmentation](segmentation) | Image Segmentation assigns a category to each pixel in an image. |
| [Action Recognition](action_recognition) | Action Recognition (also known as activity recognition) consists of classifying various actions from a sequence of frames, such as "reading" or "drinking". |
| [Tracking](tracking) | Tracking detects and tracks multiple objects in a video sequence over time. |
# Scenarios

File diff suppressed because one or more lines are too long

View file

@ -0,0 +1,384 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<i>Copyright (c) Microsoft Corporation. All rights reserved.</i>\n",
"\n",
"<i>Licensed under the MIT License.</i>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Evaluating a Multi-Object Tracking Model on MOT Challenge"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This notebook provides a framework for evaluating [FairMOT](https://github.com/ifzhang/FairMOT) on the [MOT Challenge dataset](https://motchallenge.net/).\n",
"\n",
"The MOT Challenge datasets are some of the most common benchmarking datasets for measuring multi-object tracking performance on pedestrian data. They provide distinct datasets every few years; their current offerings include MOT15, MOT16/17, and MOT 19/20. These datasets contain various annotated video sequences, each with different tracking difficulties. Additionally, MOT Challenge provides detections for tracking algorithms without detection components.\n",
"\n",
"The goal of this notebook is to re-produce published results on the MOT challenge using the state-of-the-art FairMOT approach."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Initialization"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"# Ensure edits to libraries are loaded and plotting is shown in the notebook.\n",
"%reload_ext autoreload\n",
"%autoreload 2\n",
"%matplotlib inline"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"TorchVision: 0.4.0a0+6b959ee\n",
"Torch is using GPU: Tesla K80\n"
]
}
],
"source": [
"import os\n",
"import os.path as osp\n",
"import sys\n",
"import time\n",
"\n",
"from urllib.parse import urljoin\n",
"import torch\n",
"import torchvision\n",
"\n",
"sys.path.append(\"../../\")\n",
"from utils_cv.common.data import data_path, download, unzip_url\n",
"from utils_cv.common.gpu import which_processor, is_windows\n",
"from utils_cv.tracking.data import Urls\n",
"from utils_cv.tracking.dataset import TrackingDataset\n",
"from utils_cv.tracking.model import TrackingLearner\n",
"\n",
"# Change matplotlib backend so that plots are shown for windows\n",
"if is_windows():\n",
" plt.switch_backend(\"TkAgg\")\n",
"\n",
"print(f\"TorchVision: {torchvision.__version__}\")\n",
"which_processor()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The above torchvision command displays your machine's GPUs (if it has any) and the compute that `torch/torchvision` is using."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, we will set some model runtime parameters. Here we will specify the default FairMOT model (dla34) and will evaluate against the MOT17 dataset."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"tags": [
"parameters"
]
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Using torch device: cuda\n"
]
}
],
"source": [
"CONF_THRES = 0.4\n",
"TRACK_BUFFER = 30\n",
"\n",
"# Downloaded MOT Challendage data path\n",
"MOT_ROOT_PATH = \"../../data/\"\n",
"RESULT_ROOT = \"./results\"\n",
"EXP_NAME = \"MOT_val_all_dla34\"\n",
"\n",
"BASELINE_MODEL = \"./models/all_dla34.pth\"\n",
"MOTCHALLENGE_BASE_URL = \"https://motchallenge.net/data/\"\n",
"\n",
"# train on the GPU or on the CPU, if a GPU is not available\n",
"device = torch.device(\"cuda\") if torch.cuda.is_available() else torch.device(\"cpu\")\n",
"print(f\"Using torch device: {device}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Model and Dataset Initialization"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, we will download the [MOT17](https://motchallenge.net/data/MOT17.zip) dataset and save it to `MOT_SAVED_PATH`. Note: MOT17 is around 5GB and it may take some time to download."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Training data saved to ../../data/MOT17/train\n",
"Test data saved to ../../data/MOT17/test\n"
]
}
],
"source": [
"mot_path = urljoin(MOTCHALLENGE_BASE_URL, \"MOT17.zip\")\n",
"mot_train_path = osp.join(MOT_ROOT_PATH, \"MOT17\", \"train\")\n",
"mot_test_path = osp.join(MOT_ROOT_PATH, \"MOT17\", \"test\")\n",
"# seqs_str: various video sequences subfolder names under MOT challenge data\n",
"train_seqs = [\n",
" \"MOT17-02-SDP\",\n",
" \"MOT17-04-SDP\",\n",
" \"MOT17-05-SDP\",\n",
" \"MOT17-09-SDP\",\n",
" \"MOT17-10-SDP\",\n",
" \"MOT17-11-SDP\",\n",
" \"MOT17-13-SDP\",\n",
"]\n",
"test_seqs = [\n",
" \"MOT17-01-SDP\",\n",
" \"MOT17-03-SDP\",\n",
" \"MOT17-06-SDP\",\n",
" \"MOT17-07-SDP\",\n",
" \"MOT17-08-SDP\",\n",
" \"MOT17-12-SDP\",\n",
" \"MOT17-14-SDP\",\n",
"]\n",
"\n",
"unzip_url(mot_path, dest=MOT_ROOT_PATH, exist_ok=True)\n",
"print(f\"Training data saved to {mot_train_path}\")\n",
"print(f\"Test data saved to {mot_test_path}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The pre-trained, baseline FairMOT model - `all_dla34.pth`- can be downloaded [here](https://drive.google.com/file/d/1udpOPum8fJdoEQm6n0jsIgMMViOMFinu/view). \n",
"\n",
"Please upload and save `all_dla34.pth` to the `BASELINE_MODEL` path."
]
},
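{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an optional sanity check (it only uses the `BASELINE_MODEL` path defined above), the cell below verifies that the weights file is in place before we try to load it."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Verify that the baseline weights have been uploaded to the expected location.\n",
"assert osp.exists(BASELINE_MODEL), (\n",
"    f\"Baseline model not found at {BASELINE_MODEL}; \"\n",
"    \"please download all_dla34.pth and save it to this path.\"\n",
")"
]
},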
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The code below initializes and loads the model using the TrackingLearner class."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"tracker = TrackingLearner(None, BASELINE_MODEL)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Evaluate on Training Set"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"MOT17 provides ground truth annotations for only the training set, so we will be using the training set for evaluation.\n",
"\n",
"To evaluate FairMOT on this dataset, we take advantage of the [py-motmetrics](https://github.com/cheind/py-motmetrics) repository. "
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"loaded ./models/all_dla34.pth, epoch 10\n",
"Saved tracking results to ./results/MOT_val_all_dla34/MOT17-02-SDP.txt\n",
"Evaluate seq: MOT17-02-SDP\n",
"loaded ./models/all_dla34.pth, epoch 10\n",
"Saved tracking results to ./results/MOT_val_all_dla34/MOT17-04-SDP.txt\n",
"Evaluate seq: MOT17-04-SDP\n",
"loaded ./models/all_dla34.pth, epoch 10\n",
"Saved tracking results to ./results/MOT_val_all_dla34/MOT17-05-SDP.txt\n",
"Evaluate seq: MOT17-05-SDP\n",
"loaded ./models/all_dla34.pth, epoch 10\n",
"Saved tracking results to ./results/MOT_val_all_dla34/MOT17-09-SDP.txt\n",
"Evaluate seq: MOT17-09-SDP\n",
"loaded ./models/all_dla34.pth, epoch 10\n",
"Saved tracking results to ./results/MOT_val_all_dla34/MOT17-10-SDP.txt\n",
"Evaluate seq: MOT17-10-SDP\n",
"loaded ./models/all_dla34.pth, epoch 10\n",
"Saved tracking results to ./results/MOT_val_all_dla34/MOT17-11-SDP.txt\n",
"Evaluate seq: MOT17-11-SDP\n",
"loaded ./models/all_dla34.pth, epoch 10\n",
"Saved tracking results to ./results/MOT_val_all_dla34/MOT17-13-SDP.txt\n",
"Evaluate seq: MOT17-13-SDP\n",
" IDF1 IDP IDR Rcll Prcn GT MT PT ML FP FN IDs FM MOTA MOTP IDt IDa IDm\n",
"MOT17-02-SDP 63.9% 77.4% 54.4% 68.5% 97.4% 62 22 31 9 344 5855 183 656 65.7% 0.193 98 33 13\n",
"MOT17-04-SDP 83.7% 86.1% 81.3% 86.3% 91.4% 83 51 20 12 3868 6531 32 201 78.1% 0.171 5 20 2\n",
"MOT17-05-SDP 75.9% 82.9% 70.1% 81.2% 96.0% 133 63 59 11 236 1303 79 207 76.6% 0.199 83 26 40\n",
"MOT17-09-SDP 65.4% 71.6% 60.2% 81.1% 96.5% 26 19 7 0 158 1006 52 105 77.2% 0.165 37 12 7\n",
"MOT17-10-SDP 65.0% 72.1% 59.2% 78.8% 96.0% 57 32 25 0 418 2721 149 404 74.4% 0.213 89 43 14\n",
"MOT17-11-SDP 85.6% 87.9% 83.4% 90.4% 95.2% 75 52 19 4 426 910 38 134 85.4% 0.157 24 19 13\n",
"MOT17-13-SDP 77.0% 82.0% 72.5% 83.9% 94.8% 110 74 29 7 534 1878 86 373 78.5% 0.205 72 26 35\n",
"OVERALL 76.8% 82.3% 71.9% 82.0% 93.9% 546 313 190 43 5984 20204 619 2080 76.1% 0.182 408 179 124\n"
]
}
],
"source": [
"strsummary = tracker.eval_mot(\n",
" conf_thres=CONF_THRES,\n",
" track_buffer=TRACK_BUFFER, \n",
" data_root=mot_train_path,\n",
" seqs=train_seqs,\n",
" result_root=RESULT_ROOT,\n",
" exp_name=EXP_NAME,\n",
")\n",
"print(strsummary)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Evaluate on Test Set"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For evaluating a model on the testing dataset, the MOT Challenge provides the [MOT evaluation server](https://motchallenge.net/instructions/). Here, a user can upload and submit a txt file of prediction results; the service will return metrics. After uploading our results to the [MOT17 evaluation server](https://motchallenge.net/results/MOT17/?det=Private), we can see a MOTA of 68.5 using the 'all_dla34.pth' baseline model.\n",
"\n",
"<img src=\"media/mot_results.PNG\" style=\"width: 737.5px;height: 365px\"/>\n",
"\n",
"The reported evaluation results from [FairMOT paper](https://arxiv.org/abs/2004.01888) with test set are as follows:\n",
"\n",
"| Dataset | MOTA | IDF1 | IDS | MT | ML | FPS |\n",
"|------|------|------|------|------|------|------|\n",
"| MOT16 | 68.7| 70.4| 953| 39.5%| 19.0%| 25.9|\n",
"| MOT17 | 67.5| 69.8| 2868| 37.7%| 20.8%| 25.9|"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Saved tracking results to ./results/MOT_val_all_dla34/MOT17-01-SDP.txt\n",
"Saved tracking results to ./results/MOT_val_all_dla34/MOT17-03-SDP.txt\n",
"Saved tracking results to ./results/MOT_val_all_dla34/MOT17-06-SDP.txt\n",
"Saved tracking results to ./results/MOT_val_all_dla34/MOT17-07-SDP.txt\n",
"Saved tracking results to ./results/MOT_val_all_dla34/MOT17-08-SDP.txt\n",
"Saved tracking results to ./results/MOT_val_all_dla34/MOT17-12-SDP.txt\n",
"Saved tracking results to ./results/MOT_val_all_dla34/MOT17-14-SDP.txt\n"
]
}
],
"source": [
"tracker.eval_mot(\n",
" conf_thres=CONF_THRES,\n",
" track_buffer=TRACK_BUFFER, \n",
" data_root=mot_test_path,\n",
" seqs=test_seqs,\n",
" result_root=RESULT_ROOT,\n",
" exp_name=EXP_NAME,\n",
" run_eval=False,\n",
")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.6"
},
"toc": {
"base_numbering": 1,
"nav_menu": {},
"number_sections": true,
"sideBar": true,
"skip_h1_title": false,
"title_cell": "Table of Contents",
"title_sidebar": "Contents",
"toc_cell": false,
"toc_position": {
"height": "calc(100% - 180px)",
"left": "10px",
"top": "150px",
"width": "356.258px"
},
"toc_section_display": true,
"toc_window_display": false
}
},
"nbformat": 4,
"nbformat_minor": 2
}

View file

@ -1,81 +1,113 @@
# Multi-Object Tracking
```diff
+ June 2020: This work is ongoing.
```
## Frequently asked questions
This document tries to answer frequent questions related to multi-object tracking. For generic Machine Learning questions, such as "How many training examples do I need?" or "How to monitor GPU usage during training?" see also the image classification [FAQ](https://github.com/microsoft/ComputerVision/blob/master/classification/FAQ.md).
* General
* [Why FairMOT repository for the tracking algorithm?](#why-FAIRMOT)
* [What are additional complexities that can enhance the current MOT algorithm](#What-are-additional-complexities-that-can-enhance-the-current-MOT-algorithm)
* [What is the difference between online and offline (batch) tracking algorithms?](#What-is-the-difference-between-online-and-offline-tracking-algorithms)
This document includes answers and information relating to common questions and topics regarding multi-object tracking. For more general Machine Learning questions, such as "How many training examples do I need?" or "How to monitor GPU usage during training?", see also the image classification [FAQ](https://github.com/microsoft/ComputerVision/blob/master/classification/FAQ.md).
* Data
* [How to annotate a video for evaluation?](#how-to-annotate-a-video-for-evaluation)
* [What is the MOT Challenge format used by the evaluation package?](#What-is-the-MOT-Challenge-format-used-by-the-evaluation-package)
* Technology State-of-the-Art (SoTA)
* [What is the architecture of the FairMOT tracking algorithm?](#What-is-the-architecture-of-the-FairMOT-tracking-algorithm)
* [What are SoTA object detectors used in tracking-by-detection trackers?](#What-are-SoTA-object-detectors-used-in-tracking-by-detection-trackers)
* [What are SoTA feature extraction techniques used in tracking-by-detection trackers?](#What-are-SoTA-feature-extraction-techniques-used-in-tracking-by-detection-trackers)
* [What are SoTA affinity and association techniques used in tracking-by-detection trackers?](#What-are-SoTA-affinity-and-association-techniques-used-in-tracking-by-detection-trackers)
* [What are the main evaluation metrics for tracking performance?](#What-are-the-main-evaluation-metrics)
* [How to annotate images?](#how-to-annotate-images)
* Training and Inference
* [How to improve training accuracy?](#how-to-improve-training-accuracy)
* [What are the main training parameters in FairMOT](#what-are-the-main-training-parameters-in-FairMOT)
* [What are the main inference parameters in FairMOT?](#What-are-the-main-inference-parameters-in-FairMOT])
* [What are the training losses for MOT using FairMOT?](#What-are-the-training-losses-for-MOT-using-FairMOT? )
* [What are the training losses in FairMOT?](#what-are-the-training-losses-in-fairmot)
* [What are the main inference parameters in FairMOT?](#what-are-the-main-inference-parameters-in-fairmot)
* MOT Challenge
* [What is the MOT Challenge?](#What-is-the-MOT-Challenge)
* Evaluation
* [What is the MOT Challenge?](#what-is-the-mot-challenge)
* [What are the commonly used evaluation metrics?](#what-are-the-commonly-used-evaluation-metrics)
* State-of-the-Art (SoTA) Technology
* [Popular MOT datasets](#popular-mot-datasets)
* [What is the architecture of the FairMOT tracking algorithm?](#what-is-the-architecture-of-the-fairmot-tracking-algorithm)
* [What object detectors are used in tracking-by-detection trackers?](#what-object-detectors-are-used-in-tracking-by-detection-trackers)
* [What feature extraction techniques are used in tracking-by-detection trackers?](#what-feature-extraction-techniques-are-used-in-tracking-by-detection-trackers)
* [What affinity and association techniques are used in tracking-by-detection trackers?](#what-affinity-and-association-techniques-are-used-in-tracking-by-detection-trackers)
* [What is the difference between online and offline (batch) tracking algorithms?](#what-is-the-difference-between-online-and-offline-tracking-algorithms)
* [Popular Publications and Datasets]
* [Popular Datasets](#popular-datasets)
* [Popular Publications](#popular-publications)
## General
### Why FairMOT?
FairMOT is an [open-source](https://github.com/ifzhang/FairMOT), online, one-shot tracking algorithm that has shown [competitive performance in recent MOT benchmarking challenges](https://motchallenge.net/method/MOT=3015&chl=5) at fast inference speeds.
### What are additional complexities that can enhance the current MOT algorithm?
Additional complexities include multi-camera processing and compensating for the effect of camera movement on association features using epipolar geometry.
### What is the difference between online and offline tracking algorithms?
These algorithms differ at the data association step. In online tracking, the detections in a new frame are associated with tracks generated from previous frames, so existing tracks are extended or new tracks are created. In offline (batch) tracking, all observations in a batch of frames are considered globally (see figure below), i.e. they are linked together into tracks by obtaining a globally optimal solution. Offline tracking can perform better on issues such as long-term occlusion or spatially close, similar-looking targets. However, offline tracking is slow and hence not suitable for online tasks such as autonomous driving. Recently, research has focused on online tracking algorithms, which have reached the performance of offline tracking while maintaining high inference speed.
<p align="center">
<img src="./media/fig_onlineBatch.jpg" width="400" align="center"/>
</p>
## Data
### How to annotate a video for evaluation?
We can use an annotation tool, such as VOTT, to annotate a video for ground-truth. For example, for the evaluation video, we can draw bounding boxes around the 2 cans, and tag them as `can_1` and `can_2`:
<p align="center">
<img src="./media/carcans_vott_ui.jpg" width="600" align="center"/>
</p>
### How to annotate images?
Before annotating, make sure to set the extraction rate to match that of the video. After annotation, you can export the annotation results into csv form. You will end up with the extracted frames as well as a csv file containing the bounding box and id info: ``` [image] [xmin] [y_min] [x_max] [y_max] [label]```
For training we use the exact same annotation format as for object detection (see this [FAQ](https://github.com/microsoft/computervision-recipes/blob/master/scenarios/detection/FAQ.md#how-to-annotate-images)). This also means that we train from individual frames, without taking the temporal ordering of these frames into account.
### What is the MOT Challenge format used by the evaluation package?
The evaluation package, from the [py-motmetrics](https://github.com/cheind/py-motmetrics) repository, requires the ground-truth data to be in [MOT challenge](https://motchallenge.net/) format, i.e.:
For evaluation, we follow the [py-motmetrics](https://github.com/cheind/py-motmetrics) repository which requires the ground-truth data to be in [MOT challenge](https://motchallenge.net/) format. The last 3 columns can be set to -1 by default, for the purpose of ground-truth annotation:
```
[frame number] [id number] [bbox left] [bbox top] [bbox width] [bbox height] [confidence score] [class] [visibility]
```
The last 3 columns can be set to -1 by default, for the purpose of ground-truth annotation.
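For illustration, here is a minimal sketch that writes two such ground-truth rows (the values are made up) to a `gt.txt` file:
```python
# Columns: frame, id, bb_left, bb_top, bb_width, bb_height, conf, class, visibility
rows = [
    (1, 1, 794.3, 247.6, 71.2, 174.8, -1, -1, -1),  # frame 1, track id 1
    (1, 2, 164.1, 449.3, 63.5, 158.7, -1, -1, -1),  # frame 1, track id 2
]
with open("gt.txt", "w") as f:
    for row in rows:
        f.write(",".join(str(v) for v in row) + "\n")
```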
## Technology State-of-the-Art (SoTA)
See below an example where we use [VoTT](https://github.com/microsoft/VoTT) to annotate the two cans in the image as `can_1` and `can_2`, where `can_1` refers to the white/yellow can and `can_2` refers to the red can. Before annotating, it is important to correctly set the extraction rate to match that of the video. After annotation, you can export the annotation results into several formats, such as PASCAL VOC or .csv. For the .csv format, VoTT returns the extracted frames, as well as a csv file containing the bounding box and id info:
```
[image] [xmin] [y_min] [x_max] [y_max] [label]
```
<p align="center">
<img src="./media/carcans_vott_ui.jpg" width="800" align="center"/>
</p>
Under the hood (not exposed to the user), the FairMOT repository uses the following annotation format for training, where each line describes a bounding box, as described in the [Towards-Realtime-MOT](https://github.com/Zhongdao/Towards-Realtime-MOT) repository:
```
[class] [identity] [x_center] [y_center] [width] [height]
```
The `class` field is set to 0 for all objects, as only single-class multi-object tracking (e.g. cans) is currently supported by the FairMOT repo. The `identity` field is an integer from 0 to num_identities - 1 which maps each object instance (e.g. a specific coke can or coffee can) to an integer. The values of `[x_center] [y_center] [width] [height]` are normalized by the width/height of the image, and hence range from 0 to 1.
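As an illustrative sketch (a hypothetical helper, not part of the repo), converting a pixel-coordinate box from the csv export above into a FairMOT training line could look like this:
```python
def to_fairmot_line(identity, x_min, y_min, x_max, y_max, im_width, im_height):
    """Convert a pixel-coordinate box to a FairMOT-style training annotation line."""
    x_center = (x_min + x_max) / 2.0 / im_width
    y_center = (y_min + y_max) / 2.0 / im_height
    width = (x_max - x_min) / im_width
    height = (y_max - y_min) / im_height
    # class is always 0 (single-class tracking); identity is the 0-based instance id
    return f"0 {identity} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"

print(to_fairmot_line(identity=3, x_min=120, y_min=80, x_max=200, y_max=260,
                      im_width=1920, im_height=1080))
```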
## Training and inference
### What are the training losses in FairMOT?
Losses generated by FairMOT include detection-specific losses (e.g. hm_loss, wh_loss, off_loss) and id-specific losses (id_loss). The overall loss (loss) is a weighted average of the detection-specific and id-specific losses, see the [FairMOT paper](https://arxiv.org/pdf/2004.01888v2.pdf).
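As a rough illustration only (the actual FairMOT implementation balances the two branches with learnable weights as described in the paper, not fixed constants):
```python
def total_loss(hm_loss, wh_loss, off_loss, id_loss, det_weight=1.0, id_weight=1.0):
    # Illustrative fixed weighting of the detection and re-id branches.
    det_loss = hm_loss + wh_loss + off_loss
    return det_weight * det_loss + id_weight * id_loss

print(total_loss(hm_loss=1.2, wh_loss=0.8, off_loss=0.3, id_loss=2.5))
```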
### What are the main inference parameters in FairMOT?
- input_w and input_h: image resolution of the dataset video frames
- conf_thres, nms_thres, min_box_area: thresholds used to filter out detections that do not meet the required confidence level, NMS level, or minimum box size;
- track_buffer: if a lost track is not matched for some number of frames as determined by this threshold, it is deleted, i.e. the id is not reused.
## Evaluation
### What is the MOT Challenge?
The [MOT Challenge](https://motchallenge.net/) website hosts the most common benchmarking datasets for pedestrian MOT. Several datasets exist: MOT15, MOT16/17, and MOT19/20. These datasets contain many video sequences with annotated ground truth and different tracking difficulty levels. Detections are also provided for optional use by the participating tracking algorithms.
### What are the commonly used evaluation metrics?
As multi-object-tracking is a complex CV task, many different metrics exist to evaluate tracking performance. Based on how they are computed, metrics can be event-based [CLEARMOT metrics](https://link.springer.com/content/pdf/10.1155/2008/246309.pdf) or [id-based metrics](https://arxiv.org/pdf/1609.01775.pdf). The main metrics used to gauge performance in the [MOT benchmarking challenge](https://motchallenge.net/results/MOT16/) include MOTA, IDF1, and ID-switch; a code sketch for computing them follows the list below.
* MOTA (Multiple Object Tracking Accuracy) gauges overall accuracy performance using an event-based computation of how often mismatches occur between the tracking results and ground-truth. MOTA counts FP (false positives), FN (false negatives), and id-switches (IDSW), normalized over the total number of ground-truth (GT) detections.
<p align="center">
<img src="./media/eqn_mota.jpg" width="300" align="center"/>
</p>
* IDF1 measures overall performance with id-based computation of how long the tracker correctly identifies the target. It is the harmonic mean of identification precision (IDP) and recall (IDR).
<p align="center">
<img src="./media/eqn_idf1.jpg" width="450" align="center"/>
</p>
* ID-switch measures when the tracker incorrectly changes the ID of a trajectory. This is illustrated in the following figure: in the left box, person A and person B overlap and are not detected and tracked in frames 4-5. This results in an id-switch in frame 6, where person A is assigned ID_2, which was previously assigned to person B. In another example in the right box, the tracker loses track of person A (initially identified as ID_1) after frame 3, and eventually identifies that person with a new ID (ID_2) in frame n, showing another instance of id-switch.
<p align="center">
<img src="./media/fig_tracksEval.jpg" width="600" align="center"/>
</p>
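These metrics can be computed with the [py-motmetrics](https://github.com/cheind/py-motmetrics) package used elsewhere in this repo; below is a toy two-frame sketch (the ids and distances are made up):
```python
import numpy as np
import motmetrics as mm

acc = mm.MOTAccumulator(auto_id=True)
# Frame 1: ground-truth objects 1, 2 vs. hypotheses "a", "b";
# the matrix holds gt-to-hypothesis distances (np.nan = pairing not allowed).
acc.update([1, 2], ["a", "b"], [[0.1, np.nan], [np.nan, 0.2]])
# Frame 2: object 2 is missed (FN) and hypothesis "c" is a false positive (FP).
acc.update([1, 2], ["a", "c"], [[0.2, np.nan], [np.nan, np.nan]])

mh = mm.metrics.create()
print(mh.compute(acc, metrics=["mota", "idf1", "num_switches"], name="toy"))
```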
## State-of-the-Art
### What is the architecture of the FairMOT tracking algorithm?
It consists of a single encoder-decoder neural network which extracts high resolution feature maps of the image frame. As a one-shot tracker, these feed into two parallel heads for predicting bounding boxes and re-id features respectively, see [source](https://arxiv.org/pdf/2004.01888v2.pdf):
It consists of a single encoder-decoder neural network that extracts high-resolution feature maps of the image frame. As a one-shot tracker, these feature maps feed into two parallel heads for predicting bounding boxes and re-id features respectively, see [source](https://arxiv.org/pdf/2004.01888v2.pdf):
<p align="center">
<img src="./media/figure_fairMOTarc.jpg" width="800" align="center"/>
</p>
@ -87,16 +119,16 @@ Source: [Zhang, 2020](https://arxiv.org/pdf/2004.01888v2.pdf)
</center>
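For intuition, here is a purely illustrative toy sketch of the shared-backbone, parallel-head layout (this is not the actual DLA-34-based FairMOT network; the layer sizes are made up, though the head outputs mirror the hm/wh/off/id terms used by the repo):
```python
import torch
import torch.nn as nn

class ToyOneShotTracker(nn.Module):
    """Toy stand-in for a one-shot tracker: shared backbone + parallel heads."""
    def __init__(self, emb_dim: int = 128):
        super().__init__()
        # Stand-in for the encoder-decoder backbone producing a high-resolution feature map
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
        )
        # Detection head: object-center heatmap, box size, center offset
        self.heatmap = nn.Conv2d(64, 1, 1)
        self.box_size = nn.Conv2d(64, 2, 1)
        self.offset = nn.Conv2d(64, 2, 1)
        # Re-id head: per-pixel identity embeddings
        self.reid = nn.Conv2d(64, emb_dim, 1)

    def forward(self, x):
        f = self.backbone(x)
        return {
            "hm": torch.sigmoid(self.heatmap(f)),
            "wh": self.box_size(f),
            "off": self.offset(f),
            "id": self.reid(f),
        }

outputs = ToyOneShotTracker()(torch.randn(1, 3, 256, 256))  # small input for illustration
print({k: tuple(v.shape) for k, v in outputs.items()})
```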
### What are SoTA object detectors used in tracking-by-detection trackers?
### What object detectors are used in tracking-by-detection trackers?
The most popular object detectors used by SoTA tracking algorithms include: [Faster R-CNN](https://arxiv.org/pdf/1506.01497.pdf), [SSD](https://arxiv.org/pdf/1512.02325.pdf) and [YOLOv3](https://arxiv.org/pdf/1804.02767.pdf). Please see our [object detection FAQ page](../detection/faq.md) for more details.
### What are SoTA feature extraction techniques used in tracking-by-detection trackers?
While older algorithms used local features such as optical flow or regional features (e.g. color histograms, gradient-based features or covariance matrix), newer algorithms have a deep-learning based feature representation. The most common deep-learning approaches use classical CNN to extract visual features, typically trained on re-id datasets, such as the [MARS dataset](http://www.liangzheng.com.cn/Project/project_mars.html). The following figure is an example of a CNN used for MOT by the [DeepSORT tracker](https://arxiv.org/pdf/1703.07402.pdf):
### What feature extraction techniques are used in tracking-by-detection trackers?
While older algorithms used local features, such as optical flow or regional features (e.g. color histograms, gradient-based features or covariance matrix), newer algorithms have deep-learning based feature representations. The most common deep-learning approaches, typically trained on re-id datasets, use classical CNNs to extract visual features. One such dataset is the [MARS dataset](http://www.liangzheng.com.cn/Project/project_mars.html). The following figure is an example of a CNN used for MOT by the [DeepSORT tracker](https://arxiv.org/pdf/1703.07402.pdf):
<p align="center">
<img src="./media/figure_DeepSortCNN.jpg" width="600" align="center"/>
</p>
Newer deep-learning approaches include Siamese CNN networks, LSTM networks, or CNN with correlation filters. In Siamese CNN networks, a pair of CNN networks is used to measure similarity between two objects, and the CNNs are trained with loss functions that learn features that best differentiates them.
Newer deep-learning approaches include Siamese CNN networks, LSTM networks, or CNN with correlation filters. In Siamese CNN networks, a pair of identical CNN networks are used to measure similarity between two objects, and the CNNs are trained with loss functions that learn features that best differentiate them.
<p align="center">
<img src="./media/figure_SiameseNetwork.jpg" width="400" align="center"/>
</p>
@ -106,7 +138,7 @@ Newer deep-learning approaches include Siamese CNN networks, LSTM networks, or C
</center>
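To illustrate the idea (a toy sketch only; a real re-id network would be much deeper and trained with a metric-learning loss on a re-id dataset):
```python
import torch
import torch.nn as nn

class SiameseEmbedder(nn.Module):
    """Toy embedding network: the same weights embed both crops ("Siamese")."""
    def __init__(self, emb_dim: int = 128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, emb_dim),
        )

    def forward(self, crop_a, crop_b):
        # Similarity between the two detections is the cosine similarity of their embeddings
        return nn.functional.cosine_similarity(self.backbone(crop_a), self.backbone(crop_b))

# Two batches of 128x64 pedestrian crops, e.g. detections from consecutive frames
similarity = SiameseEmbedder()(torch.randn(4, 3, 128, 64), torch.randn(4, 3, 128, 64))
print(similarity.shape)  # one similarity score per pair
```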
In LSTM network, extracted features from different detections in different time frames are used as inputs to a LSTM network, which predicts the bounding box for the next frame based on the input history.
In an LSTM network, extracted features from different detections in different time frames are used as inputs. The network predicts the bounding box for the next frame based on the input history.
<p align="center">
<img src="./media/figure_LSTM.jpg" width="550" align="center"/>
</p>
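A minimal sketch of this idea (hypothetical layer sizes; a real tracker would feed appearance features as well as box coordinates):
```python
import torch
import torch.nn as nn

class BoxLSTM(nn.Module):
    """Toy sketch: predict the next bounding box from a history of boxes."""
    def __init__(self, hidden_size: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(input_size=4, hidden_size=hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, 4)  # next [x, y, w, h]

    def forward(self, box_history):            # shape: (batch, time, 4)
        out, _ = self.lstm(box_history)
        return self.head(out[:, -1])            # prediction for the next frame

prediction = BoxLSTM()(torch.rand(2, 8, 4))     # 2 tracks, 8 frames of history each
print(prediction.shape)
```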
@ -122,52 +154,55 @@ Correlation filters can also be convolved with feature maps from CNN network to
</p>
### What are SoTA affinity and association techniques used in tracking-by-detection trackers?
Simple approaches use similarity/affinity scores calculated from distance measures over features extracted by the CNN to optimally match object detections/tracklets with established object tracks across successive frames. To do this matching, the Hungarian (Kuhn-Munkres) algorithm is often used for online data association, while K-partite graph global optimization techniques are used for offline data association.
### What affinity and association techniques are used in tracking-by-detection trackers?
Simple approaches use similarity/affinity scores calculated from distance measures over features extracted by the CNN to optimally match object detections/tracklets with established object tracks across successive frames. To do this matching, the Hungarian (Kuhn-Munkres) algorithm is often used for online data association, while K-partite graph global optimization techniques are used for offline data association.
In more complex deep-learning approaches, the affinity computation is often merged with feature extraction. For instance, [Siamese CNNs](https://arxiv.org/pdf/1907.12740.pdf) and [Siamese LSTMs](http://openaccess.thecvf.com/content_cvpr_2018_workshops/papers/w21/Wan_An_Online_and_CVPR_2018_paper.pdf) directly output the affinity score.
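For example, a minimal sketch of the assignment step using SciPy's Hungarian solver (the cost matrix values below are made up):
```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Rows = existing tracks, columns = new detections; entries are affinity
# distances (e.g. embedding distance or 1 - IoU).
cost = np.array([
    [0.1, 0.9, 0.8],
    [0.7, 0.2, 0.9],
])
track_idx, det_idx = linear_sum_assignment(cost)  # Hungarian (Kuhn-Munkres) matching
matches = [(t, d) for t, d in zip(track_idx, det_idx) if cost[t, d] < 0.5]
print(matches)  # unmatched detections would typically start new tracks
```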
### What are the main evaluation metrics?
As multi-object-tracking is a complex CV task, there exists many different metrics to evaluate the tracking performance. Based on how they are computed, metrics can be event-based [CLEARMOT metrics](https://link.springer.com/content/pdf/10.1155/2008/246309.pdf) or [id-based metrics](https://arxiv.org/pdf/1609.01775.pdf). The main metrics used to gauge performance in the [MOT benchmarking challenge](https://motchallenge.net/results/MOT16/) include MOTA, IDF1, and ID-switch.
* MOTA (Multiple Object Tracking Accuracy): it gauges overall accuracy performance, with event-based computation of how often mismatch occurs between the tracking results and ground-truth. MOTA contains the counts of FP (false-positive), FN(false negative) and id-switches (IDSW), normalized over the total number of ground-truth (GT) tracks.
<p align="center">
<img src="./media/eqn_mota.jpg" width="200" align="center"/>
</p>
* IDF1: gauges overall performance, with id-based computation of how long the tracker correctly identifies the target. It is the harmonic mean of identification precision (IDP) and recall (IDR):
<p align="center">
<img src="./media/eqn_idf1.jpg" width="450" align="center"/>
</p>
* ID-switch: when the tracker incorrectly changes the ID of the trajectory. This is illustrated in the following figure: in the left box, person A and person B overlap and are not detected and tracked in frames 4-5. This results in an id-switch in frame 6, where person A is attributed the ID_2, which previously tagged person B. In another example in the right box, the tracker loses track of person A (initially identified as ID_1) after frame 3, and eventually identifies that person with a new ID (ID_2) in frame n, showing another instance of id-switch.
### What is the difference between online and offline tracking algorithms?
Online and offline algorithms differ at their data association step. In online tracking, the detections in a new frame are associated with tracks generated previously from previous frames. Thus, existing tracks are extended or new tracks are created. In offline (batch) tracking, all observations in a batch of frames can be considered globally (see figure below), i.e. they are linked together into tracks by obtaining a global optimal solution. Offline tracking can perform better with tracking issues such as long-term occlusion, or similar targets that are spatially close. However, offline tracking tends to be slower and hence not suitable for tasks which require real-time processing such as autonomous driving.
<p align="center">
<img src="./media/fig_tracksEval.jpg" width="600" align="center"/>
<img src="./media/fig_onlineBatch.jpg" width="400" align="center"/>
</p>
## Training and inference
## Popular publications and datasets
### Popular Datasets
<center>
| Name | Year | Duration | # tracks/ids | Scene | Object type |
| ----- | ----- | -------- | -------------- | ----- | ---------- |
| [MOT15](https://arxiv.org/pdf/1504.01942.pdf)| 2015 | 16 min | 1221 | Outdoor | Pedestrians |
| [MOT16/17](https://arxiv.org/pdf/1603.00831.pdf)| 2016 | 9 min | 1276 | Outdoor & indoor | Pedestrians & vehicles |
| [CVPR19/MOT20](https://arxiv.org/pdf/1906.04567.pdf)| 2019 | 26 min | 3833 | Crowded scenes | Pedestrians & vehicles |
| [PathTrack](http://openaccess.thecvf.com/content_ICCV_2017/papers/Manen_PathTrack_Fast_Trajectory_ICCV_2017_paper.pdf)| 2017 | 172 min | 16287 | YouTube people scenes | Persons |
| [Visdrone](https://arxiv.org/pdf/1804.07437.pdf)| 2019 | - | - | Outdoor view from drone camera | Pedestrians & vehicles |
| [KITTI](http://www.jimmyren.com/papers/rrc_kitti.pdf)| 2012 | 32 min | - | Traffic scenes from car camera | Pedestrians & vehicles |
| [UA-DETRAC](https://arxiv.org/pdf/1511.04136.pdf) | 2015 | 10h | 8200 | Traffic scenes | Vehicles |
| [CamNeT](https://vcg.ece.ucr.edu/sites/g/files/rcwecm2661/files/2019-02/egpaper_final.pdf) | 2015 | 30 min | 30 | Outdoor & indoor | Persons |
</center>
### What are the main training parameters in FairMOT?
The main training parameters include batch size, learning rate and number of epochs. Additionally, FairMOT uses PyTorch's Adam optimizer by default.
### Popular publications
<center>
### How to improve training accuracy?
One can improve the training procedure by modifying the learning rate and number of epochs.
| Name | Year | MOT16 IDF1 | MOT16 MOTA | Inference Speed(fps) | Online/ Batch | Detector | Feature extraction/ motion model | Affinity & Association Approach |
| ---- | ---- | ---------- | ---------- | -------------------- | ------------- | -------- | -------------------------------- | -------------------- |
|[A Simple Baseline for Multi-object Tracking -FairMOT](https://arxiv.org/pdf/2004.01888.pdf)|2020|70.4|68.7|25.8|Online|One-shot tracker with detector head|One-shot tracker with re-id head & multi-layer feature aggregation, IOU, Kalman Filter| JV algorithm on IOU, embedding distance,
|[How to Train Your Deep Multi-Object Tracker -DeepMOT-Tracktor](https://arxiv.org/pdf/1906.06618v2.pdf)|2020|53.4|54.8|1.6|Online|Single object tracker: Faster-RCNN (Tracktor), GO-TURN, SiamRPN|Tracktor, CNN re-id module|Deep Hungarian Net using Bi-RNN|
|[Tracking without bells and whistles -Tracktor](https://arxiv.org/pdf/1903.05625.pdf)|2019|54.9|56.2|1.6|Online|Modified Faster-RCNN| Temporal bbox regression with bbox camera motion compensation, re-id embedding from Siamese CNN| Greedy heuristic to merge tracklets using re-id embedding distance|
|[Towards Real-Time Multi-Object Tracking -JDE](https://arxiv.org/pdf/1909.12605v1.pdf)|2019|55.8|64.4|18.5|Online|One-shot tracker - Faster R-CNN with FPN|One-shot - Faster R-CNN with FPN, Kalman Filter|Hungarian Algorithm|
|[Exploit the connectivity: Multi-object tracking with TrackletNet -TNT](https://arxiv.org/pdf/1811.07258.pdf)|2019|56.1|49.2|0.7|Batch|MOT challenge detections|CNN with bbox camera motion compensation, embedding feature similarity|CNN-based similarity measures between tracklet pairs; tracklet-based graph-cut optimization|
|[Extending IOU based Multi-Object Tracking by Visual Information -VIOU](http://elvera.nue.tu-berlin.de/typo3/files/1547Bochinski2018.pdf)|2018|56.1(VisDrone)|40.2(VisDrone)|20(VisDrone)|Batch|Mask R-CNN, CompACT|IOU|KCF to merge tracklets using greedy IOU heuristics|
|[Simple Online and Realtime Tracking with a Deep Association Metric -DeepSORT](https://arxiv.org/pdf/1703.07402v1.pdf)|2017|62.2| 61.4|17.4|Online|Modified Faster R-CNN|CNN re-id module, IOU, Kalman Filter|Hungarian Algorithm, cascaded approach using Mahalanobis distance (motion), embedding distance |
|[Multiple people tracking by lifted multicut and person re-identification -LMP](http://openaccess.thecvf.com/content_cvpr_2017/papers/Tang_Multiple_People_Tracking_CVPR_2017_paper.pdf)|2017|51.3|48.8|0.5|Batch|[Public detections](https://arxiv.org/pdf/1610.06136.pdf)|StackeNetPose CNN re-id module|Spatio-temporal relations, deep-matching, re-id confidence; detection-based graph lifted-multicut optimization|
### What are the training losses for MOT using FairMOT?
Losses generated by the FairMOT include detection-specific losses (e.g. hm_loss, wh_loss, off_loss) and id-specific losses (id_loss). The overall loss (loss) is a weighted average of the detection-specific and id-specific losses, see the [FairMOT paper](https://arxiv.org/pdf/2004.01888v2.pdf).
### What are the main inference parameters in FairMOT?
- input_w and input_h: image resolution of the dataset video frames;
- conf_thres, nms_thres, min_box_area: these thresholds used to filter out detections that do not meet the confidence level, nms level and size as per the user requirement;
- track_buffer: if a lost track is not matched for some number of frames as determined by this threshold, it is deleted, i.e. the id is not reused.
## MOT Challenge
### What is the MOT Challenge?
It hosts the most common benchmarking datasets for pedestrian MOT. Different datasets exist: MOT15, MOT16/17, MOT 19/20. These datasets contain many video sequences, with different tracking difficulty levels, with annotated ground-truth. Detections are also provided for optional use by the participating tracking algorithms.
</center>

View file

@ -1,17 +1,51 @@
# Multi-Object Tracking
```diff
+ June 2020: All notebooks/code in this directory is work-in-progress and might not fully execute.
```
This directory provides examples and best practices for building and inferencing multi-object tracking systems. Our goal is to enable users to bring their own datasets and to train a high-accuracy tracking model with ease. While there are many open-source trackers available, we have integrated the [FairMOT](https://github.com/ifzhang/FairMOT) tracker into this repository. The FairMOT algorithm has shown competitive tracking performance in recent MOT benchmarking challenges, while also having respectable inference speeds.
## Setup
The tracking examples in this folder only run on Linux compute targets due to constraints introduced by the [FairMOT](https://github.com/ifzhang/FairMOT) repository.
The following libraries need to be installed in the `cv` conda environment before being able to run the provided notebooks:
```
activate cv
conda install -c conda-forge opencv yacs lap progress
pip install cython_bbox motmetrics
```
This directory provides examples and best practices for building multi-object tracking systems. Our goal is to enable users to bring their own datasets and train a high-accuracy tracking model easily. While there are many open-source trackers available, we have implemented the [FairMOT tracker](https://github.com/ifzhang/FairMOT) specifically, as its algorithm has shown competitive tracking performance in recent MOT benchmarking challenges at fast inference speeds.
In addition, FairMOT's DCNv2 library needs to be compiled using this step:
```
cd utils_cv/tracking/references/fairmot/models/networks/DCNv2
sh make.sh
```
## Why FairMOT?
FairMOT is an [open-source](https://github.com/ifzhang/FairMOT), one-shot online tracking algorithm that has shown [competitive performance in recent MOT benchmarking challenges](https://motchallenge.net/method/MOT=3015&chl=5) at fast inferencing speeds.
Typical tracking algorithms address the detection and feature extraction processes in distinct successive steps. Recent research - [(Voigtlaender et al, 2019)](http://openaccess.thecvf.com/content_CVPR_2019/papers/Voigtlaender_MOTS_Multi-Object_Tracking_and_Segmentation_CVPR_2019_paper.pdf), [(Wang et al, 2019)](https://arxiv.org/pdf/1909.12605.pdf), [(Zhang et al, 2020)](https://arxiv.org/pdf/1909.12605.pdf) - has moved towards combining the detection and feature embedding processes so that they are learned in a shared model (single network), particularly when both steps involve deep learning models. This framework is called single-shot or one-shot, and has become popular in recent, high-performing models, such as FairMOT [(Zhang et al, 2020)](https://arxiv.org/pdf/1909.12605.pdf), JDE [(Wang et al, 2019)](https://arxiv.org/pdf/1909.12605.pdf) and TrackRCNN [(Voigtlaender et al, 2019)](http://openaccess.thecvf.com/content_CVPR_2019/papers/Voigtlaender_MOTS_Multi-Object_Tracking_and_Segmentation_CVPR_2019_paper.pdf). Such single-shot models are more efficient than typical tracking-by-detection models and have shown faster inference speeds due to the shared computation of the single network representation of the detection and feature embedding. On the [MOT16 Challenge dataset](https://motchallenge.net/results/MOT16/), FairMOT and JDE achieve 25.8 frames per second (fps) and 18.5 fps respectively, while DeepSORT_2, a tracking-by-detection tracker, achieves 17.4 fps.
As seen in the table below, the FairMOT model has improved tracking performance when compared to standard MOT trackers (please see [below](#What-are-the-commonly-used-evaluation-metrics) for more details on performance metrics). The JDE model, which FairMOT builds on, has a much worse ID-switch number [(Zhang et al, 2020)](https://arxiv.org/pdf/1909.12605.pdf). The JDE model uses a typical anchor-based object detector network for feature embedding with a down-sampled feature map. This leads to a misalignment between the anchors and the object center, therefore causing re-id issues. FairMOT solves these issues by estimating the object center instead of the anchors, using a higher-resolution feature map for object detection and feature embedding, and by aggregating high-level and low-level features to handle scale variations across different sizes of objects.
<center>
| Tracker | MOTA | IDF1 | ID-Switch | fps |
| -------- | ---- | ---- | --------- | --- |
|DeepSORT_2| 61.4 | 62.2 | 781 | 17.4 |
|JDE| 64.4 | 55.8 | 1544 | 18.5 |
|FairMOT| 68.7 | 70.4 | 953 | 25.8 |
</center>
## Technology
Multi-object-tracking (MOT) is one of the hot research topics in Computer Vision, due to its wide applications in autonomous driving, traffic surveillance, etc. It builds on object detection technology, in order to detect and track all objects in a dynamic scene over time. Inferring target trajectories correctly across successive image frames remains challenging: occlusion happens when objects overlap; the number of and appearance of objects can change. Compared to object detection algorithms, which aim to output rectangular bounding boxes around the objects, MOT algorithms additionally associated an ID number to each box to identify that specific object across the image frames.
Due to its applications in autonomous driving, traffic surveillance, etc., multi-object-tracking (MOT) is a popular and growing area of research within Computer Vision. MOT builds on object detection technology to detect and track objects in a dynamic scene over time. Inferring target trajectories correctly across successive image frames remains challenging. For example, occlusion can cause the number and appearance of objects to change, resulting in complications for MOT algorithms. Compared to object detection algorithms, which aim to output rectangular bounding boxes around the objects, MOT algorithms additionally associate an ID number with each box to identify that specific object across the image frames.
As seen in the figure below ([Ciaparrone, 2019](https://arxiv.org/pdf/1907.12740.pdf)), a typical multi-object-tracking algorithm performs part or all of the following steps (a minimal code sketch follows this list):
* Detection: Given the input raw image frames (step 1), the detector identifies object(s) on each image frame as bounding box(es) (step 2).
* Feature extraction/motion prediction: For every detected object, visual appearance and motion features are extracted (step 3). Sometimes, a motion predictor (e.g. Kalman Filter) is also added to predict the next position of each tracked target.
* Detection: Given the input raw image frames (step 1), the detector identifies object(s) in each image frame as bounding box(es) (step 2).
* Feature extraction/motion prediction: For every detected object, visual appearance and motion features are extracted (step 3). A motion predictor (e.g. Kalman Filter) is occasionally also added to predict the next position of each tracked target.
* Affinity: The feature and motion predictions are used to calculate similarity/distance scores between pairs of detections and/or tracklets, or the probabilities of detections belonging to a given target or tracklet (step 4).
* Association: Based on these scores/probabilities, a specific numerical ID is assigned to each detected object as it is tracked across successive image frames (step 5).
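To make steps 3-5 concrete, below is a minimal, hypothetical per-frame sketch using greedy nearest-neighbour matching on made-up appearance features (real trackers use a detector to produce the boxes/features, and Hungarian matching or learned affinities for association):
```python
import numpy as np

def associate(tracks, det_features, next_id, max_dist=0.5):
    """Greedily assign detected objects (feature vectors) to existing tracks."""
    assignments = {}
    for det_idx, feat in enumerate(det_features):
        # Affinity: distance between this detection and each track's last feature
        dists = {tid: np.linalg.norm(tfeat - feat) for tid, tfeat in tracks.items()}
        best = min(dists, key=dists.get) if dists else None
        if best is not None and dists[best] < max_dist and best not in assignments.values():
            assignments[det_idx] = best        # extend an existing track
            tracks[best] = feat
        else:
            assignments[det_idx] = next_id     # start a new track (new ID)
            tracks[next_id] = feat
            next_id += 1
    return assignments, next_id

tracks, next_id = {}, 0
for frame_features in [np.random.rand(3, 16), np.random.rand(4, 16)]:  # two toy frames
    ids, next_id = associate(tracks, frame_features, next_id)
    print(ids)  # detection index -> track ID for this frame
```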
@ -20,74 +54,20 @@ As seen in the figure below ([Ciaparrone, 2019](https://arxiv.org/pdf/1907.12740
</p>
## State-of-the-art (SoTA)
### Tracking-by-detection (two-step) vs one-shot tracker
Typical tracking algorithms address the detection and feature extraction processes in distinct successive steps. Recent research -[(Voigtlaender et al, 2019)](http://openaccess.thecvf.com/content_CVPR_2019/papers/Voigtlaender_MOTS_Multi-Object_Tracking_and_Segmentation_CVPR_2019_paper.pdf), [(Wang et al, 2019)](https://arxiv.org/pdf/1909.12605.pdf), [(Zhang et al, 2020)](https://arxiv.org/pdf/1909.12605.pdf)- has moved onto combining the detection and feature embedding processes such that they are learned in a shared model (single network), particularly when both steps involving deep learning models. This framework is called single-shot or one-shot, and recent models include FairMOT [(Zhang et al, 2020)](https://arxiv.org/pdf/1909.12605.pdf), JDE [(Wang et al, 2019)](https://arxiv.org/pdf/1909.12605.pdf) and TrackRCNN [(Voigtlaender et al, 2019)](http://openaccess.thecvf.com/content_CVPR_2019/papers/Voigtlaender_MOTS_Multi-Object_Tracking_and_Segmentation_CVPR_2019_paper.pdf). Such single-shot models are more efficient than typical tracking-by-detection models and have shown faster inference speeds due to the shared computation of the single network representation of the detection and feature embedding: on the [MOT16 Challenge dataset](https://motchallenge.net/results/MOT16/), FAIRMOT and JDE achieve 30 fps and 18.5 fps respectively, while DeepSORT_2, a tracking-by-detection tracker with lower performance achieves 17.4 fps.
As seen in the table below, the FairMOT model has a much improved tracking performance - MOTA, IDF1 (please see the [FAQ](FAQ.md) for more details on performance metrics)-, while the JDE model has a much worse ID-switch number [(Zhang et al, 2020)](https://arxiv.org/pdf/1909.12605.pdf). This is because the FairMOT model uses a typical anchor-based object detector network for feature embedding with a downsampled feature map, leading to a mis-alignment between the anchors and object center, hence re-iding issues. FairMOT solves these issues by: (i) estimating the object center instead of the anchors and using a higher resolution feature map for object detection and feature embedding, (ii) aggregating high-level and low-level features to handle scale variations across different sizes of objects.
<center>
| Tracker | MOTA | IDF1 | ID-Switch | fps |
| -------- | ---- | ---- | --------- | --- |
|DeepSORT_2| 61.4 | 62.2 | 781 | 17.4 |
|JDE| 64.4 | 64.4 | 55.8 |1544 | 18.5 |
|FairMOT| 68.7 | 70.4 | 953 | 25.8 |
</center>
### Popular datasets
<center>
| Name | Year | Duration | # tracks/ids | Scene | Object type |
| ----- | ----- | -------- | -------------- | ----- | ---------- |
| [MOT15](https://arxiv.org/pdf/1504.01942.pdf)| 2015 | 16 min | 1221 | Outdoor | Pedestrians |
| [MOT16/17](https://arxiv.org/pdf/1603.00831.pdf)| 2016 | 9 min | 1276 | Outdoor & indoor | Pedestrians & vehicles |
| [CVPR19/MOT20](https://arxiv.org/pdf/1906.04567.pdf)| 2019 | 26 min | 3833 | Crowded scenes | Pedestrians & vehicles |
| [PathTrack](http://openaccess.thecvf.com/content_ICCV_2017/papers/Manen_PathTrack_Fast_Trajectory_ICCV_2017_paper.pdf)| 2017 | 172 min | 16287 | Youtube people scenes | Persons |
| [Visdrone](https://arxiv.org/pdf/1804.07437.pdf)| 2019 | - | - | Outdoor view from drone camera | Pedestrians & vehicles |
| [KITTI](http://www.jimmyren.com/papers/rrc_kitti.pdf)| 2012 | 32 min | - | Traffic scenes from car camera | Pedestrians & vehicles |
| [UA-DETRAC](https://arxiv.org/pdf/1511.04136.pdf) | 2015 | 10h | 8200 | Traffic scenes | Vehicles |
| [CamNeT](https://vcg.ece.ucr.edu/sites/g/files/rcwecm2661/files/2019-02/egpaper_final.pdf) | 2015 | 30 min | 30 | Outdoor & indoor | Persons |
</center>
### Popular publications
| Name | Year | MOT16 IDF1 | MOT16 MOTA | Inference Speed(fps) | Online/ Batch | Detector | Feature extraction/ motion model | Affinity & Association Approach |
| ---- | ---- | ---------- | ---------- | -------------------- | ------------- | -------- | -------------------------------- | -------------------- |
|[A Simple Baseline for Multi-object Tracking -FairMOT](https://arxiv.org/pdf/2004.01888.pdf)|2020|70.4|68.7|25.8|Online|One-shot tracker with detector head|One-shot tracker with re-id head & multi-layer feature aggregation, IOU, Kalman Filter| JV algorithm on IOU, embedding distance,
|[How to Train Your Deep Multi-Object Tracker -DeepMOT-Tracktor](https://arxiv.org/pdf/1906.06618v2.pdf)|2020|53.4|54.8|1.6|Online|Single object tracker: Faster-RCNN (Tracktor), GO-TURN, SiamRPN|Tracktor, CNN re-id module|Deep Hungarian Net using Bi-RNN|
|[Tracking without bells and whistles -Tracktor](https://arxiv.org/pdf/1903.05625.pdf)|2019|54.9|56.2|1.6|Online|Modified Faster-RCNN| Temporal bbox regression with bbox camera motion compensation, re-id embedding from Siamese CNN| Greedy heuristic to merge tracklets using re-id embedding distance|
|[Towards Real-Time Multi-Object Tracking -JDE](https://arxiv.org/pdf/1909.12605v1.pdf)|2019|55.8|64.4|18.5|Online|One-shot tracker - Faster R-CNN with FPN|One-shot - Faster R-CNN with FPN, Kalman Filter|Hungarian Algorithm|
|[Exploit the connectivity: Multi-object tracking with TrackletNet -TNT](https://arxiv.org/pdf/1811.07258.pdf)|2019|56.1|49.2|0.7|Batch|MOT challenge detections|CNN with bbox camera motion compensation, embedding feature similarity|CNN-based similarity measures between tracklet pairs; tracklet-based graph-cut optimization|
|[Extending IOU based Multi-Object Tracking by Visual Information -VIOU](http://elvera.nue.tu-berlin.de/typo3/files/1547Bochinski2018.pdf)|2018|56.1(VisDrone)|40.2(VisDrone)|20(VisDrone)|Batch|Mask R-CNN, CompACT|IOU|KCF to merge tracklets using greedy IOU heuristics|
|[Simple Online and Realtime Tracking with a Deep Association Metric -DeepSORT](https://arxiv.org/pdf/1703.07402v1.pdf)|2017|62.2| 61.4|17.4|Online|Modified Faster R-CNN|CNN re-id module, IOU, Kalman Filter|Hungarian Algorithm, cascaded approach using Mahalanobis distance (motion), embedding distance |
|[Multiple people tracking by lifted multicut and person re-identification -LMP](http://openaccess.thecvf.com/content_cvpr_2017/papers/Tang_Multiple_People_Tracking_CVPR_2017_paper.pdf)|2017|51.3|48.8|0.5|Batch|[Public detections](https://arxiv.org/pdf/1610.06136.pdf)|StackeNetPose CNN re-id module|Spatio-temporal relations, deep-matching, re-id confidence; detection-based graph lifted-multicut optimization|
## Notebooks
We provide several notebooks to show how multi-object-tracking algorithms can be designed and evaluated:
| Notebook name | Description |
| --- | --- |
| [00_webcam.ipynb](./00_webcam.ipynb)| Quick-start notebook which demonstrates how to build an object tracking system using a single video or webcam as input.
| [01_training_introduction.ipynb](./01_training_introduction.ipynb)| Notebook which explains the basic concepts around model training, inferencing, and evaluation using typical tracking performance metrics.|
| [02_mot_challenge.ipynb](./02_mot_challenge.ipynb) | Notebook which runs inference on a large dataset, the MOT challenge dataset. |
| [01_training_introduction.ipynb](./01_training_introduction.ipynb)| Notebook that explains the basic concepts around model training, inferencing, and evaluation using typical tracking performance metrics.|
| [02_mot_challenge.ipynb](./02_mot_challenge.ipynb) | Notebook that runs model inference on the commonly used MOT Challenge dataset. |
## Frequently Asked Questions
## Frequently asked questions
Answers to frequently asked questions, such as "How does the technology work?" or "What data formats are required?", can be found in the [FAQ](FAQ.md) located in this folder. For generic questions, such as "How many training examples do I need?" or "How to monitor GPU usage during training?", see the [FAQ.md](../classification/FAQ.md) in the classification folder.
Answers to frequently asked questions such as "How does the technology work?", "What data formats are required?" can be found in the [FAQ](FAQ.md) located in this folder. For generic questions such as "How many training examples do I need?" or "How to monitor GPU usage during training?" see the [FAQ.md](../classification/FAQ.md) in the classification folder.
## Contribution guidelines
## Contribution Guidelines
See the [contribution guidelines](../../CONTRIBUTING.md) in the root folder.

Binary data
scenarios/tracking/media/mot_results.PNG Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 134 KiB

2
utils_cv/tracking/data.py Normal file → Executable file
View file

@ -8,9 +8,9 @@ from urllib.parse import urljoin
class Urls:
base = "https://cvbp.blob.core.windows.net/public/datasets/tracking/"
cans_path = urljoin(base, "cans.zip")
fridge_objects_path = urljoin(base, "odFridgeObjects_FairMOT-Format.zip")
carcans_annotations_path = urljoin(base, "carcans_vott-csv-export.zip")
carcans_video_path = urljoin(base, "car_cans_8s.mp4")
@classmethod
def all(cls) -> List[str]:

View file

@ -2,44 +2,71 @@
# Licensed under the MIT License.
from collections import OrderedDict
import numpy as np
from functools import partial
import os
import os.path as osp
from pathlib import Path
import random
import tempfile
from typing import Dict, List
import numpy as np
from PIL import Image
from torch.utils.data import DataLoader
from torchvision.transforms import transforms as T
from .bbox import TrackingBbox
from .references.fairmot.datasets.dataset.jde import JointDataset
from .opts import opts
from .references.fairmot.datasets.dataset.jde import JointDataset
from ..common.gpu import db_num_workers
from ..detection.dataset import parse_pascal_voc_anno
from ..detection.plot import plot_detections, plot_grid
class TrackingDataset:
"""A multi-object tracking dataset."""
def __init__(
self, data_root: str, name: str = "default", batch_size: int = 12,
self,
root: str,
name: str = "default",
batch_size: int = 12,
im_dir: str = "images",
anno_dir: str = "annotations",
) -> None:
"""
Args:
data_root: root data directory containing image and annotation subdirectories
name: user-friendly name for the dataset
batch_size: batch size
anno_dir: the name of the annotation subfolder under the root directory
im_dir: the name of the image subfolder under the root directory.
"""
transforms = T.Compose([T.ToTensor()])
self.root = root
self.name = name
self.batch_size = batch_size
self.im_dir = Path(im_dir)
self.anno_dir = Path(anno_dir)
# set these to None so that we can use the 'plot_detections' function
self.keypoints = None
self.mask_paths = None
# Init FairMOT opt object with all parameter settings
opt = opts()
train_list_path = osp.join(data_root, "{}.train".format(name))
with open(train_list_path, "a") as f:
for im_name in sorted(os.listdir(osp.join(data_root, "images"))):
f.write(osp.join("images", im_name) + "\n")
# Read annotations
self._read_annos()
# Save annotation in FairMOT format
self._write_fairMOT_format()
# Create FairMOT dataset object
transforms = T.Compose([T.ToTensor()])
self.train_data = JointDataset(
opt.opt,
data_root,
{name: train_list_path},
opt,
self.root,
{name: self.fairmot_imlist_path},
(opt.input_w, opt.input_h),
augment=True,
transforms=transforms,
@ -57,31 +84,155 @@ class TrackingDataset:
pin_memory=True,
drop_last=True,
)
def _read_annos(self) -> None:
""" Parses all Pascal VOC formatted annotation files to extract all
possible labels. """
# All annotation files are assumed to be in the anno_dir directory,
# and images in the im_dir directory
self.im_filenames = sorted(os.listdir(self.root / self.im_dir))
im_paths = [
os.path.join(self.root / self.im_dir, s) for s in self.im_filenames
]
anno_filenames = [
os.path.splitext(s)[0] + ".xml" for s in self.im_filenames
]
# Read all annotations
self.im_paths = []
self.anno_paths = []
self.anno_bboxes = []
for anno_idx, anno_filename in enumerate(anno_filenames):
anno_path = self.root / self.anno_dir / str(anno_filename)
# Parse annotation file
anno_bboxes, _, _ = parse_pascal_voc_anno(anno_path)
# Store annotation info
self.im_paths.append(im_paths[anno_idx])
self.anno_paths.append(anno_path)
self.anno_bboxes.append(anno_bboxes)
assert len(self.im_paths) == len(self.anno_paths)
# Get list of all labels
labels = []
for anno_bboxes in self.anno_bboxes:
for anno_bbox in anno_bboxes:
if anno_bbox.label_name is not None:
labels.append(anno_bbox.label_name)
self.labels = list(set(labels))
# Set for each bounding box label name also what its integer representation is
for anno_bboxes in self.anno_bboxes:
for anno_bbox in anno_bboxes:
if anno_bbox.label_name is None:
# background rectangle is assigned id 0 by design
anno_bbox.label_idx = 0
else:
label = self.labels.index(anno_bbox.label_name) + 1
anno_bbox.label_idx = label
# Get image sizes. Note that Image.open() only loads the image header,
# not the full images and is hence fast.
self.im_sizes = np.array([Image.open(p).size for p in self.im_paths])
def _write_fairMOT_format(self) -> None:
""" Write bounding box information in the format FairMOT expects for training."""
fairmot_annos_dir = os.path.join(self.root, "labels_with_ids")
os.makedirs(fairmot_annos_dir, exist_ok=True)
# Create for each image an annotation .txt file in FairMOT format
for filename, bboxes, im_size in zip(
self.im_filenames, self.anno_bboxes, self.im_sizes
):
im_width = float(im_size[0])
im_height = float(im_size[1])
fairmot_anno_path = os.path.join(
fairmot_annos_dir, filename[:-4] + ".txt"
)
with open(fairmot_anno_path, "w") as f:
for bbox in bboxes:
tid_curr = bbox.label_idx - 1
x = round(bbox.left + bbox.width() / 2.0)
y = round(bbox.top + bbox.height() / 2.0)
w = bbox.width()
h = bbox.height()
label_str = "0 {:d} {:.6f} {:.6f} {:.6f} {:.6f}\n".format(
tid_curr,
x / im_width,
y / im_height,
w / im_width,
h / im_height,
)
f.write(label_str)
# write all image filenames into a <name>.train file required by FairMOT
self.fairmot_imlist_path = osp.join(
self.root, "{}.train".format(self.name)
)
with open(self.fairmot_imlist_path, "w") as f:
for im_filename in sorted(self.im_filenames):
f.write(osp.join(self.im_dir, im_filename) + "\n")
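To make the written label format concrete, here is a small self-contained sketch with toy numbers that mirrors the formatting above: class id 0, a zero-based track id, then the box center x/y and width/height normalized by the image size.
```python
# Toy example of one FairMOT "labels_with_ids" line, mirroring _write_fairMOT_format
im_width, im_height = 640.0, 480.0
left, top, width, height = 100, 120, 60, 80  # Pascal VOC box in pixels
tid = 3                                      # track id = label_idx - 1

x_center = round(left + width / 2.0)
y_center = round(top + height / 2.0)
label_str = "0 {:d} {:.6f} {:.6f} {:.6f} {:.6f}\n".format(
    tid,
    x_center / im_width,
    y_center / im_height,
    width / im_width,
    height / im_height,
)
print(label_str)  # 0 3 0.203125 0.333333 0.093750 0.166667
```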
def show_ims(self, rows: int = 1, cols: int = 3, seed: int = None) -> None:
""" Show a set of images.
Args:
rows: the number of rows of images to display
cols: the number of columns to display (use 3 for the best-looking grid)
seed: random seed for selecting images
Returns None but displays a grid of annotated images.
"""
if seed:
random.seed(seed or self.seed)
def helper(im_paths):
idx = random.randrange(len(im_paths))
detection = {
"idx": idx,
"im_path": im_paths[idx],
"det_bboxes": [],
}
return detection, self, None, None
plot_grid(
plot_detections,
partial(helper, self.im_paths),
rows=rows,
cols=cols,
)
def boxes_to_mot(results: Dict[int, List[TrackingBbox]]) -> None:
"""
Save the predicted tracks to csv file in MOT challenge format ["frame", "id", "left", "top", "width", "height",]
Args:
results: dictionary mapping frame id to a list of predicted TrackingBboxes
txt_path: path to which results are saved in csv file
"""
"""
# convert results to dataframe in MOT challenge format
preds = OrderedDict(sorted(results.items()))
bboxes = [
[
bb.frame_id,
bb.frame_id + 1,
bb.track_id,
bb.top,
bb.left,
bb.bottom - bb.top,
bb.top,
bb.right - bb.left,
1, -1, -1, -1,
bb.bottom - bb.top,
1,
-1,
-1,
-1,
]
for _, v in preds.items()
for bb in v
]
bboxes_formatted = np.array(bboxes)
return bboxes_formatted
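For reference, one row of the returned array follows the MOT challenge column order built above: 1-based frame number, track id, bbox left/top/width/height, a confidence of 1, and three -1 placeholders. A toy sketch:
```python
# Toy example of a single MOT-challenge-format row, mirroring boxes_to_mot
frame_id, track_id = 0, 7
left, top, right, bottom = 100.0, 120.0, 160.0, 200.0

row = [frame_id + 1, track_id, left, top, right - left, bottom - top, 1, -1, -1, -1]
print(",".join(str(v) for v in row))  # 1,7,100.0,120.0,60.0,80.0,1,-1,-1,-1
```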

View file

@ -1,27 +1,26 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.
import argparse
from collections import OrderedDict, defaultdict
from collections import defaultdict
from copy import deepcopy
import glob
import requests
import os
import os.path as osp
import tempfile #KIP
from typing import Dict, List, Tuple
from typing import Dict, List, Optional, Tuple
import cv2
import matplotlib.pyplot as plt
import motmetrics as mm
import numpy as np
import torch
import torch.cuda as cuda
import torch.nn as nn
from torch.utils.data import DataLoader
import cv2
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import motmetrics as mm
from .bbox import TrackingBbox
from ..common.gpu import torch_device
from .dataset import TrackingDataset, boxes_to_mot
from .opts import opts
from .references.fairmot.datasets.dataset.jde import LoadImages, LoadVideo
from .references.fairmot.models.model import (
@ -33,54 +32,6 @@ from .references.fairmot.tracker.multitracker import JDETracker
from .references.fairmot.tracking_utils.evaluation import Evaluator
from .references.fairmot.trains.train_factory import train_factory
from .bbox import TrackingBbox
from .dataset import TrackingDataset, boxes_to_mot
from .opts import opts
from .plot import draw_boxes, assign_colors
from ..common.gpu import torch_device
BASELINE_URL = (
"https://drive.google.com/open?id=1udpOPum8fJdoEQm6n0jsIgMMViOMFinu"
)
def _download_baseline(url, destination) -> None:
"""
Download the baseline model .pth file to the destination.
Args:
url: a Google Drive url of the form "https://drive.google.com/open?id={id}"
destination: path to save the model to
Implementation based on https://stackoverflow.com/questions/38511444/python-download-files-from-google-drive-using-url
"""
def get_confirm_token(response):
for key, value in response.cookies.items():
if key.startswith("download_warning"):
return value
return None
def save_response_content(response, destination):
CHUNK_SIZE = 32768
with open(destination, "wb") as f:
for chunk in response.iter_content(CHUNK_SIZE):
if chunk: # filter out keep-alive new chunks
f.write(chunk)
session = requests.Session()
id = url.split("id=")[-1]
response = session.get(url, params={"id": id}, stream=True)
token = get_confirm_token(response)
if token:
response = session.get(
url, params={"id": id, "confirm": token}, stream=True
)
save_response_content(response, destination)
def _get_gpu_str():
if cuda.is_available():
@ -90,47 +41,80 @@ def _get_gpu_str():
return "-1" # cpu
def write_video(
results: Dict[int, List[TrackingBbox]], input_video: str, output_video: str
) -> None:
"""
Plot the predicted tracks on the input video. Write the output to {output_video}.
Args:
results: dictionary mapping frame id to a list of predicted TrackingBboxes
input_video: path to the input video
output_video: path to write out the output video
"""
results = OrderedDict(sorted(results.items()))
# read video and initialize new tracking video
def _get_frame(input_video: str, frame_id: int):
video = cv2.VideoCapture()
video.open(input_video)
video.set(cv2.CAP_PROP_POS_FRAMES, frame_id)
_, im = video.read()
im = cv2.cvtColor(im, cv2.COLOR_BGR2RGB)
return im
image_width = int(video.get(cv2.CAP_PROP_FRAME_WIDTH))
image_height = int(video.get(cv2.CAP_PROP_FRAME_HEIGHT))
fourcc = cv2.VideoWriter_fourcc(*"MP4V")
frame_rate = int(video.get(cv2.CAP_PROP_FPS))
writer = cv2.VideoWriter(
output_video, fourcc, frame_rate, (image_width, image_height)
def savetxt_results(
results: Dict[int, List[TrackingBbox]],
exp_name: str,
root_path: str,
result_filename: str,
) -> str:
"""Save tracking results to txt in provided path.
Args:
results: prediction results from predict() function, i.e. Dict[int, List[TrackingBbox]]
exp_name: subfolder for each experiment
root_path: root path for results saved
result_filename: saved prediction results txt file; end with '.txt'
Returns:
result_path: saved prediction results txt file path
"""
# Convert prediction results to mot format
bboxes_mot = boxes_to_mot(results)
# Save results
result_path = osp.join(root_path, exp_name, result_filename)
np.savetxt(result_path, bboxes_mot, delimiter=",", fmt="%s")
return result_path
def evaluate_mot(gt_root_path: str, exp_name: str, result_path: str) -> object:
""" eval code that calls on 'motmetrics' package in referenced FairMOT script, to produce MOT metrics on inference, given ground-truth.
Args:
gt_root_path: path of dataset containing GT annotations in MOTchallenge format (xywh)
exp_name: subfolder for each experiment
result_path: saved prediction results txt file path
Returns:
mot_accumulator: MOTAccumulator object from pymotmetrics package
"""
# Implementation inspired from code found here: https://github.com/ifzhang/FairMOT/blob/master/src/track.py
evaluator = Evaluator(gt_root_path, exp_name, "mot")
# Run evaluation using pymotmetrics package
mot_accumulator = evaluator.eval_file(result_path)
return mot_accumulator
def mot_summary(accumulators: list, exp_names: list) -> str:
"""Given a list of MOTAccumulators, get total summary by method in 'motmetrics', containing metrics scores
Args:
accumulators: list of MOTAccumulators
exp_names: list of experiment names (str) corresponds to MOTAccumulators
Returns:
strsummary: str output by method in 'motmetrics', containing metrics scores
"""
metrics = mm.metrics.motchallenge_metrics
mh = mm.metrics.create()
summary = Evaluator.get_summary(accumulators, exp_names, metrics)
strsummary = mm.io.render_summary(
summary,
formatters=mh.formatters,
namemap=mm.io.motchallenge_metric_names,
)
# assign bbox color per id
unique_ids = list(
set([bb.track_id for frame in results.values() for bb in frame])
)
color_map = assign_colors(unique_ids)
# create images and add to video writer, adapted from https://github.com/ZQPei/deep_sort_pytorch
frame_idx = 0
while video.grab():
_, cur_image = video.retrieve()
cur_tracks = results[frame_idx]
if len(cur_tracks) > 0:
cur_image = draw_boxes(cur_image, cur_tracks, color_map)
writer.write(cur_image)
frame_idx += 1
print(f"Output saved to {output_video}.")
return strsummary
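The three helpers above form a small pipeline: save predictions in MOT format, score them against ground truth, then render a summary. A minimal sketch of chaining them, assuming the paths below are placeholders, ground truth is stored in MOT challenge format, and the pretrained all_dla34.pth weights sit under ./models:
```python
import os
import os.path as osp

from utils_cv.tracking.model import (
    TrackingLearner,
    savetxt_results,
    evaluate_mot,
    mot_summary,
)

exp_name = "single_vid"                 # placeholder experiment subfolder
gt_root_path = "./data/my_sequence"     # placeholder folder with MOT-format ground truth
input_video = "./data/my_sequence.mp4"  # placeholder video

tracker = TrackingLearner()             # pretrained FairMOT baseline
results = tracker.predict(input_video)  # Dict[int, List[TrackingBbox]]

# Save the predictions, evaluate them against the ground truth, and print the summary
os.makedirs(osp.join(gt_root_path, exp_name), exist_ok=True)
result_path = savetxt_results(results, exp_name, gt_root_path, "results.txt")
acc = evaluate_mot(gt_root_path, exp_name, result_path)
print(mot_summary([acc], (exp_name,)))
```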
class TrackingLearner(object):
@ -138,10 +122,10 @@ class TrackingLearner(object):
def __init__(
self,
dataset: TrackingDataset,
model_path: str,
dataset: Optional[TrackingDataset] = None,
model_path: Optional[str] = None,
arch: str = "dla_34",
head_conv: int = None,
head_conv: int = -1,
) -> None:
"""
Initialize learner object.
@ -149,8 +133,8 @@ class TrackingLearner(object):
Defaults to the FairMOT model.
Args:
dataset: the dataset
model_path: path to save model
dataset: optional dataset (required for training)
model_path: optional path to pretrained model (defaults to all_dla34.pth)
arch: the model architecture
Supported architectures: resdcn_34, resdcn_50, resfpndcn_34, dla_34, hrnet_32
head_conv: conv layer channels for output head. None maps to the default setting.
@ -158,25 +142,27 @@ class TrackingLearner(object):
"""
self.opt = opts()
self.opt.arch = arch
self.opt.head_conv = head_conv if head_conv else -1
self.opt.gpus = _get_gpu_str()
self.opt.set_head_conv(head_conv)
self.opt.set_gpus(_get_gpu_str())
self.opt.device = torch_device()
self.dataset = dataset
self.model = self.init_model()
self.model_path = model_path
self.model = None
self._init_model(model_path)
def init_model(self) -> nn.Module:
def _init_model(self, model_path) -> None:
"""
Download and initialize the baseline FairMOT model.
"""
model_dir = osp.join(self.opt.root_dir, "models")
baseline_path = osp.join(model_dir, "all_dla34.pth")
# os.makedirs(model_dir, exist_ok=True)
# _download_baseline(BASELINE_URL, baseline_path)
self.opt.load_model = baseline_path
Initialize the model.
return create_model(self.opt.arch, self.opt.heads, self.opt.head_conv)
Args:
model_path: optional path to pretrained model (defaults to all_dla34.pth)
"""
if not model_path:
model_path = osp.join(self.opt.root_dir, "models", "all_dla34.pth")
assert osp.isfile(
model_path
), f"Model weights not found at {model_path}"
self.opt.load_model = model_path
def fit(
self, lr: float = 1e-4, lr_step: str = "20,27", num_epochs: int = 30
@ -191,87 +177,91 @@ class TrackingLearner(object):
Raise:
Exception if dataset is undefined
Implementation inspired from code found here: https://github.com/ifzhang/FairMOT/blob/master/src/train.py
"""
if not self.dataset:
raise Exception("No dataset provided")
if type(lr_step) is not list:
lr_step = [lr_step]
lr_step = [int(x) for x in lr_step]
opt_fit = deepcopy(self.opt) # copy opt to avoid bug
opt_fit.lr = lr
opt_fit.lr_step = lr_step
opt_fit.num_epochs = num_epochs
# update parameters
self.opt.lr = lr
self.opt.lr_step = lr_step
self.opt.num_epochs = num_epochs
opt = deepcopy(self.opt)  # to avoid FairMOT overwriting opt
# update dataset options
opt_fit.update_dataset_info_and_set_heads(self.dataset.train_data)
opt.update_dataset_info_and_set_heads(self.dataset.train_data)
# initialize dataloader
train_loader = self.dataset.train_dl
self.optimizer = torch.optim.Adam(self.model.parameters(), opt_fit.lr)
self.model = create_model(
opt.arch, opt.heads, opt.head_conv
)
self.model = load_model(self.model, opt.load_model)
self.optimizer = torch.optim.Adam(self.model.parameters(), opt.lr)
start_epoch = 0
print(f"Loading {opt_fit.load_model}")
self.model = load_model(self.model, opt_fit.load_model)
Trainer = train_factory[opt_fit.task]
trainer = Trainer(opt_fit.opt, self.model, self.optimizer)
trainer.set_device(opt_fit.gpus, opt_fit.chunk_sizes, opt_fit.device)
Trainer = train_factory[opt.task]
trainer = Trainer(opt, self.model, self.optimizer)
trainer.set_device(opt.gpus, opt.chunk_sizes, opt.device)
# initialize loss vars
self.losses_dict = defaultdict(list)
# training loop
for epoch in range(
start_epoch + 1, start_epoch + opt_fit.num_epochs + 1
start_epoch + 1, start_epoch + opt.num_epochs + 1
):
print(
"=" * 5,
f" Epoch: {epoch}/{start_epoch + opt_fit.num_epochs} ",
f" Epoch: {epoch}/{start_epoch + opt.num_epochs} ",
"=" * 5,
)
self.epoch = epoch
log_dict_train, _ = trainer.train(epoch, train_loader)
for k, v in log_dict_train.items():
print(f"{k}: {v}")
if epoch in opt_fit.lr_step:
lr = opt_fit.lr * (0.1 ** (opt_fit.lr_step.index(epoch) + 1))
for param_group in optimizer.param_groups:
if k == "time":
print(f"{k}:{v} min")
else:
print(f"{k}: {v}")
if epoch in opt.lr_step:
lr = opt.lr * (0.1 ** (opt.lr_step.index(epoch) + 1))
for param_group in self.optimizer.param_groups:
param_group["lr"] = lr
# store losses in each epoch
# store losses in each epoch
for k, v in log_dict_train.items():
if k in ['loss', 'hm_loss', 'wh_loss', 'off_loss', 'id_loss']:
if k in ["loss", "hm_loss", "wh_loss", "off_loss", "id_loss"]:
self.losses_dict[k].append(v)
# save after training because at inference-time FairMOT src reads model weights from disk
self.save(self.model_path)
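A minimal fine-tuning sketch of the loop above, assuming a dataset root with images/ and annotations/ subfolders and a local copy of the pretrained weights (paths are placeholders; note that fit() saves back to model_path):
```python
from utils_cv.tracking.dataset import TrackingDataset
from utils_cv.tracking.model import TrackingLearner

# Placeholder dataset root containing "images/" and "annotations/" subfolders
data = TrackingDataset("./data/odFridgeObjects_FairMOT-Format", name="fridge", batch_size=4)

# model_path is the checkpoint the learner loads from and, after training, saves back to
learner = TrackingLearner(dataset=data, model_path="./models/all_dla34.pth")

learner.fit(lr=1e-4, lr_step=[20, 27], num_epochs=5)  # fine-tune FairMOT
learner.plot_training_losses()
```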
def plot_training_losses(self, figsize: Tuple[int, int] = (10, 5)) -> None:
"""
Plot training loss.
def plot_training_losses(self, figsize: Tuple[int, int] = (10, 5))->None:
'''
Plots training loss from calling `fit`
Args:
figsize (optional): width and height wanted for figure of training-loss plot
'''
"""
fig = plt.figure(figsize=figsize)
ax1 = fig.add_subplot(1, 1, 1)
ax1.set_xlim([0, len(self.losses_dict['loss']) - 1])
ax1.set_xticks(range(0, len(self.losses_dict['loss'])))
ax1.set_xlim([0, len(self.losses_dict["loss"]) - 1])
ax1.set_xticks(range(0, len(self.losses_dict["loss"])))
ax1.set_xlabel("epochs")
ax1.set_ylabel("losses")
ax1.plot(self.losses_dict['loss'], c="r", label='loss')
ax1.plot(self.losses_dict['hm_loss'], c="y", label='hm_loss')
ax1.plot(self.losses_dict['wh_loss'], c="g", label='wh_loss')
ax1.plot(self.losses_dict['off_loss'], c="b", label='off_loss')
ax1.plot(self.losses_dict['id_loss'], c="m", label='id_loss')
plt.legend(loc='upper right')
ax1.plot(self.losses_dict["loss"], c="r", label="loss")
ax1.plot(self.losses_dict["hm_loss"], c="y", label="hm_loss")
ax1.plot(self.losses_dict["wh_loss"], c="g", label="wh_loss")
ax1.plot(self.losses_dict["off_loss"], c="b", label="off_loss")
ax1.plot(self.losses_dict["id_loss"], c="m", label="id_loss")
plt.legend(loc="upper right")
fig.suptitle("Training losses over epochs")
def save(self, path) -> None:
"""
Save the model to a specified path.
@ -282,95 +272,134 @@ class TrackingLearner(object):
save_model(path, self.epoch, self.model, self.optimizer)
print(f"Model saved to {path}")
def evaluate(self,
results: Dict[int, List[TrackingBbox]],
gt_root_path: str) -> str:
""" eval code that calls on 'motmetrics' package in referenced FairMOT script, to produce MOT metrics on inference, given ground-truth.
def evaluate(
self, results: Dict[int, List[TrackingBbox]], gt_root_path: str
) -> str:
"""
Evaluate performance wrt MOTA, MOTP, track quality measures, global ID measures, and more,
as computed by py-motmetrics on a single experiment. By default, use 'single_vid' as exp_name.
Args:
results: prediction results from predict() function, i.e. Dict[int, List[TrackingBbox]]
gt_root_path: path of dataset containing GT annotations in MOTchallenge format (xywh)
Returns:
strsummary: str output by method in 'motmetrics' package, containing metrics scores
"""
#Implementation inspired from code found here: https://github.com/ifzhang/FairMOT/blob/master/src/track.py
evaluator = Evaluator(gt_root_path, "single_vid", "mot")
with tempfile.TemporaryDirectory() as tmpdir1:
os.makedirs(osp.join(tmpdir1,'results'))
result_filename = osp.join(tmpdir1,'results', 'results.txt')
# Save results im MOT format for evaluation
bboxes_mot = boxes_to_mot(results)
np.savetxt(result_filename, bboxes_mot, delimiter=",", fmt="%s")
# Run evaluation using pymotmetrics package
accs=[evaluator.eval_file(result_filename)]
# get summary
metrics = mm.metrics.motchallenge_metrics
mh = mm.metrics.create()
summary = Evaluator.get_summary(accs, ("single_vid",), metrics)
strsummary = mm.io.render_summary(
summary,
formatters=mh.formatters,
namemap=mm.io.motchallenge_metric_names
# Implementation inspired from code found here: https://github.com/ifzhang/FairMOT/blob/master/src/track.py
result_path = savetxt_results(
results, "single_vid", gt_root_path, "results.txt"
)
print(strsummary)
# Run evaluation on the saved tracking results
mot_accumulator = evaluate_mot(gt_root_path, "single_vid", result_path)
strsummary = mot_summary([mot_accumulator], ("single_vid",))
return strsummary
def eval_mot(
self,
conf_thres: float,
track_buffer: int,
data_root: str,
seqs: list,
result_root: str,
exp_name: str,
run_eval: bool = True,
) -> str:
"""
Calls the prediction function, saves the tracking results to a txt file, and returns the evaluation results in motmetrics format.
Args:
conf_thres: confidence thresh for tracking
track_buffer: tracking buffer
data_root: data root path
seqs: list of video sequences subfolder names under MOT challenge data
result_root: tracking result path
exp_name: experiment name
run_eval: whether to run evaluation on the provided data
Returns:
strsummary: str output by method in 'motmetrics' package, containing metrics scores
"""
accumulators = []
eval_path = osp.join(result_root, exp_name)
if not osp.exists(eval_path):
os.makedirs(eval_path)
# Loop over all video sequences
for seq in seqs:
result_filename = "{}.txt".format(seq)
im_path = osp.join(data_root, seq, "img1")
result_path = osp.join(result_root, exp_name, result_filename)
with open(osp.join(data_root, seq, "seqinfo.ini")) as seqinfo_file:
meta_info = seqinfo_file.read()
# frame_rate is set from seqinfo.ini by frameRate
frame_rate = int(
meta_info[
meta_info.find("frameRate")
+ 10 : meta_info.find("\nseqLength")
]
)
# Run model inference
if not osp.exists(result_path):
eval_results = self.predict(
im_or_video_path=im_path,
conf_thres=conf_thres,
track_buffer=track_buffer,
frame_rate=frame_rate,
)
result_path = savetxt_results(
eval_results, exp_name, result_root, result_filename
)
print(f"Saved tracking results to {result_path}")
else:
print(f"Loaded tracking results from {result_path}")
# Run evaluation
if run_eval:
print(f"Evaluate seq: {seq}")
mot_accumulator = evaluate_mot(data_root, seq, result_path)
accumulators.append(mot_accumulator)
if run_eval:
strsummary = mot_summary(accumulators, seqs)
return strsummary
else:
return None
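A minimal usage sketch of eval_mot on MOT-challenge-style data (paths and sequence names are placeholders; each sequence folder is expected to contain img1/ and seqinfo.ini as assumed above, and the pretrained all_dla34.pth weights under ./models):
```python
from utils_cv.tracking.model import TrackingLearner

tracker = TrackingLearner()  # pretrained FairMOT baseline

strsummary = tracker.eval_mot(
    conf_thres=0.4,
    track_buffer=30,
    data_root="./data/MOT17/train",         # placeholder MOT challenge root
    seqs=["MOT17-02-SDP", "MOT17-04-SDP"],  # placeholder sequence folder names
    result_root="./results",
    exp_name="mot17_eval",
)
print(strsummary)
```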
def predict(
self,
im_or_video_path: str,
conf_thres: float = 0.6,
det_thres: float = 0.3,
nms_thres: float = 0.4,
track_buffer: int = 30,
min_box_area: float = 200,
im_size: Tuple[float, float] = (None, None),
frame_rate: int = 30,
) -> Dict[int, List[TrackingBbox]]:
"""
Performs inferencing on an image or video path.
Run inference on an image or video path.
Args:
im_or_video_path: path to image(s) or video. Supports jpg, jpeg, png, tif formats for images.
Supports mp4, avi formats for video.
conf_thres: confidence thresh for tracking
det_thres: confidence thresh for detection
nms_thres: iou thresh for nms
track_buffer: tracking buffer
min_box_area: filter out tiny boxes
im_size: (input height, input width)
frame_rate: frame rate
Returns a dictionary mapping frame id to a list of TrackingBboxes
Implementation inspired from code found here: https://github.com/ifzhang/FairMOT/blob/master/src/track.py
"""
opt_pred = deepcopy(self.opt) # copy opt to avoid bug
opt_pred.conf_thres = conf_thres
opt_pred.det_thres = det_thres
opt_pred.nms_thres = nms_thres
opt_pred.track_buffer = track_buffer
opt_pred.min_box_area = min_box_area
input_h, input_w = im_size
input_height = input_h if input_h else -1
input_width = input_w if input_w else -1
opt_pred.update_dataset_res(input_height, input_width)
self.opt.conf_thres = conf_thres
self.opt.track_buffer = track_buffer
self.opt.min_box_area = min_box_area
opt = deepcopy(self.opt)  # to avoid FairMOT overwriting opt
# initialize tracker
opt_pred.load_model = self.model_path
tracker = JDETracker(opt_pred.opt, frame_rate=frame_rate)
tracker = JDETracker(opt, frame_rate=frame_rate, model=self.model)
# initialize dataloader
dataloader = self._get_dataloader(
im_or_video_path, opt_pred.input_h, opt_pred.input_w
)
dataloader = self._get_dataloader(im_or_video_path)
frame_id = 0
out = {}
@ -384,9 +413,9 @@ class TrackingLearner(object):
tlbr = t.tlbr
tid = t.track_id
vertical = tlwh[2] / tlwh[3] > 1.6
if tlwh[2] * tlwh[3] > opt_pred.min_box_area and not vertical:
if tlwh[2] * tlwh[3] > opt.min_box_area and not vertical:
bb = TrackingBbox(
tlbr[1], tlbr[0], tlbr[3], tlbr[2], frame_id, tid
tlbr[0], tlbr[1], tlbr[2], tlbr[3], frame_id, tid
)
online_bboxes.append(bb)
out[frame_id] = online_bboxes
@ -394,11 +423,9 @@ class TrackingLearner(object):
return out
def _get_dataloader(
self, im_or_video_path: str, input_h, input_w
) -> DataLoader:
def _get_dataloader(self, im_or_video_path: str) -> DataLoader:
"""
Creates a dataloader from images or video in the given path.
Create a dataloader from images or video in the given path.
Args:
im_or_video_path: path to a root directory of images, or single video or image file.
@ -429,18 +456,18 @@ class TrackingLearner(object):
)
> 0
):
return LoadImages(im_or_video_path, img_size=(input_w, input_h))
return LoadImages(im_or_video_path)
# if path is to a single video file
elif (
osp.isfile(im_or_video_path)
and osp.splitext(im_or_video_path)[1] in video_format
):
return LoadVideo(im_or_video_path, img_size=(input_w, input_h))
return LoadVideo(im_or_video_path)
# if path is to a single image file
elif (
osp.isfile(im_or_video_path)
and osp.splitext(im_or_video_path)[1] in im_format
):
return LoadImages(im_or_video_path, img_size=(input_w, input_h))
return LoadImages(im_or_video_path)
else:
raise Exception("Image or video format not supported")

View file

@ -8,21 +8,21 @@ import os.path as osp
class opts(object):
"""
Defines options for experiment settings, system settings, logging, model params,
input config, training config, testing config, and tracking params.
"""
def __init__(
self,
load_model: str = "",
gpus: str = "0, 1",
gpus=[0, 1],
save_all: bool = False,
arch: str = "dla_34",
head_conv: int = -1,
input_h: int = -1,
input_w: int = -1,
lr: float = 1e-4,
lr_step: str = "20,27",
lr_step=[20, 27],
num_epochs: int = 30,
num_iters: int = -1,
val_intervals: int = 5,
@ -34,13 +34,62 @@ class opts(object):
reid_dim: int = 512,
root_dir: str = os.getcwd(),
) -> None:
self._init_opt()
# Set defaults for parameters which are less important
self.task = "mot"
self.dataset = "jde"
self.resume = False
self.exp_id = "default"
self.test = False
self.num_workers = 8
self.not_cuda_benchmark = False
self.seed = 317
self.print_iter = 0
self.hide_data_time = False
self.metric = "loss"
self.vis_thresh = 0.5
self.pad = 31
self.num_stacks = 1
self.down_ratio = 4
self.input_res = -1
self.num_iters = -1
self.trainval = False
self.K = 128
self.not_prefetch_test = True
self.keep_res = False
self.fix_res = not self.keep_res
self.test_mot16 = False
self.val_mot15 = False
self.test_mot15 = False
self.val_mot16 = False
self.test_mot16 = False
self.val_mot17 = False
self.val_mot20 = False
self.test_mot20 = False
self.input_video = ""
self.output_format = "video"
self.output_root = ""
self.data_cfg = ""
self.data_dir = ""
self.mse_loss = False
self.hm_gauss = 8
self.reg_loss = "l1"
self.hm_weight = 1
self.off_weight = 1
self.wh_weight = 0.1
self.id_loss = "ce"
self.id_weight = 1
self.norm_wh = False
self.dense_wh = False
self.cat_spec_wh = False
self.not_reg_offset = False
self.reg_offset = not self.not_reg_offset
# Set/overwrite defaults for parameters which are more important
self.load_model = load_model
self.gpus = gpus
self.save_all = save_all
self.arch = arch
self.head_conv = head_conv
self.set_head_conv(head_conv)
self.input_h = input_h
self.input_w = input_w
self.lr = lr
@ -53,79 +102,33 @@ class opts(object):
self.track_buffer = track_buffer
self.min_box_area = min_box_area
self.reid_dim = reid_dim
self.root_dir = root_dir
# init
self._init_root_dir(root_dir)
self._init_batch_sizes(batch_size=12, master_batch_size=-1)
self._init_dataset_info()
def _init_opt(self) -> None:
""" Default values for params that aren't exposed by TrackingLearner """
self._opt = argparse.Namespace()
self._opt.task = "mot"
self._opt.dataset = "jde"
self._opt.resume = False
self._opt.exp_id = "default"
self._opt.test = False
self._opt.num_workers = 8
self._opt.not_cuda_benchmark = False
self._opt.seed = 317
self._opt.print_iter = 0
self._opt.hide_data_time = False
self._opt.metric = "loss"
self._opt.vis_thresh = 0.5
self._opt.pad = 31
self._opt.num_stacks = 1
self._opt.down_ratio = 4
self._opt.input_res = -1
self._opt.num_iters = -1
self._opt.trainval = False
self._opt.K = 128
self._opt.not_prefetch_test = True
self._opt.keep_res = False
self._opt.fix_res = not self._opt.keep_res
self._opt.test_mot16 = False
self._opt.val_mot15 = False
self._opt.test_mot15 = False
self._opt.val_mot16 = False
self._opt.test_mot16 = False
self._opt.val_mot17 = False
self._opt.val_mot20 = False
self._opt.test_mot20 = False
self._opt.input_video = ""
self._opt.output_format = "video"
self._opt.output_root = ""
self._opt.data_cfg = ""
self._opt.data_dir = ""
self._opt.mse_loss = False
self._opt.hm_gauss = 8
self._opt.reg_loss = "l1"
self._opt.hm_weight = 1
self._opt.off_weight = 1
self._opt.wh_weight = 0.1
self._opt.id_loss = "ce"
self._opt.id_weight = 1
self._opt.norm_wh = False
self._opt.dense_wh = False
self._opt.cat_spec_wh = False
self._opt.not_reg_offset = False
self._opt.reg_offset = not self._opt.not_reg_offset
def _init_root_dir(self, value):
self.root_dir = value
self.exp_dir = osp.join(self.root_dir, "exp", self.task)
self.save_dir = osp.join(self.exp_dir, self.exp_id)
self.debug_dir = osp.join(self.save_dir, "debug")
def _init_batch_sizes(self, batch_size, master_batch_size) -> None:
self._opt.batch_size = batch_size
self.batch_size = batch_size
self._opt.master_batch_size = (
self.master_batch_size = (
master_batch_size
if master_batch_size != -1
else self._opt.batch_size // len(self._opt.gpus)
else self.batch_size // len(self.gpus)
)
rest_batch_size = self._opt.batch_size - self._opt.master_batch_size
self._opt.chunk_sizes = [self._opt.master_batch_size]
rest_batch_size = self.batch_size - self.master_batch_size
self.chunk_sizes = [self.master_batch_size]
for i in range(len(self.gpus) - 1):
chunk = rest_batch_size // (len(self._opt.gpus) - 1)
if i < rest_batch_size % (len(self._opt.gpus) - 1):
chunk = rest_batch_size // (len(self.gpus) - 1)
if i < rest_batch_size % (len(self.gpus) - 1):
chunk += 1
self._opt.chunk_sizes.append(chunk)
self.chunk_sizes.append(chunk)
def _init_dataset_info(self) -> None:
default_dataset_info = {
@ -144,250 +147,53 @@ class opts(object):
for k, v in entries.items():
self.__setattr__(k, v)
dataset = Struct(default_dataset_info[self._opt.task])
self._opt.dataset = dataset.dataset
dataset = Struct(default_dataset_info[self.task])
self.dataset = dataset.dataset
self.update_dataset_info_and_set_heads(dataset)
def update_dataset_res(self, input_h, input_w) -> None:
self._opt.input_h = input_h
self._opt.input_w = input_w
self._opt.output_h = self._opt.input_h // self._opt.down_ratio
self._opt.output_w = self._opt.input_w // self._opt.down_ratio
self._opt.input_res = max(self._opt.input_h, self._opt.input_w)
self._opt.output_res = max(self._opt.output_h, self._opt.output_w)
self.input_h = input_h
self.input_w = input_w
self.output_h = self.input_h // self.down_ratio
self.output_w = self.input_w // self.down_ratio
self.input_res = max(self.input_h, self.input_w)
self.output_res = max(self.output_h, self.output_w)
def update_dataset_info_and_set_heads(self, dataset) -> None:
input_h, input_w = dataset.default_resolution
self._opt.mean, self._opt.std = dataset.mean, dataset.std
self._opt.num_classes = dataset.num_classes
self.mean, self.std = dataset.mean, dataset.std
self.num_classes = dataset.num_classes
# input_h(w): opt.input_h overrides opt.input_res overrides dataset default
input_h = self._opt.input_res if self._opt.input_res > 0 else input_h
input_w = self._opt.input_res if self._opt.input_res > 0 else input_w
self.input_h = self._opt.input_h if self._opt.input_h > 0 else input_h
self.input_w = self._opt.input_w if self._opt.input_w > 0 else input_w
self._opt.output_h = self._opt.input_h // self._opt.down_ratio
self._opt.output_w = self._opt.input_w // self._opt.down_ratio
self._opt.input_res = max(self._opt.input_h, self._opt.input_w)
self._opt.output_res = max(self._opt.output_h, self._opt.output_w)
# input_h(w): input_h overrides input_res overrides dataset default
input_h = self.input_res if self.input_res > 0 else input_h
input_w = self.input_res if self.input_res > 0 else input_w
self.input_h = self.input_h if self.input_h > 0 else input_h
self.input_w = self.input_w if self.input_w > 0 else input_w
self.output_h = self.input_h // self.down_ratio
self.output_w = self.input_w // self.down_ratio
self.input_res = max(self.input_h, self.input_w)
self.output_res = max(self.output_h, self.output_w)
if self._opt.task == "mot":
self._opt.heads = {
"hm": self._opt.num_classes,
"wh": 2
if not self._opt.cat_spec_wh
else 2 * self._opt.num_classes,
"id": self._opt.reid_dim,
if self.task == "mot":
self.heads = {
"hm": self.num_classes,
"wh": 2 if not self.cat_spec_wh else 2 * self.num_classes,
"id": self.reid_dim,
}
if self._opt.reg_offset:
self._opt.heads.update({"reg": 2})
self._opt.nID = dataset.nID
self._opt.img_size = (self._opt.input_w, self._opt.input_h)
if self.reg_offset:
self.heads.update({"reg": 2})
self.nID = dataset.nID
self.img_size = (self.input_w, self.input_h)
else:
assert 0, "task not defined"
### getters and setters ###
@property
def load_model(self):
return self._load_model
@load_model.setter
def load_model(self, value):
self._load_model = value
self._opt.load_model = self._load_model
@property
def gpus(self):
return self._gpus
@gpus.setter
def gpus(self, value):
self._gpus_str = value
def set_gpus(self, value):
gpus_list = [int(gpu) for gpu in value.split(",")]
self._gpus = (
self.gpus = (
[i for i in range(len(gpus_list))] if gpus_list[0] >= 0 else [-1]
)
self._opt.gpus_str = self._gpus_str
self._opt.gpus = self._gpus
self.gpus_str = value
@property
def save_all(self):
return self._save_all
@save_all.setter
def save_all(self, value):
self._save_all = value
self._opt.save_all = self._save_all
@property
def arch(self):
return self._arch
@arch.setter
def arch(self, value):
self._arch = value
self._opt.arch = self._arch
@property
def head_conv(self):
return self._head_conv
@head_conv.setter
def head_conv(self, value):
self._head_conv = value if value != -1 else 256
self._opt.head_conv = self._head_conv
@property
def input_h(self):
return self._input_h
@input_h.setter
def input_h(self, value):
self._input_h = value
self._opt.input_h = self._input_h
@property
def input_w(self):
return self._input_w
@input_w.setter
def input_w(self, value):
self._input_w = value
self._opt.input_w = self._input_w
@property
def lr(self):
return self._lr
@lr.setter
def lr(self, value):
self._lr = value
self._opt.lr = self._lr
@property
def lr_step(self):
return self._lr_step
@lr_step.setter
def lr_step(self, value):
self._lr_step = [int(i) for i in value.split(",")]
self._opt.lr_step = self._lr_step
@property
def num_epochs(self):
return self._num_epochs
@num_epochs.setter
def num_epochs(self, value):
self._num_epochs = value
self._opt.num_epochs = self._num_epochs
@property
def val_intervals(self):
return self._val_intervals
@val_intervals.setter
def val_intervals(self, value):
self._val_intervals = value
self._opt.val_intervals = self._val_intervals
@property
def conf_thres(self):
return self._conf_thres
@conf_thres.setter
def conf_thres(self, value):
self._conf_thres = value
self._opt.conf_thres = self._conf_thres
@property
def det_thres(self):
return self._det_thres
@det_thres.setter
def det_thres(self, value):
self._det_thres = value
self._opt.det_thres = self._det_thres
@property
def nms_thres(self):
return self._nms_thres
@nms_thres.setter
def nms_thres(self, value):
self._nms_thres = value
self._opt.nms_thres = self._nms_thres
@property
def track_buffer(self):
return self._track_buffer
@track_buffer.setter
def track_buffer(self, value):
self._track_buffer = value
self._opt.track_buffer = self._track_buffer
@property
def min_box_area(self):
return self._min_box_area
@min_box_area.setter
def min_box_area(self, value):
self._min_box_area = value
self._opt.min_box_area = self._min_box_area
@property
def reid_dim(self):
return self._reid_dim
@reid_dim.setter
def reid_dim(self, value):
self._reid_dim = value
self._opt.reid_dim = self._reid_dim
@property
def root_dir(self):
return self._root_dir
@root_dir.setter
def root_dir(self, value):
self._root_dir = value
self._opt.root_dir = self._root_dir
self._opt.exp_dir = osp.join(self._root_dir, "exp", self._opt.task)
self._opt.save_dir = osp.join(self._opt.exp_dir, self._opt.exp_id)
self._opt.debug_dir = osp.join(self._opt.save_dir, "debug")
@property
def device(self):
return self._device
@device.setter
def device(self, value):
self._device = value
self._opt.device = self._device
### getters only ####
@property
def opt(self):
return self._opt
@property
def resume(self):
return self._resume
@property
def task(self):
return self._opt.task
@property
def save_dir(self):
return self._opt.save_dir
@property
def chunk_sizes(self):
return self._opt.chunk_sizes
@property
def heads(self):
return self._opt.heads
def set_head_conv(self, value):
h = value if value != -1 else 256
self.head_conv = h
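After the refactoring above (plain attributes instead of a nested argparse.Namespace), the options object is used roughly as follows; a brief sketch with arbitrary values:
```python
from utils_cv.tracking.opts import opts

opt = opts(arch="dla_34", lr=1e-4, num_epochs=30)
opt.set_gpus("0")      # comma-separated string, e.g. "0,1"; use "-1" for CPU
opt.set_head_conv(-1)  # -1 maps to the default of 256

print(opt.heads, opt.gpus, opt.head_conv)
```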

View file

@ -1,12 +1,140 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.
import os.path as osp
from collections import OrderedDict
from typing import Dict, List, Tuple
import cv2
import decord
import io
import IPython.display
import numpy as np
from PIL import Image
from time import sleep
from .bbox import TrackingBbox
from .model import _get_frame
def plot_single_frame(
input_video: str,
frame_id: int,
results: Dict[int, List[TrackingBbox]] = None
) -> None:
"""
Plot the bounding boxes and track ids on the requested frame and display the result as an image in the front end.
Args:
input_video: path to the input video
frame_id: frame_id for frame to show tracking result
results: dictionary mapping frame id to a list of predicted TrackingBboxes
"""
# Extract frame
im = _get_frame(input_video, frame_id)
# Overlay results
if results:
results = OrderedDict(sorted(results.items()))
# Assign bbox color per id
unique_ids = list(
set([bb.track_id for frame in results.values() for bb in frame])
)
color_map = assign_colors(unique_ids)
# Extract tracking results for the requested frame, draw bboxes + tracking ids, and display the frame
cur_tracks = results[frame_id]
if len(cur_tracks) > 0:
im = draw_boxes(im, cur_tracks, color_map)
# Display image
im = Image.fromarray(im)
IPython.display.display(im)
def play_video(
results: Dict[int, List[TrackingBbox]], input_video: str
) -> None:
"""
Plot the predicted tracks on the input video and display them in the front end as a sequence of images strung together as a video.
Args:
results: dictionary mapping frame id to a list of predicted TrackingBboxes
input_video: path to the input video
"""
results = OrderedDict(sorted(results.items()))
# assign bbox color per id
unique_ids = list(
set([bb.track_id for frame in results.values() for bb in frame])
)
color_map = assign_colors(unique_ids)
# read video and initialize new tracking video
video_reader = decord.VideoReader(input_video)
# set up ipython jupyter display
d_video = IPython.display.display("", display_id=1)
# Read each frame, add bbox+track id, display frame
for frame_idx in range(len(results) - 1):
cur_tracks = results[frame_idx]
im = video_reader.next().asnumpy()
if len(cur_tracks) > 0:
im = draw_boxes(im, cur_tracks, color_map)
f = io.BytesIO()
im = Image.fromarray(im)
im.save(f, "jpeg")
d_video.update(IPython.display.Image(data=f.getvalue()))
sleep(0.000001)
def write_video(
results: Dict[int, List[TrackingBbox]], input_video: str, output_video: str
) -> None:
"""
Plot the predicted tracks on the input video. Write the output to {output_video}.
Args:
results: dictionary mapping frame id to a list of predicted TrackingBboxes
input_video: path to the input video
output_video: path to write out the output video
"""
results = OrderedDict(sorted(results.items()))
# read video and initialize new tracking video
video = cv2.VideoCapture()
video.open(input_video)
im_width = int(video.get(cv2.CAP_PROP_FRAME_WIDTH))
im_height = int(video.get(cv2.CAP_PROP_FRAME_HEIGHT))
fourcc = cv2.VideoWriter_fourcc(*"MP4V")
frame_rate = int(video.get(cv2.CAP_PROP_FPS))
writer = cv2.VideoWriter(
output_video, fourcc, frame_rate, (im_width, im_height)
)
# assign bbox color per id
unique_ids = list(
set([bb.track_id for frame in results.values() for bb in frame])
)
color_map = assign_colors(unique_ids)
# create images and add to video writer, adapted from https://github.com/ZQPei/deep_sort_pytorch
frame_idx = 0
while video.grab():
_, im = video.retrieve()
cur_tracks = results[frame_idx]
if len(cur_tracks) > 0:
im = draw_boxes(im, cur_tracks, color_map)
writer.write(im)
frame_idx += 1
print(f"Output saved to {output_video}.")
def draw_boxes(
@ -14,7 +142,7 @@ def draw_boxes(
cur_tracks: List[TrackingBbox],
color_map: Dict[int, Tuple[int, int, int]],
) -> np.ndarray:
"""
"""
Overlay bbox and id labels onto the frame
Args:
@ -25,7 +153,6 @@ def draw_boxes(
cur_ids = [bb.track_id for bb in cur_tracks]
tracks = dict(zip(cur_ids, cur_tracks))
for label, bb in tracks.items():
left = round(bb.left)
top = round(bb.top)
@ -53,11 +180,11 @@ def draw_boxes(
def assign_colors(id_list: List[int],) -> Dict[int, Tuple[int, int, int]]:
"""
"""
Produce corresponding unique color palettes for unique ids
Args:
id_list: list of track ids
"""
palette = (2 ** 11 - 1, 2 ** 15 - 1, 2 ** 20 - 1)
@ -66,7 +193,7 @@ def assign_colors(id_list: List[int],) -> Dict[int, Tuple[int, int, int]]:
# adapted from https://github.com/ZQPei/deep_sort_pytorch
for i in id_list2:
color = [int((p * ((i + 1) ** 5 - i + 1)) % 255) for p in palette]
color = [int((p * ((i + 1) ** 4 - i + 1)) % 255) for p in palette]
color_list.append(tuple(color))
color_map = dict(zip(id_list, color_list))

View file

@ -83,7 +83,7 @@ class LoadImages: # for inference
class LoadVideo: # for inference
def __init__(self, path, img_size=(1088, 608)):
self.cap = cv2.VideoCapture(path)
self.frame_rate = int(round(self.cap.get(cv2.CAP_PROP_FPS)))
self.vw = int(self.cap.get(cv2.CAP_PROP_FRAME_WIDTH))
@ -94,8 +94,8 @@ class LoadVideo: # for inference
self.height = img_size[1]
self.count = 0
self.w, self.h = self.width, self.height # EDITED
print('Lenth of the video: {:d} frames'.format(self.vn))
# self.w, self.h = 1920, 1080 EDITED
# print('Lenth of the video: {:d} frames'.format(self.vn)) EDITED
def get_size(self, vw, vh, dw, dh):
wa, ha = float(dw) / vw, float(dh) / vh
@ -113,7 +113,7 @@ class LoadVideo: # for inference
# Read image
res, img0 = self.cap.read() # BGR
assert img0 is not None, 'Failed to load frame {:d}'.format(self.count)
img0 = cv2.resize(img0, (self.w, self.h))
img0 = cv2.resize(img0, (self.vw, self.vh)) # EDITED
# Padded resize
img, _, _, _ = letterbox(img0, height=self.height, width=self.width)
@ -399,13 +399,13 @@ class JointDataset(LoadImagesAndLabels): # for training
self.augment = augment
self.transforms = transforms
print('=' * 80)
print('dataset summary')
print(self.tid_num)
print('total # identities:', self.nID)
print('start index')
print(self.tid_start_index)
print('=' * 80)
# print('=' * 80)
# print('dataset summary')
# print(self.tid_num)
# print('total # identities:', self.nID)
# print('start index')
# print(self.tid_start_index)
# print('=' * 80)
def __getitem__(self, files_index):

View file

@ -68,7 +68,6 @@ class STrack(BaseTrack):
self.kalman_filter = kalman_filter
self.track_id = self.next_id()
self.mean, self.covariance = self.kalman_filter.initiate(self.tlwh_to_xyah(self._tlwh))
self.tracklet_len = 0
self.state = TrackState.Tracked
#self.is_activated = True
@ -165,15 +164,21 @@ class STrack(BaseTrack):
class JDETracker(object):
def __init__(self, opt, frame_rate=30):
def __init__(self, opt, frame_rate=30, model=None): # EDITED
self.opt = opt
if opt.gpus[0] >= 0:
opt.device = torch.device('cuda')
else:
opt.device = torch.device('cpu')
print('Creating model...')
self.model = create_model(opt.arch, opt.heads, opt.head_conv)
self.model = load_model(self.model, opt.load_model)
'''
EDITED: only create and load the model from opt.load_model when no model object is passed in
'''
if model is not None:
self.model = model
else:
# print('Creating model...')
self.model = create_model(opt.arch, opt.heads, opt.head_conv)
self.model = load_model(self.model, opt.load_model)
self.model = self.model.to(opt.device)
self.model.eval()
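The edit above lets callers hand an already constructed network to JDETracker instead of reloading weights from opt.load_model each time, which is how TrackingLearner.predict now uses it. A heavily hedged sketch of that pattern (the checkpoint path is a placeholder and opt must be a fully initialized opts object):
```python
from utils_cv.tracking.opts import opts
from utils_cv.tracking.references.fairmot.models.model import create_model, load_model
from utils_cv.tracking.references.fairmot.tracker.multitracker import JDETracker

opt = opts(load_model="./models/all_dla34.pth")  # placeholder checkpoint path

# Build and load the network once...
model = create_model(opt.arch, opt.heads, opt.head_conv)
model = load_model(model, opt.load_model)

# ...then reuse it across trackers without re-reading the weights from disk
tracker = JDETracker(opt, frame_rate=30, model=model)
```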
@ -190,6 +195,7 @@ class JDETracker(object):
self.std = np.array(opt.std, dtype=np.float32).reshape(1, 1, 3)
self.kalman_filter = KalmanFilter()
BaseTrack._count = 0 # EDITED
def post_process(self, dets, meta):
dets = dets.detach().cpu().numpy()