* DOCKER: add Dockerfile

* DOCKER: update dockerfile

* DOCKER: update dockerfile

* DOCKER: path

* DOCKER: add cv docker file

* remove the tracking pipeline yml file

* README updates (#358)

* Updating environment.yml file in master (#323)

* readme updates

* mv media to scenarios folder

* fixes

* Update README.md

* simplification of language, removing redundancy

* added target audience section

* Update SETUP.md

* Update README.md

* Update environment.yml

* Update SETUP.md

* env-update (#359)

* Hyperdrive notebook updates (#356)

All tests are passing (except for unrelated AML deployment notebooks)

* transforms fix (#360)

* Updating environment.yml file in master (#323)

* fix for dataset transformations

* remove extra cython in conda

* pr comments'

* refactor to use transform in class param

* remove todo

* update to transformer

* added functionality to show transformations and updated notebook

* Update FAQ.md

* Adding contrib placeholder (#370)

* DOCKER: update readme

* adding missing lib dir

* add i3d

* Adding hard negative sampling notebook (#367)

* DOCKER: use create instead of update

* Add example gif to action recognition readme (#374)

* code clean up

* add i3d

* code clean up

* add action_recognition README content

* add instructions and headers

* fix conflicts

* DOCKER: remove base env bin path

* save/load detection code for deployment (#380)

* Updating environment.yml file in master (#323)

* save/load

* load/save

* load/save

* remove cython duplicate

* remove comment

* docstring

* tests for loading/saving

* label bug

* Syntax issues on lines 07 & 115 (#378)

* Updating environment.yml file in master (#323)

* update maximum time

* Restore example figures (#357)

* Staging (#365)

* README updates (#358)

* Updating environment.yml file in master (#323)

* readme updates

* mv media to scenarios folder

* fixes

* Update README.md

* simplification of language, removing redundancy

* added target audience section

* Update SETUP.md

* Update README.md

* Update environment.yml

* Update SETUP.md

* env-update (#359)

* Hyperdrive notebook updates (#356)

All tests are passing (except for unrelated AML deployment notebooks)

* Syntax issues on lines 07 & 115

* Update README.md

* Update environment.yml

* detection deploy model.py update (#381)

* Updating environment.yml file in master (#323)

* save/load

* load/save

* load/save

* remove cython duplicate

* remove comment

* docstring

* tests for loading/saving

* label bug

* initial notebook

* minor update to model.py

* revert hns nb

* rm nb

* ap at iou 0.5 (#385)

* Updating environment.yml file in master (#323)

* added ap_iou_05

* remove cython

* bug fix

* windows testing fix and other testing bugs (#383)

* Update azure-pipeline-windows-cpu.yml

* Update azure-pipeline-windows-gpu.yml

* Update test_integration_similarity_notebooks.py

* Update test_detection_notebooks.py

* 00 notebook (#386)

* Updating environment.yml file in master (#323)

* 00 notebook update

* remove extra dependency

* remove cython

* remove typo

* Update 10_image_annotation.ipynb

Added link to Azure annotation tool
This commit is contained in:
JS 2019-11-06 10:39:48 -05:00 committed by GitHub
Parent 49b29de2b6
Commit 3e0631e0dc
No key found matching this signature
GPG key ID: 4AEE18F83AFDEB23
52 changed files with 4294 additions and 708 deletions


@ -1,4 +1,5 @@
# Computer Vision
In recent years, we've seen extraordinary growth in Computer Vision, with applications in face recognition, image understanding, search, drones, mapping, and semi-autonomous and autonomous vehicles. A key part of many of these applications is visual recognition tasks such as image classification, object detection and image similarity.
This repository provides examples and best practice guidelines for building computer vision systems. The goal of this repository is to build a comprehensive set of tools and examples that leverage recent advances in Computer Vision algorithms, neural architectures, and operationalizing such systems. Rather than creating implementations from scratch, we draw from existing state-of-the-art libraries and build additional utility around loading image data, optimizing and evaluating models, and scaling up to the cloud. In addition, having worked in this space for many years, we aim to answer common questions, point out frequently observed pitfalls, and show how to use the cloud for training and deployment.
@ -20,13 +21,17 @@ notebooks in this repo. Once your environment is setup, navigate to the
## Scenarios
The following is a summary of commonly used Computer Vision scenarios that are covered in this repository. For each of these scenarios, we give you the tools to effectively build your own model. This includes simple tasks such as fine-tuning your own model on your own data, to more complex tasks such as hard-negative mining and even model deployment. See all supported scenarios [here](scenarios).
The following is a summary of commonly used Computer Vision scenarios that are covered in this repository. For each of the main scenarios ("base"), we provide the tools to effectively build your own model. This ranges from simple tasks, such as fine-tuning your own model on your own data, to more complex tasks such as hard-negative mining and even model deployment.
| Scenario | Description |
| -------- | ----------- |
| [Classification](scenarios/classification) | Image Classification is a supervised machine learning technique that allows you to learn and predict the category of a given image. |
| [Similarity](scenarios/similarity) | Image Similarity is a way to compute a similarity score given a pair of images. Given an image, it allows you to identify the most similar image in a given dataset. |
| [Detection](scenarios/detection) | Object Detection is a supervised machine learning technique that allows you to detect the bounding box of an object within an image. |
| Scenario | Support | Description |
| -------- | ----------- | ----------- |
| [Classification](scenarios/classification) | Base | Image Classification is a supervised machine learning technique that allows you to learn and predict the category of a given image. |
| [Similarity](scenarios/similarity) | Base | Image Similarity is a way to compute a similarity score given a pair of images. Given an image, it allows you to identify the most similar image in a given dataset. |
| [Detection](scenarios/detection) | Base | Object Detection is a supervised machine learning technique that allows you to detect the bounding box of an object within an image. |
| [Action recognition](contrib/action_recognition) | Contrib | COMING SOON. Action recognition to identify in video/webcam footage what actions are performed (e.g. "running", "opening a bottle") and at what respective start/end times.|
| [Crowd counting](contrib/crowd_counting) | Contrib | COMING SOON. Counting the number of people in low-crowd-density (e.g. less than 10 people) and high-crowd-density (e.g. thousands of people) scenarios.|
We separate the supported CV scenarios into two locations: (i) **base**: code and notebooks within the "utils_cv" and "scenarios" folders, which follow strict coding guidelines and are well tested and maintained; (ii) **contrib**: code and other assets within the "contrib" folder, mainly covering less common CV scenarios using bleeding-edge state-of-the-art approaches. Code in "contrib" is not regularly tested or maintained.
## Computer Vision on Azure
@ -45,7 +50,7 @@ If you need to train your own model, the following services and links provide ad
- [Azure Machine Learning service (AzureML)](https://azure.microsoft.com/en-us/services/machine-learning-service/)
is a service that helps users accelerate the training and deployment of machine learning models. While not specific to computer vision workloads, the AzureML Python SDK can be used for scalable and reliable training and deployment of machine learning solutions to the cloud. We leverage Azure Machine Learning in several of the notebooks within this repository (e.g. [deployment to Azure Kubernetes Service](classification/notebooks/22_deployment_on_azure_kubernetes_service.ipynb))
- [Azure AI Reference architectures](https://docs.microsoft.com/en-us/azure/architecture/reference-architectures/ai/training-python-models)
- [Azure AI Reference architectures](https://docs.microsoft.com/en-us/azure/architecture/reference-architectures/ai/training-python-models)
provide a set of examples (backed by code) of how to build common AI-oriented workloads that leverage multiple cloud components. While not computer vision specific, these reference architectures cover several machine learning workloads such as model deployment or batch scoring.
## Build Status
@ -62,8 +67,8 @@ provide a set of examples (backed by code) of how to build common AI-oriented wo
### AzureML Testing
| Build Type | Branch | Status | | Branch | Status |
| --- | --- | --- | --- | --- | --- |
| Build Type | Branch | Status | | Branch | Status |
| --- | --- | --- | --- | --- | --- |
| **Linux GPU** | master | [![Build Status](https://dev.azure.com/best-practices/computervision/_apis/build/status/azureml/bp-azureml-unit-test-linux-gpu?branchName=master)](https://dev.azure.com/best-practices/computervision/_build/latest?definitionId=41&branchName=master) | | staging | [![Build Status](https://dev.azure.com/best-practices/computervision/_apis/build/status/azureml/bp-azureml-unit-test-linux-gpu?branchName=staging)](https://dev.azure.com/best-practices/computervision/_build/latest?definitionId=41&branchName=staging)|
| **Linux CPU** | master | [![Build Status](https://dev.azure.com/best-practices/computervision/_apis/build/status/azureml/aml-unit-test-linux-cpu?branchName=master)](https://dev.azure.com/best-practices/computervision/_build/latest?definitionId=37&branchName=master) | | staging | [![Build Status](https://dev.azure.com/best-practices/computervision/_apis/build/status/azureml/aml-unit-test-linux-cpu?branchName=staging)](https://dev.azure.com/best-practices/computervision/_build/latest?definitionId=37&branchName=staging)|
| **Notebook unit GPU** | master | [![Build Status](https://dev.azure.com/best-practices/computervision/_apis/build/status/azureml/azureml-unit-test-linux-nb-gpu?branchName=master)](https://dev.azure.com/best-practices/computervision/_build/latest?definitionId=42&branchName=master) | | staging | [![Build Status](https://dev.azure.com/best-practices/computervision/_apis/build/status/azureml/azureml-unit-test-linux-nb-gpu?branchName=staging)](https://dev.azure.com/best-practices/computervision/_build/latest?definitionId=42&branchName=staging) |
@ -73,5 +78,3 @@ provide a set of examples (backed by code) of how to build common AI-oriented wo
## Contributing
This project welcomes contributions and suggestions. Please see our [contribution guidelines](CONTRIBUTING.md).


@ -3,7 +3,7 @@
This document describes how to setup all the dependencies to run the notebooks
in this repository.
Many computer visions scenarios are extremely computationlly heavy. Training a
Many computer vision scenarios are extremely computationally heavy. Training a
model often requires a machine that has a GPU, and would otherwise be too slow.
We recommend using the GPU-enabled [Azure Data Science Virtual Machine (DSVM)](https://azure.microsoft.com/en-us/services/virtual-machines/data-science-virtual-machines/) since it comes prepared with a lot of the prerequisites needed to efficiently do computer vision.
@ -112,10 +112,11 @@ $ssh -L local_port:remote_address:remote_port <username>@<server-ip>
For example, if I want to run `jupyter notebook --port 8888` on my VM and I
wish to run the Jupyter notebooks on my local browser on `localhost:9999`, I
would ssh into my VM using the following commend:
would ssh into my VM using the following command:
```
$ssh -L 9999:localhost:8888 <username>@<server-ip>
```
This command will allow your local machine's port 9999 to access your remote
machine's port 8888.
This command will allow your local machine's port `9999` to access your remote
machine's port `8888`.


@ -7,4 +7,5 @@ Each project should live in its own subdirectory ```/contrib/<project>``` and co
| Directory | Project description |
|---|---|
| vm_builder | This script helps users easily create an Ubuntu Data Science Virtual Machine with a GPU with the Computer Vision repo installed and ready to be used. If you find the script to be out-dated or not working, you can create the VM using the Azure portal or the Azure CLI tool with a few more steps. |
| [Action recognition](action_recognition) | COMING SOON. Action recognition to identify in video/webcam footage what actions are performed (e.g. "running", "opening a bottle") and at what respective start/end times.|
| [vm_builder](vm_builder) | This script helps users easily create an Ubuntu Data Science Virtual Machine with a GPU with the Computer Vision repo installed and ready to be used. If you find the script to be out-dated or not working, you can create the VM using the Azure portal or the Azure CLI tool with a few more steps. |


@ -0,0 +1,66 @@
# Action Recognition
This is a placeholder. Content will follow soon.
![](./media/action_recognition.gif)
*Example of action recognition*
## Overview
| Folders | Description |
| -------- | ----------- |
| [i3d](i3d) | Scripts for fine-tuning a pre-trained Two-Stream Inflated 3D ConvNet (I3D) model on the HMDB-51 dataset |
| [video_annotation](video_annotation) | Instructions and helper functions to annotate the start and end position of actions in video footage|
## Functionality
In [i3d](i3d) we show how to fine-tune a Two-Stream Inflated 3D ConvNet (I3D) model. This model was introduced in \[[1](https://arxiv.org/pdf/1705.07750.pdf)\] and achieved state-of-the-art in action classification on the HMDB-51 and UCF-101 datasets. The paper demonstrated the effectiveness of pre-training action recognition models on large datasets - in this case the Kinetics Human Action Video dataset consisting of 306k examples and 400 classes. We provide code for replicating the results of this paper on HMDB-51. We use models pre-trained on Kinetics from [https://github.com/piergiaj/pytorch-i3d](https://github.com/piergiaj/pytorch-i3d). Evaluating the model on the test set of the HMDB-51 dataset (split 1) using [i3d/test.py](i3d/test.py) should yield the following results:
| Model | Paper top 1 accuracy (average over 3 splits) | Our models top 1 accuracy (split 1 only) |
| ------- | -------| ------- |
| RGB | 74.8 | 73.7 |
| Optical flow | 77.1 | 77.5 |
| Two-Stream | 80.7 | 81.2 |
In order to train an action recognition model for a specific task, annotated training data from the relevant domain is needed. In [video_annotation](video_annotation), we provide tips and examples for how to use a best-in-class video annotation tool ([VGG Image Annotator](http://www.robots.ox.ac.uk/~vgg/software/via/)) to label the start and end positions of actions in videos.
## State-of-the-art
In the tables below, we list datasets which are commonly used and also give an overview of the state-of-the-art. Note that the information below is reasonably exhaustive and should cover most major publications up to 2018. Expect, however, some level of incompleteness and slight inaccuracy (e.g. a publication year being off by plus/minus one year), since the tables below were mainly compiled to give a high-level picture of where the field is and how it has evolved over recent years.
Recommended reading:
- As an introduction to action recognition, the blog [Deep Learning for Videos: A 2018 Guide to Action Recognition](http://blog.qure.ai/notes/deep-learning-for-videos-action-recognition-review).
- [ActionRecognition.net](http://actionrecognition.net/files/dset.php) for the latest state-of-the-art accuracies on popular research benchmark datasets.
- All papers highlighted in yellow in the publications table below.
Popular datasets:
| Name | Year | Number of classes | #Clips | Average length per video | Notes |
| ----- | ----- | ----------------- | ------- | ------------------------- | ----------- |
| KTH | 2004 | 6 | 600 | | |
| Weizmann | 2005 | 9 | 81 | | |
| HMDB-51 | 2011 | 51 | 6.8k | | |
| UCF-101 | 2012 | 101 | 13.3k | 7 sec (min: 1 sec, max: 71 sec) | |
| Sports-1M | 2014 | 487 | 1M | | |
| THUMOS14 | 2014 | 101 | 18k | (total: 254h) | Dataset for temporal action localization |
| ActivityNet | 2015 | 200 | 28.1k | ~1 min 40 sec | |
| Charades | 2016 | 157 | 66.5k from 9848 videos | Each video (not action) is 30 seconds | Daily tasks; classification and temporal localization challenges |
| Youtube-8M | 2016 | 4800 | | | Not an action dataset, but rather a classification one (i.e. what objects occur in each video). Additional videos added in 2018. |
| Kinetics-400 | 2017 | 400 | 306k | 10 sec | |
| Kinetics-600 | 2018 | 600 | 496k | | |
| Something-Something | 2017 | 174 | 110k | 2-6 sec | Low-level actions, e.g. "pushing something left to right". Additional videos added in 2019. |
| AVA | 2018 | 80 | 1.6M in 430 videos | 15 min per video | Each video is 15 min long, with 1 frame annotated per second with the location of each person and, for each person, one of 80 "atomic" actions. People annotations are combined into tracks. |
| Youtube-8M Segments | 2019 | 1000 | 237k | 5 sec | Used for a localization Kaggle challenge. Thought to focus on objects rather than actions. |
Popular publications, with recommended papers to read highlighted in yellow:
<img align="center" src="./media/publications.png"/>
Most publications focus on accuracy rather than on inference speed. The paper "Representation Flow for Action Recognition" is a noteworthy exception with this figure:
<img align="center" src="./media/inference_speeds.png" width = "500"/>
\[1\] J. Carreira and A. Zisserman. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset. In CVPR, 2017.

contrib/action_recognition/i3d/.gitignore (vendored, new file, 7 lines)

@ -0,0 +1,7 @@
__pycache__/
models/__pycache__/
log/
.vscode/
checkpoints/
pretrained_models/
inference/.ipynb_checkpoints/


@ -0,0 +1,61 @@
## Fine-tuning I3D model on HMDB-51
In this section we provide code for training a Two-Stream Inflated 3D ConvNet (I3D), introduced in \[[1](https://arxiv.org/pdf/1705.07750.pdf)\]. The code uses the PyTorch models provided in [https://github.com/piergiaj/pytorch-i3d](https://github.com/piergiaj/pytorch-i3d) - which have been pre-trained on the Kinetics Human Action Video dataset - and fine-tunes the models on the HMDB-51 action recognition dataset. The I3D model consists of two "streams" which are independently trained models. One stream takes the RGB image frames from videos as input and the other stream takes pre-computed optical flow as input. At test time, the outputs of each stream model are averaged to make the final prediction. The model results are as follows:
| Model | Paper top 1 accuracy (average over 3 splits) | Our models top 1 accuracy (split 1 only) |
| ------- | -------| ------- |
| RGB | 74.8 | 73.7 |
| Optical flow | 77.1 | 77.5 |
| Two-Stream | 80.7 | 81.2 |
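As a quick illustration of the two-stream fusion described above (mirroring the score averaging in [test.py](test.py)), here is a minimal sketch; `fuse_two_stream` is just an illustrative helper, and random tensors stand in for real model outputs:
```
# Minimal sketch of two-stream late fusion: average the per-class scores
# of the RGB and flow models, then take the argmax. Random tensors stand
# in for real model outputs here.
import torch

def fuse_two_stream(rgb_logits, flow_logits):
    # Both inputs have shape (batch, num_classes); return their element-wise mean.
    return torch.stack([rgb_logits, flow_logits]).mean(dim=0)

rgb_logits = torch.randn(2, 51)   # e.g. a batch of 2 clips, 51 HMDB-51 classes
flow_logits = torch.randn(2, 51)
predicted_classes = fuse_two_stream(rgb_logits, flow_logits).argmax(dim=1)
print(predicted_classes)
```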
## Download and pre-process HMDB-51 data
Download the HMDB-51 video database from [here](http://serre-lab.clps.brown.edu/resource/hmdb-a-large-human-motion-database/). Extract the videos with
```
mkdir rars && mkdir videos
unrar x hmdb51-org.rar rars/
for a in $(ls rars); do unrar x "rars/${a}" videos/; done;
```
Use the code provided in [https://github.com/yjxiong/temporal-segment-networks](https://github.com/yjxiong/temporal-segment-networks) to preprocess the raw videos: split the videos into RGB frames and compute optical flow frames:
```
git clone https://github.com/yjxiong/temporal-segment-networks
cd temporal-segment-networks
bash scripts/extract_optical_flow.sh /path/to/hmdb51/videos /path/to/rawframes/output
```
Edit the _C.DATASET.DIR option in [default.py](default.py) to point towards the rawframes input data directory.
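Alternatively, the directory can be overridden at runtime via the `update_config` helper defined in [default.py](default.py); a minimal sketch, where the path below is only a placeholder for your own rawframes directory:
```
# Minimal sketch: override DATASET.DIR at runtime instead of editing default.py.
# "/path/to/rawframes/output/" is a placeholder for your own directory.
from default import _C as config, update_config

update_config(config, options=["DATASET.DIR", "/path/to/rawframes/output/"])
print(config.DATASET.DIR)  # -> /path/to/rawframes/output/
```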
## Setup environment
Setup environment
```
conda env create -f environment.yaml
conda activate i3d
```
## Download pretrained models
Download pretrained models
```
bash download_models.sh
```
## Fine-tune pretrained models on HMDB-51
Train RGB model
```
python train.py --cfg config/train_rgb.yaml
```
Train flow model
```
python train.py --cfg config/train_flow.yaml
```
Evaluate combined model
```
python test.py
```
\[1\] J. Carreira and A. Zisserman. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset. In CVPR, 2017.


@ -0,0 +1,4 @@
MODEL:
NAME: "i3d_flow"
TRAIN:
MODALITY: "flow"


@ -0,0 +1,4 @@
MODEL:
NAME: "i3d_rgb"
TRAIN:
MODALITY: "RGB"


@ -0,0 +1,244 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.
# Adapted from https://github.com/feiyunzhang/i3d-non-local-pytorch/blob/master/dataset.py
import torch.utils.data as data
import torch
from PIL import Image
import os
import os.path
import numpy as np
from numpy.random import randint
from pathlib import Path
import torchvision
from torchvision import datasets, transforms
from videotransforms import (
GroupRandomCrop, GroupRandomHorizontalFlip,
GroupScale, GroupCenterCrop, GroupNormalize, Stack
)
from itertools import cycle
class VideoRecord(object):
def __init__(self, row):
self._data = row
@property
def path(self):
return self._data[0]
@property
def num_frames(self):
return int(
len([x for x in Path(
self._data[0]).glob('img_*')])-1)
@property
def label(self):
return int(self._data[1])
class I3DDataSet(data.Dataset):
def __init__(self, data_root, split=1, sample_frames=64,
modality='RGB', transform=lambda x:x,
train_mode=True, sample_frames_at_test=False):
self.data_root = data_root
self.split = split
self.sample_frames = sample_frames
self.modality = modality
self.transform = transform
self.train_mode = train_mode
self.sample_frames_at_test = sample_frames_at_test
self._parse_split_files()
def _parse_split_files(self):
# class labels assigned by sorting the file names in /data/hmdb51_splits directory
file_list = sorted(Path('./data/hmdb51_splits').glob('*'+str(self.split)+'.txt'))
video_list = []
for class_idx, f in enumerate(file_list):
class_name = str(f).strip().split('/')[2][:-16]
for line in open(f):
tokens = line.strip().split(' ')
video_path = self.data_root+class_name+'/'+tokens[0][:-4]
record = (video_path, class_idx)
# 1 indicates video should be in training set
if self.train_mode & (tokens[-1] == '1'):
video_list.append(VideoRecord(record))
# 2 indicates video should be in test set
elif (self.train_mode == False) & (tokens[-1] == '2'):
video_list.append(VideoRecord(record))
self.video_list = video_list
def _load_image(self, directory, idx):
if self.modality == 'RGB':
img_path = os.path.join(directory, 'img_{:05}.jpg'.format(idx))
try:
img = Image.open(img_path).convert('RGB')
except:
print("Couldn't load image:{}".format(img_path))
return None
return img
else:
try:
img_path = os.path.join(directory, 'flow_x_{:05}.jpg'.format(idx))
x_img = Image.open(img_path).convert('L')
except:
print("Couldn't load image:{}".format(img_path))
return None
try:
img_path = os.path.join(directory, 'flow_y_{:05}.jpg'.format(idx))
y_img = Image.open(img_path).convert('L')
except:
print("Couldn't load image:{}".format(img_path))
return None
# Combine flow images into single PIL image
x_img = np.array(x_img, dtype=np.float32)
y_img = np.array(y_img, dtype=np.float32)
img = np.asarray([x_img, y_img]).transpose([1, 2, 0])
img = Image.fromarray(img.astype('uint8'))
return img
def _sample_indices(self, record):
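        # Randomly pick a contiguous window of `sample_frames` frames; if the clip
        # is shorter, cycle its frames until the window length is reached.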
if record.num_frames > self.sample_frames:
start_pos = randint(record.num_frames - self.sample_frames + 1)
indices = range(start_pos, start_pos + self.sample_frames, 1)
else:
indices = [x for x in range(record.num_frames)]
if len(indices) < self.sample_frames:
self._loop_indices(indices)
return indices
def _loop_indices(self, indices):
indices_cycle = cycle(indices)
while len(indices) < self.sample_frames:
indices.append(next(indices_cycle))
def __getitem__(self, index):
record = self.video_list[index]
# Sample frames from the video for training, or if sampling
# turned on at test time
if self.train_mode or self.sample_frames_at_test:
segment_indices = self._sample_indices(record)
else:
segment_indices = [i for i in range(record.num_frames)]
# Image files are 1-indexed
segment_indices = [i+1 for i in segment_indices]
# Get video frame images
images = []
for i in segment_indices:
seg_img = self._load_image(record.path, i)
if seg_img is None:
raise ValueError("Couldn't load", record.path, i)
images.append(seg_img)
# Apply transformations
transformed_images = self.transform(images)
return transformed_images, record.label
def __len__(self):
return len(self.video_list)
if __name__ == '__main__':
input_size = 224
resize_small_edge = 256
train_rgb = I3DDataSet(
data_root='/datadir/rawframes/',
split=1,
sample_frames = 64,
modality='RGB',
train_mode=True,
sample_frames_at_test=False,
transform=torchvision.transforms.Compose([
GroupScale(resize_small_edge),
GroupRandomCrop(input_size),
GroupRandomHorizontalFlip(),
GroupNormalize(modality="RGB"),
Stack(),
])
)
item = train_rgb.__getitem__(10)
print("train_rgb:")
print(item[0].size())
print("max=", item[0].max())
print("min=", item[0].min())
print("label=",item[1])
val_rgb = I3DDataSet(
data_root='/datadir/rawframes/',
split=1,
sample_frames = 64,
modality='RGB',
train_mode=False,
sample_frames_at_test=False,
transform=torchvision.transforms.Compose([
GroupScale(resize_small_edge),
GroupCenterCrop(input_size),
GroupNormalize(modality="RGB"),
Stack(),
])
)
item = val_rgb.__getitem__(10)
print("val_rgb:")
print(item[0].size())
print("max=", item[0].max())
print("min=", item[0].min())
print("label=",item[1])
train_flow = I3DDataSet(
data_root='/datadir/rawframes/',
split=1,
sample_frames = 64,
modality='flow',
train_mode=True,
sample_frames_at_test=False,
transform=torchvision.transforms.Compose([
GroupScale(resize_small_edge),
GroupRandomCrop(input_size),
GroupRandomHorizontalFlip(),
GroupNormalize(modality="flow"),
Stack(),
])
)
item = train_flow.__getitem__(100)
print("train_flow:")
print(item[0].size())
print("max=", item[0].max())
print("min=", item[0].min())
print("label=",item[1])
val_flow = I3DDataSet(
data_root='/datadir/rawframes/',
split=1,
sample_frames = 64,
modality='flow',
train_mode=False,
sample_frames_at_test=False,
transform=torchvision.transforms.Compose([
GroupScale(resize_small_edge),
GroupCenterCrop(input_size),
GroupNormalize(modality="flow"),
Stack(),
])
)
item = val_flow.__getitem__(100)
print("val_flow:")
print(item[0].size())
print("max=", item[0].max())
print("min=", item[0].min())
print("label=",item[1])


@ -0,0 +1,73 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import os
from yacs.config import CfgNode as CN
_C = CN()
_C.LOG_DIR = "log"
_C.WORKERS = 16
_C.PIN_MEMORY = True
_C.SEED = 42
# Cudnn related params
_C.CUDNN = CN()
_C.CUDNN.BENCHMARK = True
# Dataset
_C.DATASET = CN()
_C.DATASET.SPLIT = 1
_C.DATASET.DIR = "/datadir/rawframes/"
_C.DATASET.NUM_CLASSES = 51
# NETWORK
_C.MODEL = CN()
_C.MODEL.NAME = "i3d_flow"
_C.MODEL.PRETRAINED_RGB = "pretrained_models/rgb_imagenet_kinetics.pt"
_C.MODEL.PRETRAINED_FLOW = "pretrained_models/flow_imagenet_kinetics.pt"
_C.MODEL.CHECKPOINT_DIR = "checkpoints"
# Train
_C.TRAIN = CN()
_C.TRAIN.PRINT_FREQ = 50
_C.TRAIN.INPUT_SIZE = 224
_C.TRAIN.RESIZE_MIN = 256
_C.TRAIN.SAMPLE_FRAMES = 64
_C.TRAIN.MODALITY = "flow"
_C.TRAIN.BATCH_SIZE = 24
_C.TRAIN.GRAD_ACCUM_STEPS = 4
_C.TRAIN.MAX_EPOCHS = 50
# Test
_C.TEST = CN()
_C.TEST.EVAL_FREQ = 5
_C.TEST.PRINT_FREQ = 250
_C.TEST.BATCH_SIZE = 1
_C.TEST.MODALITY = "combined"
_C.TEST.MODEL_RGB = "pretrained_models/rgb_hmdb_split1.pt"
_C.TEST.MODEL_FLOW = "pretrained_models/flow_hmdb_split1.pt"
def update_config(cfg, options=None, config_file=None):
cfg.defrost()
if config_file:
cfg.merge_from_file(config_file)
if options:
cfg.merge_from_list(options)
cfg.freeze()
if __name__ == "__main__":
import sys
with open(sys.argv[1], "w") as f:
print(_C, file=f)


@ -0,0 +1,10 @@
#!/usr/bin/env bash
mkdir -p pretrained_models
wget https://har.blob.core.windows.net/i3dmodels/flow_hmdb_split1.pt
wget https://har.blob.core.windows.net/i3dmodels/rgb_hmdb_split1.pt
wget https://har.blob.core.windows.net/i3dmodels/flow_imagenet_kinetics.pt
wget https://har.blob.core.windows.net/i3dmodels/rgb_imagenet_kinetics.pt
mv flow_hmdb_split1.pt pretrained_models/flow_hmdb_split1.pt
mv rgb_hmdb_split1.pt pretrained_models/rgb_hmdb_split1.pt
mv flow_imagenet_kinetics.pt pretrained_models/flow_imagenet_kinetics.pt
mv rgb_imagenet_kinetics.pt pretrained_models/rgb_imagenet_kinetics.pt


@ -0,0 +1,20 @@
name: i3d
dependencies:
- python=3.6.2
- pandas
- numpy
- ipykernel
- matplotlib
- pip:
- torch==1.2.0
- torchvision
- pillow
- fire
- tensorboardX
- tensorboard
- yacs
- opencv-contrib-python-headless
channels:
- conda-forge
- anaconda


@ -0,0 +1,97 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.
from pathlib import Path
from PIL import Image
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision
from torchvision import datasets, transforms
from videotransforms import (
GroupScale, GroupCenterCrop, GroupNormalize, Stack
)
from models.pytorch_i3d import InceptionI3d
from dataset import I3DDataSet
from test import load_model
from default import _C as config
def load_image(frame_file):
try:
img = Image.open(frame_file).convert('RGB')
return img
except:
print("Couldn't load image:{}".format(frame_file))
return None
def load_frames(frame_paths):
frame_list = []
for frame in frame_paths:
frame_list.append(load_image(frame))
return frame_list
def construct_input(frame_list):
transform = torchvision.transforms.Compose([
GroupScale(config.TRAIN.RESIZE_MIN),
GroupCenterCrop(config.TRAIN.INPUT_SIZE),
GroupNormalize(modality="RGB"),
Stack(),
])
process_data = transform(frame_list)
return process_data.unsqueeze(0)
def predict_input(model, input):
input = input.cuda(non_blocking=True)
output = model(input)
output = torch.mean(output, dim=2)
return output
def predict_over_video(video_frame_list, window_width=9, stride=1):
if window_width < 9:
raise ValueError("window_width must be 9 or greater")
print("Loading model...")
model = load_model(
modality="RGB",
        state_dict_file="pretrained_models/rgb_hmdb_split1.pt"
)
model.eval()
print("Predicting actions over {0} frames".format(len(video_frame_list)))
with torch.no_grad():
window_count = 0
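        # Slide a window of `window_width` frames across the video in steps of
        # `stride` frames and score each window independently.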
for i in range(stride+window_width-1, len(video_frame_list), stride):
window_frame_list = [video_frame_list[j] for j in range(i-window_width, i)]
frames = load_frames(window_frame_list)
batch = construct_input(frames)
window_predictions = predict_input(model, batch)
window_proba = F.softmax(window_predictions, dim=1)
window_top_pred = window_proba.max(1)
print(("Window:{0} Class pred:{1} Class proba:{2}".format(
window_count,
window_top_pred.indices.cpu().numpy()[0],
window_top_pred.values.cpu().numpy()[0])
))
window_count += 1
if __name__ == "__main__":
# Provide list of filepaths to video frames
frame_paths = []
    predict_over_video(frame_paths, window_width=64, stride=32)


@ -0,0 +1,40 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.
# From https://github.com/feiyunzhang/i3d-non-local-pytorch/blob/master/main.py
import torch
class AverageMeter(object):
"""Computes and stores the average and current value"""
def __init__(self):
self.reset()
def reset(self):
self.val = 0
self.avg = 0
self.sum = 0
self.count = 0
def update(self, val, n=1):
self.val = val
self.sum += val * n
self.count += n
self.avg = self.sum / self.count
def accuracy(output, target, topk=(1,)):
"""Computes the accuracy over the k top predictions for the specified values of k"""
with torch.no_grad():
maxk = max(topk)
batch_size = target.size(0)
_, pred = output.topk(maxk, 1, True, True)
pred = pred.t()
correct = pred.eq(target.view(1, -1).expand_as(pred))
res = []
for k in topk:
correct_k = correct[:k].view(-1).float().sum(0, keepdim=True)
res.append(correct_k.mul_(100.0 / batch_size))
return res


@ -0,0 +1,338 @@
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable
import numpy as np
import os
import sys
from collections import OrderedDict
class MaxPool3dSamePadding(nn.MaxPool3d):
def compute_pad(self, dim, s):
if s % self.stride[dim] == 0:
return max(self.kernel_size[dim] - self.stride[dim], 0)
else:
return max(self.kernel_size[dim] - (s % self.stride[dim]), 0)
def forward(self, x):
# compute 'same' padding
(batch, channel, t, h, w) = x.size()
#print t,h,w
out_t = np.ceil(float(t) / float(self.stride[0]))
out_h = np.ceil(float(h) / float(self.stride[1]))
out_w = np.ceil(float(w) / float(self.stride[2]))
#print out_t, out_h, out_w
pad_t = self.compute_pad(0, t)
pad_h = self.compute_pad(1, h)
pad_w = self.compute_pad(2, w)
#print pad_t, pad_h, pad_w
pad_t_f = pad_t // 2
pad_t_b = pad_t - pad_t_f
pad_h_f = pad_h // 2
pad_h_b = pad_h - pad_h_f
pad_w_f = pad_w // 2
pad_w_b = pad_w - pad_w_f
pad = (pad_w_f, pad_w_b, pad_h_f, pad_h_b, pad_t_f, pad_t_b)
#print x.size()
#print pad
x = F.pad(x, pad)
return super(MaxPool3dSamePadding, self).forward(x)
class Unit3D(nn.Module):
def __init__(self, in_channels,
output_channels,
kernel_shape=(1, 1, 1),
stride=(1, 1, 1),
padding=0,
activation_fn=F.relu,
use_batch_norm=True,
use_bias=False,
name='unit_3d'):
"""Initializes Unit3D module."""
super(Unit3D, self).__init__()
self._output_channels = output_channels
self._kernel_shape = kernel_shape
self._stride = stride
self._use_batch_norm = use_batch_norm
self._activation_fn = activation_fn
self._use_bias = use_bias
self.name = name
self.padding = padding
self.conv3d = nn.Conv3d(in_channels=in_channels,
out_channels=self._output_channels,
kernel_size=self._kernel_shape,
stride=self._stride,
padding=0, # we always want padding to be 0 here. We will dynamically pad based on input size in forward function
bias=self._use_bias)
if self._use_batch_norm:
self.bn = nn.BatchNorm3d(self._output_channels, eps=0.001, momentum=0.01)
def compute_pad(self, dim, s):
if s % self._stride[dim] == 0:
return max(self._kernel_shape[dim] - self._stride[dim], 0)
else:
return max(self._kernel_shape[dim] - (s % self._stride[dim]), 0)
def forward(self, x):
# compute 'same' padding
(batch, channel, t, h, w) = x.size()
#print t,h,w
out_t = np.ceil(float(t) / float(self._stride[0]))
out_h = np.ceil(float(h) / float(self._stride[1]))
out_w = np.ceil(float(w) / float(self._stride[2]))
#print out_t, out_h, out_w
pad_t = self.compute_pad(0, t)
pad_h = self.compute_pad(1, h)
pad_w = self.compute_pad(2, w)
#print pad_t, pad_h, pad_w
pad_t_f = pad_t // 2
pad_t_b = pad_t - pad_t_f
pad_h_f = pad_h // 2
pad_h_b = pad_h - pad_h_f
pad_w_f = pad_w // 2
pad_w_b = pad_w - pad_w_f
pad = (pad_w_f, pad_w_b, pad_h_f, pad_h_b, pad_t_f, pad_t_b)
#print x.size()
#print pad
x = F.pad(x, pad)
#print x.size()
x = self.conv3d(x)
if self._use_batch_norm:
x = self.bn(x)
if self._activation_fn is not None:
x = self._activation_fn(x)
return x
class InceptionModule(nn.Module):
def __init__(self, in_channels, out_channels, name):
super(InceptionModule, self).__init__()
self.b0 = Unit3D(in_channels=in_channels, output_channels=out_channels[0], kernel_shape=[1, 1, 1], padding=0,
name=name+'/Branch_0/Conv3d_0a_1x1')
self.b1a = Unit3D(in_channels=in_channels, output_channels=out_channels[1], kernel_shape=[1, 1, 1], padding=0,
name=name+'/Branch_1/Conv3d_0a_1x1')
self.b1b = Unit3D(in_channels=out_channels[1], output_channels=out_channels[2], kernel_shape=[3, 3, 3],
name=name+'/Branch_1/Conv3d_0b_3x3')
self.b2a = Unit3D(in_channels=in_channels, output_channels=out_channels[3], kernel_shape=[1, 1, 1], padding=0,
name=name+'/Branch_2/Conv3d_0a_1x1')
self.b2b = Unit3D(in_channels=out_channels[3], output_channels=out_channels[4], kernel_shape=[3, 3, 3],
name=name+'/Branch_2/Conv3d_0b_3x3')
self.b3a = MaxPool3dSamePadding(kernel_size=[3, 3, 3],
stride=(1, 1, 1), padding=0)
self.b3b = Unit3D(in_channels=in_channels, output_channels=out_channels[5], kernel_shape=[1, 1, 1], padding=0,
name=name+'/Branch_3/Conv3d_0b_1x1')
self.name = name
def forward(self, x):
b0 = self.b0(x)
b1 = self.b1b(self.b1a(x))
b2 = self.b2b(self.b2a(x))
b3 = self.b3b(self.b3a(x))
return torch.cat([b0,b1,b2,b3], dim=1)
class InceptionI3d(nn.Module):
"""Inception-v1 I3D architecture.
The model is introduced in:
Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset
Joao Carreira, Andrew Zisserman
https://arxiv.org/pdf/1705.07750v1.pdf.
See also the Inception architecture, introduced in:
Going deeper with convolutions
Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed,
Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, Andrew Rabinovich.
http://arxiv.org/pdf/1409.4842v1.pdf.
"""
# Endpoints of the model in order. During construction, all the endpoints up
# to a designated `final_endpoint` are returned in a dictionary as the
# second return value.
VALID_ENDPOINTS = (
'Conv3d_1a_7x7',
'MaxPool3d_2a_3x3',
'Conv3d_2b_1x1',
'Conv3d_2c_3x3',
'MaxPool3d_3a_3x3',
'Mixed_3b',
'Mixed_3c',
'MaxPool3d_4a_3x3',
'Mixed_4b',
'Mixed_4c',
'Mixed_4d',
'Mixed_4e',
'Mixed_4f',
'MaxPool3d_5a_2x2',
'Mixed_5b',
'Mixed_5c',
'Logits',
'Predictions',
)
def __init__(self, num_classes=400, spatial_squeeze=True,
final_endpoint='Logits', name='inception_i3d', in_channels=3, dropout_keep_prob=0.5):
"""Initializes I3D model instance.
Args:
num_classes: The number of outputs in the logit layer (default 400, which
matches the Kinetics dataset).
spatial_squeeze: Whether to squeeze the spatial dimensions for the logits
before returning (default True).
final_endpoint: The model contains many possible endpoints.
`final_endpoint` specifies the last endpoint for the model to be built
up to. In addition to the output at `final_endpoint`, all the outputs
at endpoints up to `final_endpoint` will also be returned, in a
dictionary. `final_endpoint` must be one of
InceptionI3d.VALID_ENDPOINTS (default 'Logits').
name: A string (optional). The name of this module.
Raises:
ValueError: if `final_endpoint` is not recognized.
"""
if final_endpoint not in self.VALID_ENDPOINTS:
raise ValueError('Unknown final endpoint %s' % final_endpoint)
super(InceptionI3d, self).__init__()
self._num_classes = num_classes
self._spatial_squeeze = spatial_squeeze
self._final_endpoint = final_endpoint
self.logits = None
if self._final_endpoint not in self.VALID_ENDPOINTS:
raise ValueError('Unknown final endpoint %s' % self._final_endpoint)
self.end_points = {}
end_point = 'Conv3d_1a_7x7'
self.end_points[end_point] = Unit3D(in_channels=in_channels, output_channels=64, kernel_shape=[7, 7, 7],
stride=(2, 2, 2), padding=(3,3,3), name=name+end_point)
if self._final_endpoint == end_point: return
end_point = 'MaxPool3d_2a_3x3'
self.end_points[end_point] = MaxPool3dSamePadding(kernel_size=[1, 3, 3], stride=(1, 2, 2),
padding=0)
if self._final_endpoint == end_point: return
end_point = 'Conv3d_2b_1x1'
self.end_points[end_point] = Unit3D(in_channels=64, output_channels=64, kernel_shape=[1, 1, 1], padding=0,
name=name+end_point)
if self._final_endpoint == end_point: return
end_point = 'Conv3d_2c_3x3'
self.end_points[end_point] = Unit3D(in_channels=64, output_channels=192, kernel_shape=[3, 3, 3], padding=1,
name=name+end_point)
if self._final_endpoint == end_point: return
end_point = 'MaxPool3d_3a_3x3'
self.end_points[end_point] = MaxPool3dSamePadding(kernel_size=[1, 3, 3], stride=(1, 2, 2),
padding=0)
if self._final_endpoint == end_point: return
end_point = 'Mixed_3b'
self.end_points[end_point] = InceptionModule(192, [64,96,128,16,32,32], name+end_point)
if self._final_endpoint == end_point: return
end_point = 'Mixed_3c'
self.end_points[end_point] = InceptionModule(256, [128,128,192,32,96,64], name+end_point)
if self._final_endpoint == end_point: return
end_point = 'MaxPool3d_4a_3x3'
self.end_points[end_point] = MaxPool3dSamePadding(kernel_size=[3, 3, 3], stride=(2, 2, 2),
padding=0)
if self._final_endpoint == end_point: return
end_point = 'Mixed_4b'
self.end_points[end_point] = InceptionModule(128+192+96+64, [192,96,208,16,48,64], name+end_point)
if self._final_endpoint == end_point: return
end_point = 'Mixed_4c'
self.end_points[end_point] = InceptionModule(192+208+48+64, [160,112,224,24,64,64], name+end_point)
if self._final_endpoint == end_point: return
end_point = 'Mixed_4d'
self.end_points[end_point] = InceptionModule(160+224+64+64, [128,128,256,24,64,64], name+end_point)
if self._final_endpoint == end_point: return
end_point = 'Mixed_4e'
self.end_points[end_point] = InceptionModule(128+256+64+64, [112,144,288,32,64,64], name+end_point)
if self._final_endpoint == end_point: return
end_point = 'Mixed_4f'
self.end_points[end_point] = InceptionModule(112+288+64+64, [256,160,320,32,128,128], name+end_point)
if self._final_endpoint == end_point: return
end_point = 'MaxPool3d_5a_2x2'
self.end_points[end_point] = MaxPool3dSamePadding(kernel_size=[2, 2, 2], stride=(2, 2, 2),
padding=0)
if self._final_endpoint == end_point: return
end_point = 'Mixed_5b'
self.end_points[end_point] = InceptionModule(256+320+128+128, [256,160,320,32,128,128], name+end_point)
if self._final_endpoint == end_point: return
end_point = 'Mixed_5c'
self.end_points[end_point] = InceptionModule(256+320+128+128, [384,192,384,48,128,128], name+end_point)
if self._final_endpoint == end_point: return
end_point = 'Logits'
self.avg_pool = nn.AvgPool3d(kernel_size=[2, 7, 7],
stride=(1, 1, 1))
self.dropout = nn.Dropout(dropout_keep_prob)
self.logits = Unit3D(in_channels=384+384+128+128, output_channels=self._num_classes,
kernel_shape=[1, 1, 1],
padding=0,
activation_fn=None,
use_batch_norm=False,
use_bias=True,
name='logits')
self.build()
def replace_logits(self, num_classes):
self._num_classes = num_classes
self.logits = Unit3D(in_channels=384+384+128+128, output_channels=self._num_classes,
kernel_shape=[1, 1, 1],
padding=0,
activation_fn=None,
use_batch_norm=False,
use_bias=True,
name='logits')
def build(self):
for k in self.end_points.keys():
self.add_module(k, self.end_points[k])
def forward(self, x):
for end_point in self.VALID_ENDPOINTS:
if end_point in self.end_points:
x = self._modules[end_point](x) # use _modules to work with dataparallel
x = self.logits(self.dropout(self.avg_pool(x)))
if self._spatial_squeeze:
logits = x.squeeze(3).squeeze(3)
# logits is batch X time X classes, which is what we want to work with
return logits
def extract_features(self, x):
for end_point in self.VALID_ENDPOINTS:
if end_point in self.end_points:
x = self._modules[end_point](x)
return self.avg_pool(x)


@ -0,0 +1,170 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.
import os
import time
import sys
import numpy as np
import fire
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision
from torchvision import datasets, transforms
from videotransforms import (
GroupScale, GroupCenterCrop, GroupNormalize, Stack
)
from models.pytorch_i3d import InceptionI3d
from metrics import accuracy, AverageMeter
from dataset import I3DDataSet
from default import _C as config
from default import update_config
# to work with vscode debugger https://github.com/joblib/joblib/issues/864
import multiprocessing
multiprocessing.set_start_method('spawn', True)
def load_model(modality, state_dict_file):
channels = 3 if modality == "RGB" else 2
model = InceptionI3d(config.DATASET.NUM_CLASSES, in_channels=channels)
state_dict = torch.load(state_dict_file)
model.load_state_dict(state_dict)
model = model.cuda()
return model
def test(model, test_loader, modality):
model.eval()
target_list = []
predictions_list = []
with torch.no_grad():
end = time.time()
for step, (input, target) in enumerate(test_loader):
target_list.append(target)
input = input.cuda(non_blocking=True)
# compute output
output = model(input)
output = torch.mean(output, dim=2)
predictions_list.append(output)
if step % config.TEST.PRINT_FREQ == 0:
print(('Step: [{0}/{1}]'.format(step, len(test_loader))))
targets = torch.cat(target_list)
predictions = torch.cat(predictions_list)
return targets, predictions
def run(*options, cfg=None):
update_config(config, options=options, config_file=cfg)
torch.backends.cudnn.benchmark = config.CUDNN.BENCHMARK
if torch.cuda.is_available():
torch.cuda.manual_seed_all(config.SEED)
np.random.seed(seed=config.SEED)
    # Setup Augmentation/Transformation pipeline
input_size = config.TRAIN.INPUT_SIZE
resize_range_min = config.TRAIN.RESIZE_MIN
# Data-parallel
devices_lst = list(range(torch.cuda.device_count()))
print("Devices {}".format(devices_lst))
if (config.TEST.MODALITY == "RGB") or (config.TEST.MODALITY == "combined"):
rgb_loader = torch.utils.data.DataLoader(
I3DDataSet(
data_root=config.DATASET.DIR,
split=config.DATASET.SPLIT,
modality="RGB",
train_mode=False,
sample_frames_at_test=False,
transform=torchvision.transforms.Compose([
GroupScale(config.TRAIN.RESIZE_MIN),
GroupCenterCrop(config.TRAIN.INPUT_SIZE),
GroupNormalize(modality="RGB"),
Stack(),
])
),
batch_size=config.TEST.BATCH_SIZE,
shuffle=False,
num_workers=config.WORKERS,
pin_memory=config.PIN_MEMORY
)
rgb_model_file = config.TEST.MODEL_RGB
if not os.path.exists(rgb_model_file):
raise FileNotFoundError(rgb_model_file, " does not exist")
rgb_model = load_model(modality="RGB", state_dict_file=rgb_model_file)
print("scoring with rgb model")
targets, rgb_predictions = test(rgb_model, rgb_loader, "RGB")
del rgb_model
targets = targets.cuda(non_blocking=True)
rgb_top1_accuracy = accuracy(rgb_predictions, targets, topk=(1, ))
print("rgb top1 accuracy: ", rgb_top1_accuracy[0].cpu().numpy().tolist())
if (config.TEST.MODALITY == "flow") or (config.TEST.MODALITY == "combined"):
flow_loader = torch.utils.data.DataLoader(
I3DDataSet(
data_root=config.DATASET.DIR,
split=config.DATASET.SPLIT,
modality="flow",
train_mode=False,
sample_frames_at_test=False,
transform=torchvision.transforms.Compose([
GroupScale(config.TRAIN.RESIZE_MIN),
GroupCenterCrop(config.TRAIN.INPUT_SIZE),
GroupNormalize(modality="flow"),
Stack(),
])
),
batch_size=config.TEST.BATCH_SIZE,
shuffle=False,
num_workers=config.WORKERS,
pin_memory=config.PIN_MEMORY
)
flow_model_file = config.TEST.MODEL_FLOW
if not os.path.exists(flow_model_file):
raise FileNotFoundError(flow_model_file, " does not exist")
flow_model = load_model(modality="flow", state_dict_file=flow_model_file)
print("scoring with flow model")
targets, flow_predictions = test(flow_model, flow_loader, "flow")
del flow_model
targets = targets.cuda(non_blocking=True)
flow_top1_accuracy = accuracy(flow_predictions, targets, topk=(1, ))
print("flow top1 accuracy: ", flow_top1_accuracy[0].cpu().numpy().tolist())
if config.TEST.MODALITY == "combined":
predictions = torch.stack([rgb_predictions, flow_predictions])
predictions_mean = torch.mean(predictions, dim=0)
top1accuracy = accuracy(predictions_mean, targets, topk=(1, ))
print("combined top1 accuracy: ", top1accuracy[0].cpu().numpy().tolist())
if __name__ == "__main__":
fire.Fire(run)


@ -0,0 +1,278 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.
import os
import time
import sys
import numpy as np
import fire
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.optim import lr_scheduler
from torch.autograd import Variable
import torchvision
from torchvision import datasets, transforms
from tensorboardX import SummaryWriter
from default import _C as config
from default import update_config
from videotransforms import (
GroupRandomCrop, GroupRandomHorizontalFlip,
GroupScale, GroupCenterCrop, GroupNormalize, Stack
)
from models.pytorch_i3d import InceptionI3d
from metrics import accuracy, AverageMeter
from dataset import I3DDataSet
# to work with vscode debugger https://github.com/joblib/joblib/issues/864
import multiprocessing
multiprocessing.set_start_method('spawn', True)
def train(train_loader, model, criterion, optimizer, epoch, writer=None):
batch_time = AverageMeter()
data_time = AverageMeter()
losses = AverageMeter()
top1 = AverageMeter()
top5 = AverageMeter()
# switch to train mode
model.train()
end = time.time()
for step, (input, target) in enumerate(train_loader):
# measure data loading time
data_time.update(time.time() - end)
input = input.cuda(non_blocking=True)
target = target.cuda(non_blocking=True)
# compute output
output = model(input)
output = torch.mean(output, dim=2)
loss = criterion(output, target)
# measure accuracy and record loss
prec1, prec5 = accuracy(output, target, topk=(1,5))
losses.update(loss.item(), input.size(0))
top1.update(prec1[0], input.size(0))
top5.update(prec5[0], input.size(0))
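        # Gradient accumulation: scale the loss and only step the optimizer every
        # GRAD_ACCUM_STEPS batches to emulate a larger effective batch size.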
loss = loss / config.TRAIN.GRAD_ACCUM_STEPS
loss.backward()
if step % config.TRAIN.GRAD_ACCUM_STEPS == 0:
optimizer.step()
optimizer.zero_grad()
# measure elapsed time
batch_time.update(time.time() - end)
end = time.time()
if step % config.TRAIN.PRINT_FREQ == 0:
print(('Epoch: [{0}][{1}/{2}], lr: {lr:.5f}\t'
'Time {batch_time.val:.3f} ({batch_time.avg:.3f})\t'
'Data {data_time.val:.3f} ({data_time.avg:.3f})\t'
'Loss {loss.val:.4f} ({loss.avg:.4f})\t'
'Prec@1 {top1.val:.3f} ({top1.avg:.3f})\t'
'Prec@5 {top5.val:.3f} ({top5.avg:.3f})'.format(
epoch, step, len(train_loader), batch_time=batch_time,
data_time=data_time, loss=losses, top1=top1, top5=top5, lr=optimizer.param_groups[-1]['lr'])))
if writer:
writer.add_scalar('train/loss', losses.avg, epoch+1)
writer.add_scalar('train/top1', top1.avg, epoch+1)
writer.add_scalar('train/top5', top5.avg, epoch+1)
def validate(val_loader, model, criterion, epoch, writer=None):
batch_time = AverageMeter()
losses = AverageMeter()
top1 = AverageMeter()
top5 = AverageMeter()
# switch to evaluate mode
model.eval()
with torch.no_grad():
end = time.time()
for step, (input, target) in enumerate(val_loader):
input = input.cuda(non_blocking=True)
target = target.cuda(non_blocking=True)
# compute output
output = model(input)
output = torch.mean(output, dim=2)
loss = criterion(output, target)
# measure accuracy and record loss
prec1, prec5 = accuracy(output, target, topk=(1,5))
losses.update(loss.item(), input.size(0))
top1.update(prec1[0], input.size(0))
top5.update(prec5[0], input.size(0))
# measure elapsed time
batch_time.update(time.time() - end)
end = time.time()
if step % config.TEST.PRINT_FREQ == 0:
print(('Test: [{0}/{1}]\t'
'Time {batch_time.val:.3f} ({batch_time.avg:.3f})\t'
'Loss {loss.val:.4f} ({loss.avg:.4f})\t'
'Prec@1 {top1.val:.3f} ({top1.avg:.3f})\t'
'Prec@5 {top5.val:.3f} ({top5.avg:.3f})'.format(
step, len(val_loader), batch_time=batch_time, loss=losses,
top1=top1, top5=top5)))
print(('Testing Results: Prec@1 {top1.avg:.3f} Prec@5 {top5.avg:.3f} Loss {loss.avg:.5f}'
.format(top1=top1, top5=top5, loss=losses)))
if writer:
writer.add_scalar('val/loss', losses.avg, epoch+1)
writer.add_scalar('val/top1', top1.avg, epoch+1)
writer.add_scalar('val/top5', top5.avg, epoch+1)
return losses.avg
def run(*options, cfg=None):
"""Run training and validation of model
Notes:
Options can be passed in via the options argument and loaded from the cfg file
Options loaded from default.py will be overridden by options loaded from cfg file
Options passed in through options argument will override option loaded from cfg file
Args:
*options (str,int ,optional): Options used to overide what is loaded from the config.
To see what options are available consult default.py
cfg (str, optional): Location of config file to load. Defaults to None.
"""
update_config(config, options=options, config_file=cfg)
print("Training ", config.TRAIN.MODALITY, " model.")
print("Batch size:", config.TRAIN.BATCH_SIZE, " Gradient accumulation steps:", config.TRAIN.GRAD_ACCUM_STEPS)
torch.backends.cudnn.benchmark = config.CUDNN.BENCHMARK
torch.manual_seed(config.SEED)
if torch.cuda.is_available():
torch.cuda.manual_seed_all(config.SEED)
np.random.seed(seed=config.SEED)
# Log to tensorboard
writer = SummaryWriter(log_dir=config.LOG_DIR)
# Setup dataloaders
train_loader = torch.utils.data.DataLoader(
I3DDataSet(
data_root=config.DATASET.DIR,
split=config.DATASET.SPLIT,
sample_frames=config.TRAIN.SAMPLE_FRAMES,
modality=config.TRAIN.MODALITY,
transform=torchvision.transforms.Compose([
GroupScale(config.TRAIN.RESIZE_MIN),
GroupRandomCrop(config.TRAIN.INPUT_SIZE),
GroupRandomHorizontalFlip(),
GroupNormalize(modality=config.TRAIN.MODALITY),
Stack(),
])
),
batch_size=config.TRAIN.BATCH_SIZE,
shuffle=True,
num_workers=config.WORKERS,
pin_memory=config.PIN_MEMORY
)
val_loader = torch.utils.data.DataLoader(
I3DDataSet(
data_root=config.DATASET.DIR,
split=config.DATASET.SPLIT,
modality=config.TRAIN.MODALITY,
train_mode=False,
transform=torchvision.transforms.Compose([
GroupScale(config.TRAIN.RESIZE_MIN),
GroupCenterCrop(config.TRAIN.INPUT_SIZE),
GroupNormalize(modality=config.TRAIN.MODALITY),
Stack(),
]),
),
batch_size=config.TEST.BATCH_SIZE,
shuffle=False,
num_workers=config.WORKERS,
pin_memory=config.PIN_MEMORY
)
# Setup model
if config.TRAIN.MODALITY == "RGB":
channels = 3
checkpoint = config.MODEL.PRETRAINED_RGB
elif config.TRAIN.MODALITY == "flow":
channels = 2
checkpoint = config.MODEL.PRETRAINED_FLOW
else:
raise ValueError("Modality must be RGB or flow")
i3d_model = InceptionI3d(400, in_channels=channels)
i3d_model.load_state_dict(torch.load(checkpoint))
# Replace final FC layer to match dataset
i3d_model.replace_logits(config.DATASET.NUM_CLASSES)
criterion = torch.nn.CrossEntropyLoss().cuda()
optimizer = optim.SGD(
i3d_model.parameters(),
lr=0.1,
momentum=0.9,
weight_decay=0.0000001
)
i3d_model = i3d_model.cuda()
scheduler = optim.lr_scheduler.ReduceLROnPlateau(
optimizer,
factor=0.1,
patience=2,
verbose=True,
threshold=1e-4,
min_lr=1e-4
)
# Data-parallel
devices_lst = list(range(torch.cuda.device_count()))
print("Devices {}".format(devices_lst))
if len(devices_lst) > 1:
i3d_model = torch.nn.DataParallel(i3d_model)
if not os.path.exists(config.MODEL.CHECKPOINT_DIR):
os.makedirs(config.MODEL.CHECKPOINT_DIR)
for epoch in range(config.TRAIN.MAX_EPOCHS):
train(train_loader,
i3d_model,
criterion,
optimizer,
epoch,
writer
)
if (epoch + 1) % config.TEST.EVAL_FREQ == 0 or epoch == config.TRAIN.MAX_EPOCHS - 1:
val_loss = validate(val_loader, i3d_model, criterion, epoch, writer)
scheduler.step(val_loss)
torch.save(
i3d_model.module.state_dict(),
config.MODEL.CHECKPOINT_DIR+'/'+config.MODEL.NAME+'_split'+str(config.DATASET.SPLIT)+'_epoch'+str(epoch).zfill(3)+'.pt'
)
writer.close()
if __name__ == "__main__":
fire.Fire(run)


@ -0,0 +1,98 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.
# Adapted from https://github.com/feiyunzhang/i3d-non-local-pytorch/blob/master/transforms.py
import torchvision
import random
from PIL import Image, ImageOps
import numpy as np
import numbers
import math
import torch
class GroupScale(object):
def __init__(self, size, interpolation=Image.BILINEAR):
self.worker = torchvision.transforms.Resize(size, interpolation)
def __call__(self, img_group):
return [self.worker(img) for img in img_group]
class GroupRandomCrop(object):
def __init__(self, size):
if isinstance(size, numbers.Number):
self.size = (int(size), int(size))
else:
self.size = size
def __call__(self, img_group):
w, h = img_group[0].size
th, tw = self.size
out_images = list()
x1 = random.randint(0, w - tw)
y1 = random.randint(0, h - th)
for img in img_group:
assert(img.size[0] == w and img.size[1] == h)
if w == tw and h == th:
out_images.append(img)
else:
out_images.append(img.crop((x1, y1, x1 + tw, y1 + th)))
return out_images
class GroupCenterCrop(object):
def __init__(self, size):
self.worker = torchvision.transforms.CenterCrop(size)
def __call__(self, img_group):
cropped_imgs = [self.worker(img) for img in img_group]
return cropped_imgs
class GroupRandomHorizontalFlip(object):
def __call__(self, img_group):
v = random.random()
if v < 0.5:
ret = [img.transpose(Image.FLIP_LEFT_RIGHT) for img in img_group]
return ret
else:
return img_group
class GroupNormalize(object):
def __init__(self, modality, means=[0.485, 0.456, 0.406], stds=[0.229, 0.224, 0.225]):
self.modality = modality
self.means = means
self.stds = stds
self.tensor_worker = torchvision.transforms.ToTensor()
self.norm_worker = torchvision.transforms.Normalize(mean=means, std=stds)
def __call__(self, img_group):
if self.modality == "RGB":
# Convert images to tensors in range [0, 1]
img_tensors = [self.tensor_worker(img) for img in img_group]
# Normalize to imagenet means and stds
img_tensors = [self.norm_worker(img) for img in img_tensors]
else:
# Convert images to numpy arrays
img_arrays = [np.asarray(img).transpose([2, 0, 1]) for img in img_group]
# Scale to [-1, 1] and convert to tensor
img_tensors = [torch.from_numpy((img / 255.) * 2 - 1) for img in img_arrays]
return img_tensors
class Stack(object):
def __call__(self, img_tensors):
# Stack tensors and permute from D x C x H x W to C x D x H x W
return torch.stack(img_tensors, dim=0).permute(1, 0, 2, 3).float()
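The classes above operate on a whole group of frames at a time, so they can be chained with torchvision's `Compose` just like single-image transforms. Below is a minimal usage sketch (not part of the repository) assuming the `Group*` classes above are importable; the frame count, scale and crop sizes are illustrative choices, not values prescribed here.

```python
# Minimal sketch: chaining the group transforms above for one clip.
# Assumes the Group* classes defined above are in scope (e.g. imported
# from this transforms module).
import torchvision
from PIL import Image

clip_transform = torchvision.transforms.Compose([
    GroupScale(256),               # resize every frame in the clip
    GroupRandomCrop(224),          # same random crop applied to all frames
    GroupRandomHorizontalFlip(),   # same flip decision for all frames
    GroupNormalize("RGB"),         # per-frame ToTensor + ImageNet normalization
    Stack(),                       # -> tensor of shape C x D x H x W
])

# frames: list of PIL.Image objects sampled from one video clip (dummy here)
frames = [Image.new("RGB", (320, 240)) for _ in range(8)]
clip_tensor = clip_transform(frames)
print(clip_tensor.shape)  # torch.Size([3, 8, 224, 224])
```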

Binary file added: contrib/action_recognition/media/action_recognition.gif (8.5 MiB)
Binary file added: contrib/action_recognition/media/inference_speeds.png (221 KiB)
Binary file added: contrib/action_recognition/media/publications.png (459 KiB)


@ -0,0 +1,53 @@
# Video Annotation Summary For Action Recognition
To create a training or evaluation set for action recognition, the ground-truth start/end positions of actions in videos need to be annotated. We looked into various tools for this, and the tool we liked most (by far) is called [VGG Image Annotator (VIA)](http://www.robots.ox.ac.uk/~vgg/software/via/), written by the VGG group at Oxford.
## Instructions For Using VIA Tool
Below are a few tips and steps on how to use the VIA tool. A fully functioning live demo of the tool can be found [here](http://www.robots.ox.ac.uk/~vgg/software/via/demo/via_video_annotator.html).
![](./media/fig3.png)
<p align="center">Screenshot of VIA Tool</p>
How to use the tool for action recognition:
- Step 1: Download the zip file from the link [here](http://www.robots.ox.ac.uk/~vgg/software/via/downloads/via-1.0.6.zip).
- Step 2: Unzip the tool and open *via_video_annotator.html* to launch the annotation tool. *Note: support for some browsers does not seem fully stable - we found Chrome to work best.*
- Step 3: Import the video file(s) from local disk using ![](./media/fig4.png) or from a URL using ![](./media/fig5.png).
- Step 4: Use ![](./media/fig1.png) to create a new attribute for action annotation. Select *Temporal Segment in Video or Audio* for *Anchor*. To see the created attribute, click ![](./media/fig1.png) again.
- Step 5: Update the *Timeline List* with the actions you want to track. Separate different actions using numbered entries, e.g. "1. eat, 2. drink" creates two separate tracks for *eat* and *drink*. Click *update* to see the updated tracks.
- Step 6: Click on one track to add segment annotations for a certain action. Use key `a` to add the temporal segment at the current time and `Shift + a` to update the edge of the temporal segment to the current time.
- Step 7: Export the annotations using ![](./media/fig2.png). Select *Only Temporal Segments as CSV* if you only have temporal segment annotations.
## Scripts for use with the VIA Tool
The VIA tool outputs annotations as a csv file. Often, however, we need each annotated action written out as its own clip in a separate file. For this, we provide some utility functions (in this folder) which help with the following (see the usage sketch below):
- Extracting each action as a "positive" clip, as well as "negative" clips, defined as video segments where no action of interest occurs.
- Converting the video clips to a format which the VIA tool knows how to read.
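For orientation, below is a minimal sketch (not part of the repository) of how these utilities might be called directly on a VIA export; it mirrors the flow that the provided extraction script automates. The csv and directory names are placeholders, and the import assumes the utilities module is on the Python path.

```python
# Hypothetical usage of the utilities in this folder; "via_export.csv",
# "videos" and "clips" are placeholder paths.
import pandas as pd
from video_annotation_utils import (
    create_clip_file_name,
    get_clip_action_label,
    extract_clip,
)

# mirror the extraction script: skip the leading row of the VIA export,
# and drop rows with empty metadata (no action label)
annos = pd.read_csv("via_export.csv", skiprows=1)
annos = annos.loc[annos["metadata"] != "{}"]

# derive the output clip file name and action label for each annotation,
# then cut the positive clips out of the source videos with ffmpeg
annos["clip_file_name"] = annos.apply(
    lambda r: create_clip_file_name(r, clip_file_format="mp4"), axis=1
)
annos["clip_action_label"] = annos.apply(get_clip_action_label, axis=1)
annos.apply(lambda r: extract_clip(r, "videos", "clips"), axis=1)
```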
## Annotation Tools Comparison
Below is a list of alternative UIs for annotating actions; in our opinion, however, the VIA tool is by far the best performer. We distinguish between:
- Fixed-length clip annotation: where the UI splits the video into fixed-length clips, and the user then annotates the clips.
- Segmentation annotation: where the user annotates the exact start and end position of each action directly. This is more time-consuming than fixed-length clip annotation, but comes with higher localization accuracy.
See also the [HACS Dataset web page](http://hacs.csail.mit.edu/index.html#explore) for some examples showing these two types of annotations.
| Tool Name | Annotation Type |Pros|Cons|Open Source?|
| ----------- | ----------- |----------- |----------- |----------- |
| [MuViLab](https://github.com/ale152/muvilab) | Fixed-length clip annotation | <ul><li> Accelerates clip annotation by displaying many clips at the same time</li><br><li> Especially helpful when the actions are sparse</li></ul>| <ul><li> Not useful when the actions are very short (e.g. a second)</li></ul>|Open source on Github|
| [VIA (VGG Image Annotator)](http://www.robots.ox.ac.uk/~vgg/software/via/) | Segmentation annotation|<ul><li> Light-weight, no prerequisites besides downloading a zip file</li> <br> <li> Actively developed Gitlab project </li><br> <li> Support for: annotating video with high precision (down to milliseconds and frames), previewing the annotated clips, exporting start and end times of the actions to csv, annotating multiple actions in different tracks on the same video </li><br> <li> Easy to ramp up and use</li></ul>|<ul><li> Code can be unstable, e.g. sometimes the tool becomes unresponsive.</li></ul>|Open source on Gitlab|
|[ANVIL](http://www.anvil-software.org/#)|Segmentation annotation|<ul> <li> Supports high-precision annotation and exports start and end times.</li></ul>| <ul><li> Heavier prerequisites, with Java required </li><br> <li> Harder to ramp up compared to VIA, with lots of settings to specify, etc. </li><br> <li> Java-related issues can make the tool difficult to run. </li></ul>|Not open source, but free to download|
|[Action Annotation Tool](https://github.com/devyhia/action-annotation)| Segmentation annotation|<ul><li> Adds labels to key frames in video</li> <br> <li> Supports high precision down to milliseconds</li></ul>|<ul><li> Much less convenient compared to VIA or ANVIL</li> <br> <li> Not in active development</li></ul>|Open source on Github|
## References
- [Deep Learning for Videos: A 2018 Guide to Action Recognition.](http://blog.qure.ai/notes/deep-learning-for-videos-action-recognition-review#targetText=Action%20recognition%20by%20dense%20trajectories,Trajectories%20by%20Wang%20et%20al)
- [Zhao, H., et al. "Hacs: Human action clips and segments dataset for recognition and temporal localization." arXiv preprint arXiv:1712.09374 (2019).](https://arxiv.org/abs/1712.09374)
- [Kay, Will, et al. "The kinetics human action video dataset." arXiv preprint arXiv:1705.06950 (2017).](https://arxiv.org/abs/1705.06950)
- [Abhishek Dutta and Andrew Zisserman. 2019. The VIA Annotation Software for Images, Audio and Video. In Proceedings of the 27th ACM International Conference on Multimedia (MM 19), October 21–25, 2019, Nice, France. ACM, New York, NY, USA, 4 pages. https://doi.org/10.1145/3343031.3350535.](https://arxiv.org/abs/1904.10699)


@ -0,0 +1,148 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.
# prerequisites:
# (1) download and extract ffmpeg: https://github.com/adaptlearning/adapt_authoring/wiki/Installing-FFmpeg
# (2) make sure ffmpeg is on your system's PATH environment variable
# the script relies on the following property of the annotation csv:
# skiprows=1 (the header row is skipped when --has_header is set)
import argparse
import ast
import os
import sys
import pandas as pd
sys.path.append("lib")
from video_annotation_utils import (
create_clip_file_name,
get_clip_action_label,
extract_clip,
extract_negative_samples_per_file,
)
def main(
annotation_filepath,
has_header,
video_dir,
clip_dir,
label_filepath,
clip_format,
clip_margin,
clip_length,
):
# set pandas display
pd.set_option("display.max_columns", 500)
pd.set_option("display.width", 1000)
if has_header:
skiprows = 1
else:
skiprows = 0
# read in the start and end times of the clips, removing records that have no associated label
video_info_df = pd.read_csv(annotation_filepath, skiprows=skiprows)
video_info_df = video_info_df.loc[video_info_df["metadata"] != "{}"]
# create clip file name and label
video_info_df["clip_file_name"] = video_info_df.apply(
lambda x: create_clip_file_name(x, clip_file_format=clip_format),
axis=1,
)
video_info_df["clip_action_label"] = video_info_df.apply(
lambda x: get_clip_action_label(x), axis=1
)
# remove the clips with action label as '_DEFAULT'
video_info_df = video_info_df.loc[
video_info_df["clip_action_label"] != "_DEFAULT"
]
# extract the positive clips
video_info_df.apply(lambda x: extract_clip(x, video_dir, clip_dir), axis=1)
# write the label
video_info_df[["clip_file_name", "clip_action_label"]].to_csv(
label_filepath, index=False
)
# Extract negative samples
# add column with file name
video_info_df["video_file"] = video_info_df.apply(
lambda x: ast.literal_eval(x.file_list)[0], axis=1
)
negative_clip_dir = os.path.join(clip_dir, "negative_samples")
video_file_list = list(video_info_df["video_file"].unique())
negative_sample_info_df = pd.DataFrame()
for video_file in video_file_list:
res_df = extract_negative_samples_per_file(
video_file,
video_dir,
video_info_df,
negative_clip_dir,
clip_format,
ignore_clip_length=clip_margin,
clip_length=clip_length,
skip_clip_length=clip_margin,
)
negative_sample_info_df = negative_sample_info_df.append(res_df)
negative_sample_info_df.to_csv(
os.path.join(negative_clip_dir, "negative_clip_info.csv"), index=False
)
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument(
"-A", "--annotation_filepath", help="CSV filepath from the annotator"
)
parser.add_argument(
"-H",
"--has_header",
help="Set if the annotation file has header",
action="store_true",
)
parser.add_argument("-I", "--input_dir", help="Input video dir")
parser.add_argument(
"-O",
"--output_dir",
help="Output dir where the extracted clips will be stored",
default="./outputs",
)
parser.add_argument(
"-L",
"--label_filepath",
help="Path where the label csv will be stored",
default="./outputs/labels.csv",
)
parser.add_argument("-F", "--clip_format", default="mp4")
parser.add_argument(
"-M",
"--clip_margin",
type=float,
help="The length around the positive samples to be ignored for negative sampling",
default=3.0,
)
parser.add_argument(
"-T",
"--clip_length",
type=float,
help="The length of negative samples to extract",
default=2.0,
)
args = parser.parse_args()
main(
annotation_filepath=args.annotation_filepath,
has_header=args.has_header,
video_dir=args.input_dir,
clip_dir=args.output_dir,
label_filepath=args.label_filepath,
clip_format=args.clip_format,
clip_margin=args.clip_margin,
clip_length=args.clip_length,
)


@ -0,0 +1,433 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.
import ast
import os
import subprocess
import numpy as np
import pandas as pd
# transform the encoded video:
def video_format_conversion(video_path, output_path, h264_format=False):
"""
Encode video in a different format.
:param video_path: str.
Path to input video
:param output_path: str.
Path where converted video will be written to.
:param h264_format: boolean.
Set to True to re-encode the video with the H.264 (libx264) codec; if False the streams are copied without re-encoding.
:return: None.
"""
if not h264_format:
subprocess.run(
[
"ffmpeg",
"-i",
video_path,
"-c",
"copy",
"-map",
"0",
output_path,
]
)
else:
subprocess.run(
["ffmpeg", "-i", video_path, "-vcodec", "libx264", output_path]
)
def create_clip_file_name(row, clip_file_format="mp4"):
"""
Create the output clip file name.
:param row: pandas.Series.
One row of the video annotation output from the VIA tool.
This function requires that the VIA tool output contains a column '# CSV_HEADER = metadata_id'.
:param clip_file_format: str.
The format of the output clip file.
:return: str.
The output clip file name.
"""
video_file = ast.literal_eval(row.file_list)[0]
clip_id = row["# CSV_HEADER = metadata_id"]
clip_file = "{}_{}.{}".format(video_file, clip_id, clip_file_format)
return clip_file
def get_clip_action_label(row):
"""
Get the action label of the positive clips.
This function requires that the VIA tool output contains a column 'metadata'.
:param row: pandas.Series.
One row of the video annotation output.
:return: str.
"""
label_dict = ast.literal_eval(row.metadata)
track_key = list(label_dict.keys())[0]
return label_dict[track_key]
def _extract_clip_ffmpeg(
start_time, duration, video_path, clip_path, ffmpeg_path=None
):
"""
Using ffmpeg to extract clip from the video based on the start time and duration of the clip.
:param start_time: float.
The start time of the clip.
:param duration: float.
The duration of the clip.
:param video_path: str.
The path of the input video.
:param clip_path: str.
The path of the output clip.
:param ffmpeg_path: str.
The path to ffmpeg. Optional; use it when ffmpeg has not been added to the PATH environment variable.
:return: None.
"""
subprocess.run(
[
os.path.join(ffmpeg_path, "ffmpeg")
if ffmpeg_path is not None
else "ffmpeg",
"-ss",
str(start_time),
"-i",
video_path,
"-t",
str(duration),
clip_path,
"-codec",
"copy",
"-y",
]
)
def extract_clip(row, video_dir, clip_dir, ffmpeg_path=None):
"""
Extract the positive clip based on a row of the output annotation file.
:param row: pandas.Series.
One row of the video annotation output.
:param video_dir: str.
The directory of the input videos.
:param clip_dir: str.
The directory of the output positive clips.
:param ffmpeg_path: str.
The path to ffmpeg. Optional; use it when ffmpeg has not been added to the PATH environment variable.
:return: None.
"""
if not os.path.exists(clip_dir):
os.makedirs(clip_dir)
# there are two different csv output formats from VIA for the annotation start and end times:
# (1) in two columns: temporal_segment_start and temporal_segment_end
# (2) in one column: temporal_coordinates
if "temporal_segment_start" in row.index:
start_time = row.temporal_segment_start
if "temporal_segment_end" not in row.index:
raise ValueError(
"There is no column named 'temporal_segment_end'. Cannot get the full details "
"of the action temporal intervals."
)
end_time = row.temporal_segment_end
elif "temporal_coordinates" in row.index:
start_time, end_time = ast.literal_eval(row.temporal_coordinates)
else:
raise Exception("There is no temporal information in the csv.")
clip_sub_dir = os.path.join(clip_dir, row.clip_action_label)
if not os.path.exists(clip_sub_dir):
os.makedirs(clip_sub_dir)
duration = end_time - start_time
video_file = ast.literal_eval(row.file_list)[0]
video_path = os.path.join(video_dir, video_file)
clip_file = row.clip_file_name
clip_path = os.path.join(clip_sub_dir, clip_file)
if not os.path.exists(video_path):
raise ValueError(
"The video path '{}' is not valid.".format(video_path)
)
# ffmpeg -ss 9.222 -i youtube.mp4 -t 0.688 tmp.mp4 -codec copy -y
_extract_clip_ffmpeg(
start_time, duration, video_path, clip_path, ffmpeg_path
)
def get_video_length(video_file_path):
"""
Get the video length in seconds.
:param video_file_path: str.
The path of the video file.
:return: float.
The video duration in seconds. Raises a RuntimeError if ffprobe reports an error.
"""
cmd_list = [
"ffprobe",
"-v",
"error",
"-show_entries",
"format=duration",
"-of",
"default=noprint_wrappers=1:nokey=1",
video_file_path,
]
result = subprocess.run(
cmd_list, stdout=subprocess.PIPE, stderr=subprocess.PIPE
)
if len(result.stderr) > 0:
raise RuntimeError(result.stderr)
return float(result.stdout)
def _merge_temporal_interval(temporal_interval_list):
"""
Merge the temporal intervals in the input temporal interval list. This is for situations
where different actions have overlapping temporal intervals. E.g. if the input temporal interval list
is [(1.0, 3.0), (2.0, 4.0)], then [(1.0, 4.0)] will be returned.
:param temporal_interval_list: list of tuples.
List of tuples with (temporal interval start time, temporal interval end time).
:return: list of tuples.
The merged temporal interval list.
"""
# sort by the temporal interval start
temporal_interval_list_sorted = sorted(
temporal_interval_list, key=lambda x: x[0]
)
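# walk through the sorted intervals and merge, in place, any interval that overlaps the next one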
i = 0
while i < len(temporal_interval_list_sorted) - 1:
a1, b1 = temporal_interval_list_sorted[i]
a2, b2 = temporal_interval_list_sorted[i + 1]
if a2 <= b1:
del temporal_interval_list_sorted[i]
temporal_interval_list_sorted[i] = [a1, max(b1, b2)]
else:
i += 1
return temporal_interval_list_sorted
def _split_interval(
interval,
left_ignore_clip_length,
right_ignore_clip_length,
clip_length,
skip_clip_length=0,
):
"""
Split the negative sample interval into the sub-intervals which will serve as the start and end of
the negative sample clips.
:param interval: tuple of (float, float).
Tuple of start and end of the negative sample interval.
:param left_ignore_clip_length: float.
The clip length to ignore in the left/start of the interval. This is used to avoid creating
negative sample clips with edges too close to positive samples. The same applies to right_ignore_clip_length.
:param right_ignore_clip_length: float.
The clip length to ignore in the right/end of the interval.
:param clip_length: float.
The clip length of the created negative clips.
:param skip_clip_length: float.
The video length to skip between two negative samples; this can be used to reduce the
number of negative samples.
:return: list of tuples.
List of start and end time of the negative clips.
"""
left, right = interval
if (left_ignore_clip_length + right_ignore_clip_length) >= (right - left):
return []
new_left = left + left_ignore_clip_length
new_right = right - right_ignore_clip_length
if new_right - new_left < clip_length:
return []
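# candidate clip start times are spaced clip_length + skip_clip_length apart; the last candidate is dropped below if its end would run past the usable interval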
interval_start_list = np.arange(
new_left, new_right, clip_length + skip_clip_length
)
interval_end_list = interval_start_list + clip_length
if interval_end_list[-1] > new_right:
interval_start_list = interval_start_list[:-1]
interval_end_list = interval_end_list[:-1]
res = list(zip(list(interval_start_list), list(interval_end_list)))
return res
def _split_interval_list(
interval_list,
left_ignore_clip_length,
right_ignore_clip_length,
clip_length,
skip_clip_length=0,
):
"""
Given the list of eligible negative-sample time intervals, return the list of
start and end times of the negative clips.
:param interval_list: list of tuples.
List of the tuples containing the start time and end time of the eligible negative
sample time intervals.
:param left_ignore_clip_length: float.
See split_interval.
:param right_ignore_clip_length: float.
See split_interval.
:param clip_length: float.
See split_interval.
:param skip_clip_length: float.
See split_interval
:return: list of tuples.
List of start and end time of the negative clips.
"""
interval_res = []
for i in range(len(interval_list)):
interval_res += _split_interval(
interval_list[i],
left_ignore_clip_length=left_ignore_clip_length,
right_ignore_clip_length=right_ignore_clip_length,
clip_length=clip_length,
skip_clip_length=skip_clip_length,
)
return interval_res
def extract_negative_samples_per_file(
video_file,
video_dir,
video_info_df,
negative_clip_dir,
clip_file_format,
ignore_clip_length,
clip_length,
ffmpeg_path=None,
skip_clip_length=0,
):
"""
Extract the negative sample for a single video file.
:param video_file: str.
The name of the input video file.
:param video_dir: str.
The directory of the input video.
:param video_info_df: pandas.DataFrame.
The data frame which contains the video annotation output.
:param negative_clip_dir: str.
The directory of the output negative clips.
:param clip_file_format: str.
The format of the output negative clips.
:param ignore_clip_length: float.
The length to ignore at both edges of each negative interval. This is used to avoid creating
negative sample clips whose edges are too close to positive samples.
:param clip_length: float.
The clip length of the created negative clips.
:param ffmpeg_path: str.
The path to ffmpeg. Optional; use it when ffmpeg has not been added to the PATH environment variable.
:param skip_clip_length: float.
The video length to skip between two negative samples; this can be used to reduce the
number of negative samples.
:return: pandas.DataFrame.
The data frame which contains start and end time of the negative clips.
"""
# get the length of the video
video_file_path = os.path.join(video_dir, video_file)
video_duration = get_video_length(video_file_path)
# get the actions intervals
if "temporal_coordinates" in video_info_df.columns:
temporal_interval_series = video_info_df.loc[
video_info_df["video_file"] == video_file, "temporal_coordinates"
]
temporal_interval_list = temporal_interval_series.apply(
lambda x: ast.literal_eval(x)
).tolist()
elif "temporal_segment_start" in video_info_df.columns:
video_start_list = video_info_df.loc[
video_info_df["video_file"] == video_file, "temporal_segment_start"
].to_list()
video_end_list = video_info_df.loc[
video_info_df["video_file"] == video_file, "temporal_segment_end"
].to_list()
temporal_interval_list = list(zip(video_start_list, video_end_list))
else:
raise Exception("There is no temporal information in the csv.")
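# sanity check: every annotation must consist of start/end pairs, i.e. an even number of time points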
if not all(
len(temporal_interval) % 2 == 0
for temporal_interval in temporal_interval_list
):
raise ValueError(
"There is at least one time interval "
"in {} having only one end point.".format(
str(temporal_interval_list)
)
)
temporal_interval_list = _merge_temporal_interval(temporal_interval_list)
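# boundaries of the candidate negative regions: video start, the (merged, sorted) start/end points of all positive segments, and video end; pairing consecutive boundaries below yields the gaps between positive segments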
negative_sample_interval_list = (
[0.0]
+ [t for interval in temporal_interval_list for t in interval]
+ [video_duration]
)
negative_sample_interval_list = [
[
negative_sample_interval_list[2 * i],
negative_sample_interval_list[2 * i + 1],
]
for i in range(len(negative_sample_interval_list) // 2)
]
clip_interval_list = _split_interval_list(
negative_sample_interval_list,
left_ignore_clip_length=ignore_clip_length,
right_ignore_clip_length=ignore_clip_length,
clip_length=clip_length,
skip_clip_length=skip_clip_length,
)
if not os.path.exists(negative_clip_dir):
os.makedirs(negative_clip_dir)
negative_clip_file_list = []
for i, clip_interval in enumerate(clip_interval_list):
start_time = clip_interval[0]
duration = clip_interval[1] - clip_interval[0]
negative_clip_file = "{}_{}.{}".format(video_file, i, clip_file_format)
negative_clip_file_list.append(negative_clip_file)
negative_clip_path = os.path.join(
negative_clip_dir, negative_clip_file
)
_extract_clip_ffmpeg(
start_time,
duration,
video_file_path,
negative_clip_path,
ffmpeg_path,
)
return pd.DataFrame(
{
"negative_clip_file_name": negative_clip_file_list,
"clip_duration": clip_interval_list,
"video_file": video_file,
}
)

Binary file added: contrib/action_recognition/video_annotation/media/fig1.png (448 B)
Binary file added: contrib/action_recognition/video_annotation/media/fig2.png (423 B)
Binary file added: contrib/action_recognition/video_annotation/media/fig3.png (440 KiB)
Binary file added: contrib/action_recognition/video_annotation/media/fig4.png (565 B)
Binary file added: contrib/action_recognition/video_annotation/media/fig5.png (567 B)


@ -0,0 +1,152 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.
import ast
import math
import os
import sys
sys.path.append("../")
import pandas as pd
import pytest
from video_annotation.video_annotation_utils import (
# Usually don't test private functions.
# But as clip-extraction results are tricky to test, we test some of the private functions here.
_merge_temporal_interval,
_split_interval_list,
create_clip_file_name,
extract_clip,
extract_negative_samples_per_file,
get_clip_action_label,
get_video_length,
)
VIDEO_DIR = os.path.join("tests", "data")
SAMPLE_VIDEO1_FILE = "3173 1-7 Cold 2019-08-19_13_56_14_787.mp4"
SAMPLE_VIDEO1_PATH = os.path.join(VIDEO_DIR, SAMPLE_VIDEO1_FILE)
SAMPLE_ANNOTATION_FILE = "Unnamed-VIA Project19Sep2019_18h42m15s_export.csv"
SAMPLE_ANNOTATION_PATH = os.path.join(VIDEO_DIR, SAMPLE_ANNOTATION_FILE)
FRAME_PER_SECOND = 30
@pytest.fixture
def annotation_df():
video_info_df = pd.read_csv(SAMPLE_ANNOTATION_PATH, skiprows=1)
return video_info_df.loc[video_info_df["metadata"] != "{}"]
def test_create_clip_file_name(annotation_df):
row1 = annotation_df.iloc[0]
file1 = create_clip_file_name(row1, clip_file_format="mp4")
assert file1 == "3173 1-7 Cold 2019-08-19_13_56_14_787.mp4_1_zCXg2CQ5.mp4"
file2 = create_clip_file_name(row1, clip_file_format="avi")
assert file2 == "3173 1-7 Cold 2019-08-19_13_56_14_787.mp4_1_zCXg2CQ5.avi"
def test_get_clip_action_label(annotation_df):
row1 = annotation_df.iloc[0]
assert get_clip_action_label(row1) == "1.action_1"
def test_extract_clip(annotation_df, tmp_path):
row1 = annotation_df.iloc[0].copy()
row1["clip_action_label"] = get_clip_action_label(row1)
row1["clip_file_name"] = create_clip_file_name(
row1, clip_file_format="mp4"
)
extract_clip(
row=row1, video_dir=VIDEO_DIR, clip_dir=tmp_path, ffmpeg_path=None
)
output_clip_path = os.path.join(
tmp_path, row1["clip_action_label"], row1["clip_file_name"]
)
# Test if extracted positive clip length is the same as the annotated segment length
assert (
abs(
get_video_length(output_clip_path)
- (row1.temporal_segment_end - row1.temporal_segment_start)
)
<= 1 / FRAME_PER_SECOND
)
def test_extract_negative_samples_per_file(annotation_df, tmp_path):
"""TODO This function should test two things which are missing now:
1. assert if the extracted negative samples are not overlapped with any positive samples
2. assert the number of extracted negative samples are correct
"""
video_df = annotation_df.copy()
video_df["video_file"] = video_df.apply(
lambda x: ast.literal_eval(x.file_list)[0], axis=1
)
clip_length = 2
extract_negative_samples_per_file(
video_file=SAMPLE_VIDEO1_FILE,
video_dir=VIDEO_DIR,
video_info_df=video_df,
negative_clip_dir=tmp_path,
clip_file_format="mp4",
ignore_clip_length=0,
clip_length=clip_length,
ffmpeg_path=None,
skip_clip_length=0,
)
for i in range(4):
negative_clip_length = get_video_length(
os.path.join(tmp_path, "{}_{}.mp4".format(SAMPLE_VIDEO1_FILE, i))
)
assert abs(negative_clip_length - clip_length) <= 1 / FRAME_PER_SECOND
def test_get_video_length():
assert get_video_length(SAMPLE_VIDEO1_PATH) == 18.719
def test_merge_temporal_interval():
interval_list1 = [(1, 2.5), (1.5, 2), (0.5, 1.5)]
merged_interval_list1 = _merge_temporal_interval(interval_list1)
assert merged_interval_list1 == [[0.5, 2.5]]
interval_list2 = [(-1.1, 0), (0, 1.2), (4.5, 7), (6.8, 8.5)]
merged_interval_list2 = _merge_temporal_interval(interval_list2)
assert merged_interval_list2 == [[-1.1, 1.2], [4.5, 8.5]]
def _float_tuple_close(input_tuple1, input_tuple2):
return all(
math.isclose(input_tuple1[i], input_tuple2[i])
for i in range(len(input_tuple1))
)
def _float_tuple_list_close(input_tuple_list1, input_tuple_list2):
return all(
_float_tuple_close(input_tuple_list1[i], input_tuple_list2[i])
for i in range(len(input_tuple_list1))
)
def test_split_interval_list():
interval_list1 = [(0.5, 3.0), (5.0, 9.0)]
res1 = _split_interval_list(
interval_list1,
left_ignore_clip_length=0.3,
right_ignore_clip_length=0.5,
clip_length=0.6,
skip_clip_length=0.1,
)
assert _float_tuple_list_close(
res1,
[
(0.8, 1.4),
(1.5, 2.1),
(5.3, 5.9),
(6.0, 6.6),
(6.7, 7.3),
(7.4, 8.0),
],
)


@ -0,0 +1,47 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.
"""
ffmpeg video conversion from 'asf' to 'mp4' and 'avi' without video quality loss:
referenced stackoverflow answer:
https://stackoverflow.com/questions/15049829/remux-to-mkv-but-add-all-streams-using-ffmpeg/15052662#15052662
"""
import argparse
import os
import sys
sys.path.append("lib")
from video_annotation_utils import video_format_conversion
def main(video_dir, output_dir):
for output_format in ["mp4", "avi"]:
output_sub_dir = os.path.join(output_dir, output_format)
if not os.path.exists(output_sub_dir):
os.makedirs(output_sub_dir)
# get all the files in the directory
for video_file in os.listdir(video_dir):
if video_file[-3:] == "asf":
video_path = os.path.join(video_dir, video_file)
output_file_name = video_file[:-4] + ".{}".format(
output_format
)
output_path = os.path.join(output_sub_dir, output_file_name)
video_format_conversion(
video_path, output_path, h264_format=True
)
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("-i", "--input_dir", help="Input video dir")
parser.add_argument(
"-o",
"--output_dir",
help="Output dir where the converted videos will be stored",
default="./outputs",
)
args = parser.parse_args()
main(args.input_dir, args.output_dir)

docker/Dockerfile (new file, 62 lines)

@ -0,0 +1,62 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.
ARG ENV="cpu"
ARG HOME="/root"
FROM ubuntu:18.04 AS cpu
ARG HOME
ENV HOME="${HOME}"
WORKDIR ${HOME}
# Install base dependencies
RUN apt-get update && \
apt-get install -y curl git build-essential
# Install Anaconda
ARG ANACONDA="https://repo.continuum.io/miniconda/Miniconda3-4.6.14-Linux-x86_64.sh"
RUN curl ${ANACONDA} -o anaconda.sh && \
/bin/bash anaconda.sh -b -p conda && \
rm anaconda.sh
ENV PATH="${HOME}/conda/envs/cv/bin:${PATH}"
# Clone Computer Vision repo
ARG BRANCH="master"
RUN git clone --depth 1 --single-branch -b ${BRANCH} https://github.com/microsoft/computervision
# Setup Jupyter notebook configuration
ENV NOTEBOOK_CONFIG="${HOME}/.jupyter/jupyter_notebook_config.py"
RUN mkdir ${HOME}/.jupyter && \
echo "c.NotebookApp.token = ''" >> ${NOTEBOOK_CONFIG} && \
echo "c.NotebookApp.ip = '0.0.0.0'" >> ${NOTEBOOK_CONFIG} && \
echo "c.NotebookApp.allow_root = True" >> ${NOTEBOOK_CONFIG} && \
echo "c.NotebookApp.open_browser = False" >> ${NOTEBOOK_CONFIG} && \
echo "c.MultiKernelManager.default_kernel_name = 'python3'" >> ${NOTEBOOK_CONFIG}
# GPU Stage
FROM nvidia/cuda:9.0-base AS gpu
ARG HOME
WORKDIR ${HOME}
COPY --from=cpu ${HOME} .
ENV PATH="${HOME}/conda/envs/cv/bin:${PATH}"
# Final Stage
FROM $ENV AS final
# Install Conda dependencies
RUN conda env create -f computervision/environment.yml && \
conda clean -fay && \
python -m ipykernel install --user --name 'cv' --display-name 'python3'
ARG HOME
WORKDIR ${HOME}/computervision
EXPOSE 8888
CMD ["jupyter", "notebook"]

docker/README.md (new file, 60 lines)

@ -0,0 +1,60 @@
Docker Support
==============
The Dockerfile in this directory will build Docker images with all the dependencies and code needed to run example notebooks or unit tests included in this repository.
Multiple environments are supported by using [multistage builds](https://docs.docker.com/develop/develop-images/multistage-build/). In order to efficiently build the Docker images in this way, [Docker BuildKit](https://docs.docker.com/develop/develop-images/build_enhancements/) is necessary.
The following examples show how to build and run the Docker image for CPU and GPU environments. Note that on some platforms one needs to manually set the `DOCKER_BUILDKIT` environment variable to make sure the build runs well. For example, on a Windows machine this can be done with the PowerShell command below, before building the image:
```
$env:DOCKER_BUILDKIT=1
```
Once the container is running you can access Jupyter notebooks at http://localhost:8888.
Building and Running with Docker
--------------------------------
<details>
<summary><strong><em>CPU environment</em></strong></summary>
```
DOCKER_BUILDKIT=1 docker build -t computervision:cpu --build-arg ENV="cpu" .
docker run -p 8888:8888 -d computervision:cpu
```
</details>
<details>
<summary><strong><em>GPU environment</em></strong></summary>
```
DOCKER_BUILDKIT=1 docker build -t computervision:gpu --build-arg ENV="gpu" .
docker run --runtime=nvidia -p 8888:8888 -d computervision:gpu
```
</details>
Build Arguments
---------------
There are several build arguments which can change how the image is built. Similar to the `ENV` build argument these are specified during the docker build command.
Build Arg|Description|
---------|-----------|
ENV|Environment to use, options: cpu, gpu|
BRANCH|Git branch of the repo to use (defaults to `master`)
ANACONDA|Anaconda installation script (defaults to miniconda3 4.6.14)|
Example using the staging branch:
```
DOCKER_BUILDKIT=1 docker build -t computervision:cpu --build-arg ENV="cpu" --build-arg BRANCH="staging" .
```
In order to see detailed progress with BuildKit you can provide a flag during the build command: ```--progress=plain```
Running tests with docker
-------------------------
```
docker run -it computervision:cpu pytest tests/unit
```


@ -20,7 +20,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Open-source annotation tools for object detection and for image segmentation exist. When there is only one object per image, labeling is done using separate folders for each image class. However we have not found a good tool for image classification when it's possible to have multiple objects in a single image.\n",
"Open-source annotation tools for object detection and for image segmentation exist, however for image classification are less common. When there is only one object per image, labeling can be done by moving images manually into separate folders for each image class. This stategy however is manual, and does not work when it's possible to have multiple different objects in a single image. For such cases, either this notebook can be used, or e.g. this cloud-based [labeling tool](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-label-images).\n",
"\n",
"This notebook provides a simple UI to assist in labeling images. Each image can be annotated with one or more classes or be marked as \"Exclude\" to indicate that the image should not be used for model training or evaluation. "
]

Three file diffs are hidden because one or more lines are too long.


@ -5,6 +5,7 @@
This document tries to answer frequent questions related to object detection. For generic Machine Learning questions, such as "How many training examples do I need?" or "How to monitor GPU usage during training?" see also the image classification [FAQ](https://github.com/microsoft/ComputerVision/blob/master/classification/FAQ.md).
* General
* [Why Torchvision?](#why-torchvision)
* Data
* [How to annotate images?](#how-to-annotate-images)
@ -18,6 +19,10 @@ This document tries to answer frequent questions related to object detection. Fo
## General
### Why Torchvision?
Torchvision has a large active user-base and hence its object detection implementation is easy to use, well tested, and uses state-of-the-art technology which has proven itself in the community. For these reasons we decided to use Torchvision as our object detection library. For advanced users who want to experiment with the latest cutting-edge technology, we recommend to start with our Torchvision notebooks and then also to look into more researchy implementations such as the [mmdetection](https://github.com/open-mmlab/mmdetection) repository.
## Data
### How to annotate images?

Binary file changed: scenarios/detection/media/figures.pptx
Binary file added: scenarios/detection/media/hard_neg.jpg (191 KiB)


@ -21,3 +21,19 @@ jobs:
displayName: Add conda to PATH
- template: templates/unit-test-steps-not-linuxgpu.yml # Template reference
- script: |
call conda env remove -n cv -y
rmdir /s /q C:\Anaconda\envs\cv
workingDirectory: tests
displayName: 'Conda remove'
continueOnError: true
condition: succeededOrFailed()
timeoutInMinutes: 10
- script: |
del /q /S %LOCALAPPDATA%\Temp\*
for /d %%i in (%LOCALAPPDATA%\Temp\*) do @rmdir /s /q "%%i"
displayName: 'Remove Temp Files'
condition: succeededOrFailed()


@ -16,4 +16,20 @@ jobs:
name: cvbpwinpool
steps:
- template: templates/unit-test-steps-not-linuxgpu.yml # Template reference
- template: templates/unit-test-steps-not-linuxgpu.yml # Template reference
- script: |
call conda env remove -n cv -y
rmdir /s /q C:\Anaconda\envs\cv
workingDirectory: tests
displayName: 'Conda remove'
continueOnError: true
condition: succeededOrFailed()
timeoutInMinutes: 10
- script: |
del /q /S %LOCALAPPDATA%\Temp\*
for /d %%i in (%LOCALAPPDATA%\Temp\*) do @rmdir /s /q "%%i"
displayName: 'Remove Temp Files'
condition: succeededOrFailed()


@ -1,44 +0,0 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.
# More info on scheduling: https://docs.microsoft.com/en-us/azure/devops/pipelines/build/triggers?view=azure-devops&tabs=yaml#scheduled-triggers
# Implementing the scheduler from the dashboard
# Uncomment in case it wants to be done from using the yml
# schedules:
# - cron: "56 22 * * *"
# displayName: Daily track of metrics
# branches:
# include:
# - master
# always: true
# no PR builds
pr: none
# no CI trigger
trigger: none
jobs:
- job: Repometrics
pool:
vmImage: 'ubuntu-16.04'
steps:
- task: UsePythonVersion@0
inputs:
versionSpec: '3.6'
architecture: 'x64'
- script: |
cp tools/repo_metrics/config_template.py tools/repo_metrics/config.py
sed -i 's#<GITHUB_TOKEN>#$(github_token)#' tools/repo_metrics/config.py
sed -i "s#<CONNECTION_STRING>#`echo '$(cosmosdb_connectionstring)' | sed 's@&@\\\\&@g'`#" tools/repo_metrics/config.py
displayName: Configure CosmosDB Connection
- script: |
python -m pip install 'python-dateutil>=2.8.0' 'pymongo>=3.8.0' 'gitpython>2.1.11' 'requests>=2.21.0'
python tools/repo_metrics/track_metrics.py --github_repo 'https://github.com/microsoft/ComputerVision' --save_to_database
displayName: Python script to record stats


@ -139,6 +139,7 @@ def detection_notebooks():
"11": os.path.join(
folder_notebooks, "11_exploring_hyperparameters_on_azureml.ipynb"
),
"12": os.path.join(folder_notebooks, "12_hard_negative_sampling.ipynb"),
}
return paths


@ -26,3 +26,20 @@ def test_01_notebook_run(detection_notebooks):
assert len(nb_output.scraps["training_losses"].data) == epochs
assert nb_output.scraps["training_losses"].data[-1] < 0.5
assert nb_output.scraps["training_average_precision"].data[-1] > 0.5
@pytest.mark.notebooks
@pytest.mark.linuxgpu
def test_12_notebook_run(detection_notebooks):
notebook_path = detection_notebooks["12"]
pm.execute_notebook(
notebook_path,
OUTPUT_NOTEBOOK,
parameters=dict(PM_VERSION=pm.__version__, EPOCHS=3),
kernel_name=KERNEL_NAME,
)
nb_output = sb.read_notebook(OUTPUT_NOTEBOOK)
assert nb_output.scraps["valid_accs"].data[-1] > 0.5
assert len(nb_output.scraps["valid_accs"].data) == 1
assert len(nb_output.scraps["hard_im_scores"].data) == 10


@ -22,7 +22,7 @@ def test_01_notebook_run(similarity_notebooks):
)
nb_output = sb.read_notebook(OUTPUT_NOTEBOOK)
assert nb_output.scraps["median_rank"].data <= 10
assert nb_output.scraps["median_rank"].data <= 15
@pytest.mark.notebooks


@ -86,7 +86,7 @@ def test_detection_dataset_init_basic(tiny_od_data_path, od_data_path_labels):
""" Tests that initialization of the Detection Dataset works. """
data = DetectionDataset(tiny_od_data_path)
validate_detection_dataset(data, od_data_path_labels)
assert len(data.test_ds) == 20
assert len(data.test_ds) == 19
assert len(data.train_ds) == 20
@ -96,7 +96,7 @@ def test_detection_dataset_init_train_pct(
""" Tests that initialization with train_pct."""
data = DetectionDataset(tiny_od_data_path, train_pct=0.75)
validate_detection_dataset(data, od_data_path_labels)
assert len(data.test_ds) == 10
assert len(data.test_ds) == 9
assert len(data.train_ds) == 30
@ -105,6 +105,11 @@ def test_detection_dataset_show_ims(basic_detection_dataset):
basic_detection_dataset.show_ims()
def test_detection_dataset_show_im_transformations(basic_detection_dataset):
# simply test that this is error free for now
basic_detection_dataset.show_im_transformations()
def test_detection_dataset_init_anno_im_dirs(
func_tiny_od_data_path, od_data_path_labels
):


@ -5,6 +5,9 @@ from torchvision.models.detection.faster_rcnn import FasterRCNN
from collections.abc import Iterable
import numpy as np
import pytest
import shutil
from pathlib import Path
from typing import Union
from utils_cv.detection.bbox import DetectionBbox
from utils_cv.detection.model import (
@ -128,3 +131,77 @@ def test_detection_dataset_predict_dl(
):
""" Simply test that `predict_dl` works. """
od_detection_learner.predict_dl(od_detection_dataset.test_dl)
def validate_saved_model(name: str, path: str) -> bool:
""" Tests that saved model is there """
assert (Path(path)).exists()
assert (Path(path) / name).exists()
assert (Path(path) / name / "meta.json").exists()
assert (Path(path) / name / "model.pt").exists()
@pytest.mark.gpu
def test_detection_save_model(od_detection_learner, tiny_od_data_path):
""" Test that save function works. """
# test without path (default to using data_path()/models)
model_name = "my_test_model"
od_detection_learner.save(model_name)
validate_saved_model(model_name, Path(tiny_od_data_path) / "models")
# test with path
od_detection_learner.save(
model_name, str(Path(tiny_od_data_path) / "layer")
)
validate_saved_model(model_name, str(Path(tiny_od_data_path) / "layer"))
# test with overwrite
with pytest.raises(Exception):
od_detection_learner.save(model_name, overwrite=False)
@pytest.mark.gpu
@pytest.fixture(scope="session")
def saved_model(od_detection_learner, tiny_od_data_path) -> Union[str, Path]:
""" A saved model so that loading functions can reuse. """
model_name = "test_fixture_model"
od_detection_learner.save(model_name)
assert (Path(tiny_od_data_path) / "models" / model_name).exists()
return model_name, Path(tiny_od_data_path) / "models"
@pytest.mark.gpu
def test_detection_load_model(
od_detection_learner, tiny_od_data_path, saved_model
):
""" Test that load function works. """
# test basic loading
name, path = saved_model
od_detection_learner.load(name=name)
od_detection_learner.load(name=name, path=path)
assert od_detection_learner.labels is not None
# do not specify name or path, it should quietly exit
with pytest.raises(SystemExit) as pytest_wrapped_e:
od_detection_learner.load()
assert pytest_wrapped_e.type == SystemExit
# test if no model files exist
shutil.rmtree(path / name)
shutil.rmtree(Path(tiny_od_data_path) / "models")
with pytest.raises(Exception):
od_detection_learner.load()
# test if only one model file exists in the `data_path`
od_detection_learner.save("test_fixture_model")
od_detection_learner.load()
@pytest.mark.gpu
def test_detection_init_from_saved_model(saved_model):
""" Test that we can create an detection learner from a saved model. """
name, path = saved_model
DetectionLearner.from_saved_model(name, path)


@ -29,8 +29,8 @@ def test_00_notebook_run(detection_notebooks):
assert len(nb_output.scraps["detection_bounding_box"].data) > 0
@pytest.mark.notebooks
@pytest.mark.gpu
@pytest.mark.notebooks
def test_01_notebook_run(detection_notebooks, tiny_od_data_path):
notebook_path = detection_notebooks["01"]
pm.execute_notebook(
@ -48,3 +48,25 @@ def test_01_notebook_run(detection_notebooks, tiny_od_data_path):
nb_output = sb.read_notebook(OUTPUT_NOTEBOOK)
assert len(nb_output.scraps["training_losses"].data) > 0
assert len(nb_output.scraps["training_average_precision"].data) > 0
@pytest.mark.gpu
@pytest.mark.notebooks
def test_12_notebook_run(detection_notebooks, tiny_od_data_path):
notebook_path = detection_notebooks["12"]
pm.execute_notebook(
notebook_path,
OUTPUT_NOTEBOOK,
parameters=dict(
PM_VERSION=pm.__version__,
DATA_PATH=tiny_od_data_path,
EPOCHS=1,
IM_SIZE=100,
),
kernel_name=KERNEL_NAME,
)
nb_output = sb.read_notebook(OUTPUT_NOTEBOOK)
assert len(nb_output.scraps["valid_accs"].data) == 1
assert len(nb_output.scraps["hard_im_scores"].data) == 10


@ -2,6 +2,7 @@
# Licensed under the MIT License.
import os
import copy
import math
from pathlib import Path
from random import randrange
@ -9,6 +10,7 @@ from typing import List, Tuple, Union
import torch
from torch.utils.data import Dataset, Subset, DataLoader
from torchvision.transforms import ColorJitter
import xml.etree.ElementTree as ET
from PIL import Image
@ -19,6 +21,26 @@ from .references.transforms import RandomHorizontalFlip, Compose, ToTensor
from utils_cv.common.gpu import db_num_workers
class ColorJitterTransform(object):
""" Wrapper for torchvision's ColorJitter to make sure 'target
object is passed along """
def __init__(self, brightness, contrast, saturation, hue):
self.brightness = brightness
self.contrast = contrast
self.saturation = saturation
self.hue = hue
def __call__(self, im, target):
im = ColorJitter(
brightness=self.brightness,
contrast=self.contrast,
saturation=self.saturation,
hue=self.hue,
)(im)
return im, target
def get_transform(train: bool) -> List[object]:
""" Gets basic the transformations to apply to images.
@ -33,10 +55,22 @@ def get_transform(train: bool) -> List[object]:
A list of transforms to apply.
"""
transforms = []
# transformations to apply before image is turned into a tensor
if train:
transforms.append(
ColorJitterTransform(
brightness=0.2, contrast=0.2, saturation=0.4, hue=0.05
)
)
# transform im to tensor
transforms.append(ToTensor())
# transformations to apply after image is turned into a tensor
if train:
transforms.append(RandomHorizontalFlip(0.5))
# TODO we can add more 'default' transformations here
return Compose(transforms)
@ -59,7 +93,7 @@ def parse_pascal_voc_anno(
# get image path from annotation. Note that the path field might not be set.
anno_dir = os.path.dirname(anno_path)
if root.find("path"):
if root.find("path") is not None:
im_path = os.path.realpath(
os.path.join(anno_dir, root.find("path").text)
)
@ -73,10 +107,10 @@ def parse_pascal_voc_anno(
for obj in objs:
label = obj.find("name").text
bnd_box = obj.find("bndbox")
left = int(bnd_box.find('xmin').text)
top = int(bnd_box.find('ymin').text)
right = int(bnd_box.find('xmax').text)
bottom = int(bnd_box.find('ymax').text)
left = int(bnd_box.find("xmin").text)
top = int(bnd_box.find("ymin").text)
right = int(bnd_box.find("xmax").text)
bottom = int(bnd_box.find("ymax").text)
# Set mapping of label name to label index
if labels is None:
@ -99,7 +133,7 @@ def parse_pascal_voc_anno(
class DetectionDataset:
""" An object detection dataset.
The dunder methods __init__, __getitem__, and __len__ were inspired from code found here:
The implementation of the dunder methods __init__, __getitem__, and __len__ was inspired by code found here:
https://pytorch.org/tutorials/intermediate/torchvision_tutorial.html#writing-a-custom-dataset-for-pennfudan
"""
@ -107,10 +141,12 @@ class DetectionDataset:
self,
root: Union[str, Path],
batch_size: int = 2,
transforms: object = get_transform(train=True),
train_transforms: object = get_transform(train=True),
test_transforms: object = get_transform(train=False),
train_pct: float = 0.5,
anno_dir: str = "annotations",
im_dir: str = "images",
allow_negatives: bool = False,
):
""" initialize dataset
@ -123,19 +159,22 @@ class DetectionDataset:
root: the root path of the dataset containing the image and
annotation folders
batch_size: batch size for dataloaders
transforms: the transformations to apply
train_transforms: the transformations to apply to the train set
test_transforms: the transformations to apply to the test set
train_pct: the ratio of training to testing data
anno_dir: the name of the annotation subfolder under the root directory
im_dir: the name of the image subfolder under the root directory. If set to 'None' then the image location is inferred from the annotation .xml files
allow_negatives: if False (default) then an error is thrown if no annotation .xml file can be found for a given image. Otherwise the image is used as a negative, i.e. it is assumed that the image does not contain any of the objects of interest.
"""
self.root = Path(root)
# TODO think about how transforms are working...
self.transforms = transforms
self.train_transforms = train_transforms
self.test_transforms = test_transforms
self.im_dir = im_dir
self.anno_dir = anno_dir
self.batch_size = batch_size
self.train_pct = train_pct
self.allow_negatives = allow_negatives
# read annotations
self._read_annos()
@ -146,21 +185,7 @@ class DetectionDataset:
)
# create training and validation data loaders
self.train_dl = DataLoader(
self.train_ds,
batch_size=self.batch_size,
shuffle=True,
num_workers=db_num_workers(),
collate_fn=collate_fn,
)
self.test_dl = DataLoader(
self.test_ds,
batch_size=self.batch_size,
shuffle=False,
num_workers=db_num_workers(),
collate_fn=collate_fn,
)
self.init_data_loaders()
def _read_annos(self) -> List[str]:
""" Parses all Pascal VOC formatted annotation files to extract all
@ -182,20 +207,33 @@ class DetectionDataset:
os.path.splitext(s)[0] + ".xml" for s in im_filenames
]
# Parse all annotations
# Read all annotations
self.im_paths = []
self.anno_paths = []
self.anno_bboxes = []
for anno_idx, anno_filename in enumerate(anno_filenames):
anno_path = self.root / self.anno_dir / str(anno_filename)
assert os.path.exists(
anno_path
), f"Cannot find annotation file: {anno_path}"
anno_bboxes, im_path = parse_pascal_voc_anno(anno_path)
# TODO For now, ignore all images without a single bounding box in it, otherwise throws error during training.
# Parse annotation file if present
if os.path.exists(anno_path):
anno_bboxes, im_path = parse_pascal_voc_anno(anno_path)
else:
if not self.allow_negatives:
raise FileNotFoundError(anno_path)
anno_bboxes = []
im_path = im_paths[anno_idx]
# Torchvision needs at least one ground truth bounding box per image. Hence for images without a single
# annotated object, we add a tiny bounding box with "background" label 0.
if len(anno_bboxes) == 0:
continue
anno_bboxes = [
AnnotationBbox.from_array(
[1, 1, 5, 5],
label_name=None,
label_idx=0,
im_path=im_path,
)
]
if self.im_dir is None:
self.im_paths.append(im_path)
@ -209,15 +247,21 @@ class DetectionDataset:
labels = []
for anno_bboxes in self.anno_bboxes:
for anno_bbox in anno_bboxes:
labels.append(anno_bbox.label_name)
if anno_bbox.label_name is not None:
labels.append(anno_bbox.label_name)
self.labels = list(set(labels))
# Set for each bounding box label name also what its integer representation is
for anno_bboxes in self.anno_bboxes:
for anno_bbox in anno_bboxes:
anno_bbox.label_idx = (
self.labels.index(anno_bbox.label_name) + 1
)
if (
anno_bbox.label_name is None
): # background rectangle is assigned id 0 by design
anno_bbox.label_idx = 0
else:
anno_bbox.label_idx = (
self.labels.index(anno_bbox.label_name) + 1
)
def split_train_test(
self, train_pct: float = 0.8
@ -231,20 +275,69 @@ class DetectionDataset:
Return
A training and testing dataset in that order
"""
# TODO Is it possible to make these lines in split_train_test() a bit
# more intuitive?
test_num = math.floor(len(self) * (1 - train_pct))
indices = torch.randperm(len(self)).tolist()
self.transforms = get_transform(train=True)
train = Subset(self, indices[test_num:])
train = copy.deepcopy(Subset(self, indices[test_num:]))
train.dataset.transforms = self.train_transforms
self.transforms = get_transform(train=False)
test = Subset(self, indices[: test_num + 1])
test = copy.deepcopy(Subset(self, indices[:test_num]))
test.dataset.transforms = self.test_transforms
return train, test
def init_data_loaders(self):
""" Create training and validation data loaders """
self.train_dl = DataLoader(
self.train_ds,
batch_size=self.batch_size,
shuffle=True,
num_workers=db_num_workers(),
collate_fn=collate_fn,
)
self.test_dl = DataLoader(
self.test_ds,
batch_size=self.batch_size,
shuffle=False,
num_workers=db_num_workers(),
collate_fn=collate_fn,
)
def add_images(
self,
im_paths: List[str],
anno_bboxes: List[AnnotationBbox],
target: str = "train",
):
""" Add new images to either the training or test set.
Args:
im_paths: path to the images.
anno_bboxes: ground truth boxes for each image.
target: specify if images are to be added to the training or test set. Valid options: "train" or "test".
Raises:
Exception if `target` variable is neither 'train' nor 'test'
"""
assert len(im_paths) == len(anno_bboxes)
for im_path, anno_bbox in zip(im_paths, anno_bboxes):
self.im_paths.append(im_path)
self.anno_bboxes.append(anno_bbox)
if target.lower() == "train":
self.train_ds.dataset.im_paths.append(im_path)
self.train_ds.dataset.anno_bboxes.append(anno_bbox)
self.train_ds.indices.append(len(self.im_paths) - 1)
elif target.lower() == "test":
self.test_ds.dataset.im_paths.append(im_path)
self.test_ds.dataset.anno_bboxes.append(anno_bbox)
self.test_ds.indices.append(len(self.im_paths) - 1)
else:
raise Exception(f"Target {target} unknown.")
# Re-initialize the data loaders
self.init_data_loaders()
def show_ims(self, rows: int = 1, cols: int = 3) -> None:
""" Show a set of images.
@ -256,6 +349,43 @@ class DetectionDataset:
"""
plot_grid(display_bboxes, self._get_random_anno, rows=rows, cols=cols)
def show_im_transformations(
self, idx: int = None, rows: int = 1, cols: int = 3
) -> None:
""" Show a set of images after transfomrations have been applied.
Args:
idx: the index to of the image to show the transformations for.
rows: number of rows to display
cols: number of cols to dipslay, NOTE: use 3 for best looing grid
Returns None but displays a grid of randomly applied transformations.
"""
if not hasattr(self, "transforms"):
print(
(
"Transformations are not applied ot the base dataset object.\n"
"Call this function on either the train_ds or test_ds instead:\n\n"
" my_detection_data.train_ds.dataset.show_im_transformations()"
)
)
else:
if idx is None:
idx = randrange(len(self.anno_paths))
def plotter(im, ax):
ax.set_xticks([])
ax.set_yticks([])
ax.imshow(im)
def im_gen() -> torch.Tensor:
return self[idx][0].permute(1, 2, 0)
plot_grid(plotter, im_gen, rows=rows, cols=cols)
print(f"Transformations applied on {self.im_paths[idx]}:")
[print(transform) for transform in self.transforms.transforms]
def _get_random_anno(
self
) -> Tuple[List[AnnotationBbox], Union[str, Path]]:


@ -1,8 +1,11 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.
import os
from typing import List, Tuple, Union, Generator, Optional
from pathlib import Path
import json
import shutil
from PIL import Image
import numpy as np
@ -130,13 +133,16 @@ def get_pretrained_fasterrcnn(
return model
def _calculate_ap(e: CocoEvaluator) -> float:
def _calculate_ap(
e: CocoEvaluator, iou_threshold_idx: Union[int, slice] = slice(0, None)
) -> float:
""" Calculate the Average Precision (AP) by averaging all iou
thresholds across all labels.
coco_eval.eval['precision'] is a 5-dimensional array. Each dimension
represents the following:
1. [T] 10 evenly distributed thresholds for IoU, from 0.5 to 0.95.
1. [T] 10 evenly distributed thresholds for IoU, from 0.5 to 0.95. By
default, we use slice(0, None) which is the average from 0.5 to 0.95.
2. [R] 101 recall thresholds, from 0 to 1
3. [K] label, set to slice(0, None) to get precision over all the labels in
the dataset. Then take the mean over all labels.
@ -147,7 +153,13 @@ def _calculate_ap(e: CocoEvaluator) -> float:
Therefore, coco_eval.eval['precision'][0, :, 0, 0, 2] represents the value
of 101 precisions corresponding to 101 recalls from 0 to 100 when IoU=0.5.
"""
precision_settings = (slice(0, None), slice(0, None), slice(0, None), 0, 2)
precision_settings = (
iou_threshold_idx,
slice(0, None),
slice(0, None),
0,
2,
)
coco_eval = e.coco_eval["bbox"].eval["precision"]
return np.mean(np.mean(coco_eval[precision_settings]))
@ -155,15 +167,46 @@ def _calculate_ap(e: CocoEvaluator) -> float:
class DetectionLearner:
""" Detection Learner for Object Detection"""
def __init__(self, dataset: Dataset, model: nn.Module = None):
""" Initialize leaner object. """
def __init__(
self,
dataset: Dataset = None,
model: nn.Module = None,
im_size: int = None,
):
""" Initialize leaner object.
You can only specify an image size `im_size` if `model` is not given.
Args:
dataset: the dataset. This class will infer labels if dataset is present.
model: the nn.Module you wish to use
im_size: image size for your model
"""
# if model is None, dataset must not be
if not model:
assert dataset is not None
# not allowed to specify im size if you're providing a model
if model:
assert im_size is None
# if im_size is not specified, use 500
if im_size is None:
im_size = 500
self.device = torch_device()
self.model = model
self.dataset = dataset
self.im_size = im_size
# setup model, default to fasterrcnn
if self.model is None:
self.model = get_pretrained_fasterrcnn(len(dataset.labels) + 1)
self.model = get_pretrained_fasterrcnn(
len(self.dataset.labels) + 1,
min_size=self.im_size,
max_size=self.im_size,
)
self.model.to(self.device)
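# Illustrative construction sketch (module paths and parameters assume the
# repo's utils_cv layout and are not taken from this diff): either a dataset
# is passed in, in which case labels are inferred and a pretrained Faster
# R-CNN is created with the given `im_size`, or a ready-made model is passed
# in, in which case `im_size` must be left unset, e.g.
#
#     from utils_cv.detection.dataset import DetectionDataset
#     from utils_cv.detection.model import DetectionLearner
#
#     data = DetectionDataset("path/to/voc_style_data")
#     detector = DetectionLearner(data, im_size=600)    # default im_size is 500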
def __getattr__(self, attr):
@ -175,6 +218,12 @@ class DetectionLearner:
)
)
def add_labels(self, labels: List[str]):
""" Add labels to this detector. This class does not expect a label
'__background__' in first element of the label list. Make sure it is
omitted before adding it. """
self.labels = labels
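# Illustrative sketch: when a learner is created from a model only (no
# dataset), the label list can be attached manually; the label names below
# are placeholders, e.g.
#
#     detector.add_labels(["can", "carton", "milk_bottle", "water_bottle"])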
def fit(
self,
epochs: int,
@ -205,6 +254,7 @@ class DetectionLearner:
# store data in these arrays to plot later
self.losses = []
self.ap = []
self.ap_iou_point_5 = []
# main training loop
self.epochs = epochs
@ -227,6 +277,7 @@ class DetectionLearner:
# evaluate
e = self.evaluate(dl=self.dataset.test_dl)
self.ap.append(_calculate_ap(e))
self.ap_iou_point_5.append(_calculate_ap(e, iou_threshold_idx=0))
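# Illustrative sketch (keyword arguments are placeholders): after training,
# both metrics are tracked per epoch, e.g.
#
#     detector.fit(epochs=10)
#     print(detector.ap[-1])               # AP averaged over IoU 0.5:0.95
#     print(detector.ap_iou_point_5[-1])   # AP at IoU=0.5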
def plot_precision_loss_curves(
self, figsize: Tuple[int, int] = (10, 5)
@ -285,7 +336,8 @@ class DetectionLearner:
model = self.model.eval() # eval mode
with torch.no_grad():
pred = model([im])
labels = self.dataset.labels
labels = self.dataset.labels if self.dataset else self.labels
det_bboxes = _get_det_bboxes(pred, labels=labels)
# limit to threshold if threshold is set
@ -345,3 +397,163 @@ class DetectionLearner:
{"idx": im_idx, "det_bboxes": det_bboxes}
)
yield det_bbox_batch
def save(
self, name: str, path: str = None, overwrite: bool = True
) -> None:
""" Saves the model
Save your model in the following format:
/data_path()
+-- <name>
| +-- meta.json
| +-- model.pt
The meta.json will contain information like the labels and the im_size
The model.pt will contain the weights of the model
Args:
name: the name you wish to save your model under
path: optional path to save your model to, will use `data_path`
otherwise
overwrite: overwrite existing models
Raises:
Exception if the model file already exists but overwrite is set to
False
Returns None
"""
if path is None:
path = Path(self.dataset.root) / "models"
# make dir if not exist
if not Path(path).exists():
os.mkdir(path)
# make dir to contain all model/meta files
model_path = Path(path) / name
if model_path.exists():
if overwrite:
shutil.rmtree(str(model_path))
else:
raise Exception(
f"Model of {name} already exists in {path}. Set `overwrite=True` or use another name"
)
os.mkdir(model_path)
# set names
pt_path = model_path / "model.pt"
meta_path = model_path / "meta.json"
# save pt
torch.save(self.model.state_dict(), pt_path)
# save meta file
meta_data = {"labels": self.dataset.labels, "im_size": self.im_size}
with open(meta_path, "w") as meta_file:
json.dump(meta_data, meta_file)
print(f"Model is saved to {model_path}")
def load(self, name: str = None, path: str = None) -> None:
""" Loads a model.
Loads a model that is saved in the format that is outputted in the
`save` function.
Args:
name: The name of the model you wish to load. If no name is
specified, the function will still look for a model under the path
specified by `data_path`. If multiple models are available in that
path, it will require you to pass in a name to specify which one to
use.
path: Pass in a path if the model is not located in the
`data_path`. Otherwise it will assume that it is.
Raises:
Exception if the passed-in name/path is invalid or doesn't exist
"""
# set path
if not path:
if self.dataset:
path = Path(self.dataset.root) / "models"
else:
raise Exception("Specify a `path` parameter")
# if name is given..
if name:
model_path = path / name
pt_path = model_path / "model.pt"
if not pt_path.exists():
raise Exception(
f"No model file named model.pt exists in {model_path}"
)
meta_path = model_path / "meta.json"
if not meta_path.exists():
raise Exception(
f"No model file named meta.txt exists in {model_path}"
)
# if no name is given, we assume there is only one model, otherwise we
# throw an error
else:
models = [f.path for f in os.scandir(path) if f.is_dir()]
if len(models) == 0:
raise Exception(f"No model found in {path}.")
elif len(models) > 1:
print(
f"Multiple models were found in {path}. Please specify which you wish to use in the `name` argument."
)
for model in models:
print(model)
exit()
else:
pt_path = Path(models[0]) / "model.pt"
meta_path = Path(models[0]) / "meta.json"
# load into model
self.model.load_state_dict(
torch.load(pt_path, map_location=torch_device())
)
# load meta info
with open(meta_path, "r") as meta_file:
meta_data = json.load(meta_file)
self.labels = meta_data["labels"]
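# Illustrative sketch (model name is a placeholder): an existing learner can
# restore weights and labels saved earlier, e.g.
#
#     detector.load(name="my_detector")                  # looks under <dataset_root>/models
#     detector.load(name="my_detector", path="outputs")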
@classmethod
def from_saved_model(cls, name: str, path: str) -> "DetectionLearner":
""" Create an instance of the DetectionLearner from a saved model.
This function expects the format that is outputted in the `save`
function.
Args:
name: the name of the model you wish to load
path: the path to get your model from
Returns:
A DetectionLearner object that is ready to run inference.
"""
path = Path(path)
meta_path = path / name / "meta.json"
assert meta_path.exists()
im_size, labels = None, None
with open(meta_path) as json_file:
meta_data = json.load(json_file)
im_size = meta_data["im_size"]
labels = meta_data["labels"]
model = get_pretrained_fasterrcnn(
len(labels) + 1, min_size=im_size, max_size=im_size
)
detection_learner = DetectionLearner(model=model)
detection_learner.load(name=name, path=path)
return detection_learner
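# Illustrative sketch (names are placeholders): restore a learner for
# inference only, without the original dataset, e.g.
#
#     detector = DetectionLearner.from_saved_model("my_detector", path="outputs")
#     detections = detector.predict(im)                  # labels are read from meta.json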


@ -56,7 +56,12 @@ def plot_boxes(
"""
if len(bboxes) > 0:
draw = ImageDraw.Draw(im)
font = get_font(size=plot_settings.text_size)
for bbox in bboxes:
# do not draw background bounding boxes
if bbox.label_idx == 0:
continue
box = [(bbox.left, bbox.top), (bbox.right, bbox.bottom)]
@ -67,9 +72,6 @@ def plot_boxes(
width=plot_settings.rect_th,
)
# gets font
font = get_font(size=plot_settings.text_size)
# write prediction class
draw.text(
(bbox.left, bbox.top),