Staging (#390)
* DOCKER: add Dockerfile
* DOCKER: update dockerfile
* DOCKER: update dockerfile
* DOCKER: path
* DOCKER: add cv docker file
* remove the tracking pipeline yml file
* README updates (#358)
* Updating environment.yml file in master (#323)
* readme updates
* mv media to scenarios folder
* fixes
* Update README.md
* simplification of language, removing redundancy
* added target audience section
* Update SETUP.md
* Update README.md
* Update environment.yml
* Update SETUP.md
* env-update (#359)
* Hyperdrive notebook updates (#356) All tests are passing (except for unrelated AML deployment notebooks)
* transforms fix (#360)
* Updating environment.yml file in master (#323)
* fix for dataset transformations
* remove extra cython in conda
* pr comments'
* refactor to use transform in class param
* remove todo
* update to transformer
* added functionality to show transformations and updated notebook
* Update FAQ.md
* Adding contrib placeholder (#370)
* DOCKER: update readme
* adding missing lib dir
* add i3d
* Adding hard negative sampling notebook (#367)
* DOCKER: use create instead of update
* Add example gif to action recognition readme (#374)
* code clean up
* add i3d
* code clean up
* add action_recognition README content
* add instructions and headers
* fix conflicts
* DOCKER: remove base env bin path
* save/load detection code for deployment (#380)
* Updating environment.yml file in master (#323)
* save/load
* load/save
* load/save
* remove cython duplicate
* remove comment
* docstring
* tests for loading/saving
* label bug
* Syntax issues on lines 07 & 115 (#378)
* Updating environment.yml file in master (#323)
* update maximum time
* Restore example figures (#357)
* Staging (#365)
* README updates (#358)
* Updating environment.yml file in master (#323)
* readme updates
* mv media to scenarios folder
* fixes
* Update README.md
* simplification of language, removing redundancy
* added target audience section
* Update SETUP.md
* Update README.md
* Update environment.yml
* Update SETUP.md
* env-update (#359)
* Hyperdrive notebook updates (#356) All tests are passing (except for unrelated AML deployment notebooks)
* Syntax issues on lines 07 & 115
* Update README.md
* Update environment.yml
* detection deploy model.py update (#381)
* Updating environment.yml file in master (#323)
* save/load
* load/save
* load/save
* remove cython duplicate
* remove comment
* docstring
* tests for loading/saving
* label bug
* initial notebook
* minor update to model.py
* revert hns nb
* rm nb
* ap at iou 0.5 (#385)
* Updating environment.yml file in master (#323)
* added ap_iou_05
* remove cython
* bug fix
* windows testing fix and other testing bugs (#383)
* Update azure-pipeline-windows-cpu.yml
* Update azure-pipeline-windows-gpu.yml
* Update test_integration_similarity_notebooks.py
* Update test_detection_notebooks.py
* 00 notebook (#386)
* Updating environment.yml file in master (#323)
* 00 notebook update
* remove extra dependency
* remove cython
* remove typo
* Update 10_image_annotation.ipynb Added link to Azure annotation tool
README.md
|
@ -1,4 +1,5 @@
|
|||
# Computer Vision
|
||||
|
||||
In recent years, we've seen an extraordinary growth in Computer Vision, with applications in face recognition, image understanding, search, drones, mapping, semi-autonomous and autonomous vehicles. Key to many of these applications are visual recognition tasks such as image classification, object detection and image similarity.
|
||||
|
||||
This repository provides examples and best practice guidelines for building computer vision systems. The goal of this repository is to build a comprehensive set of tools and examples that leverage recent advances in Computer Vision algorithms and neural architectures, and show how to operationalize such systems. Rather than creating implementations from scratch, we draw from existing state-of-the-art libraries and build additional utility around loading image data, optimizing and evaluating models, and scaling up to the cloud. In addition, having worked in this space for many years, we aim to answer common questions, point out frequently observed pitfalls, and show how to use the cloud for training and deployment.
|
||||
|
@ -20,13 +21,17 @@ notebooks in this repo. Once your environment is setup, navigate to the
|
|||
|
||||
## Scenarios
|
||||
|
||||
The following is a summary of commonly used Computer Vision scenarios that are covered in this repository. For each of these scenarios, we give you the tools to effectively build your own model. This includes simple tasks such as fine-tuning your own model on your own data, to more complex tasks such as hard-negative mining and even model deployment. See all supported scenarios [here](scenarios).
|
||||
The following is a summary of commonly used Computer Vision scenarios that are covered in this repository. For each of the main scenarios ("base"), we provide the tools to effectively build your own model. These range from simple tasks, such as fine-tuning a model on your own data, to more complex ones such as hard-negative mining and model deployment.
|
||||
|
||||
| Scenario | Description |
|
||||
| -------- | ----------- |
|
||||
| [Classification](scenarios/classification) | Image Classification is a supervised machine learning technique that allows you to learn and predict the category of a given image. |
|
||||
| [Similarity](scenarios/similarity) | Image Similarity is a way to compute a similarity score given a pair of images. Given an image, it allows you to identify the most similar image in a given dataset. |
|
||||
| [Detection](scenarios/detection) | Object Detection is a supervised machine learning technique that allows you to detect the bounding box of an object within an image. |
|
||||
| Scenario | Support | Description |
|
||||
| -------- | ----------- | ----------- |
|
||||
| [Classification](scenarios/classification) | Base | Image Classification is a supervised machine learning technique that allows you to learn and predict the category of a given image. |
|
||||
| [Similarity](scenarios/similarity) | Base | Image Similarity is a way to compute a similarity score given a pair of images. Given an image, it allows you to identify the most similar image in a given dataset. |
|
||||
| [Detection](scenarios/detection) | Base | Object Detection is a supervised machine learning technique that allows you to detect the bounding box of an object within an image. |
|
||||
| [Action recognition](contrib/action_recognition) | Contrib | COMING SOON. Action recognition identifies which actions are performed in video/webcam footage (e.g. "running", "opening a bottle") and their respective start/end times.|
|
||||
| [Crowd counting](contrib/crowd_counting) | Contrib | COMING SOON. Counting the number of people in low-crowd-density (e.g. less than 10 people) and high-crowd-density (e.g. thousands of people) scenarios.|
|
||||
|
||||
We separate the supported CV scenarios into two locations: (i) **base**: code and notebooks within the "utils_cv" and "scenarios" folders, which follow strict coding guidelines and are well tested and maintained; (ii) **contrib**: code and other assets within the "contrib" folder, mainly covering less common CV scenarios using bleeding-edge state-of-the-art approaches. Code in "contrib" is not regularly tested or maintained.
|
||||
|
||||
## Computer Vision on Azure
|
||||
|
||||
|
@ -45,7 +50,7 @@ If you need to train your own model, the following services and links provide ad
|
|||
- [Azure Machine Learning service (AzureML)](https://azure.microsoft.com/en-us/services/machine-learning-service/)
|
||||
is a service that helps users accelerate the training and deployment of machine learning models. While not specific to computer vision workloads, the AzureML Python SDK can be used for scalable and reliable training and deployment of machine learning solutions to the cloud. We leverage Azure Machine Learning in several of the notebooks within this repository (e.g. [deployment to Azure Kubernetes Service](classification/notebooks/22_deployment_on_azure_kubernetes_service.ipynb)).
|
||||
|
||||
- [Azure AI Reference architectures](https://docs.microsoft.com/en-us/azure/architecture/reference-architectures/ai/training-python-models)
|
||||
- [Azure AI Reference architectures](https://docs.microsoft.com/en-us/azure/architecture/reference-architectures/ai/training-python-models)
|
||||
provide a set of examples (backed by code) of how to build common AI-oriented workloads that leverage multiple cloud components. While not computer vision specific, these reference architectures cover several machine learning workloads such as model deployment or batch scoring.
|
||||
|
||||
## Build Status
|
||||
|
@ -62,8 +67,8 @@ provide a set of examples (backed by code) of how to build common AI-oriented wo
|
|||
|
||||
### AzureML Testing
|
||||
|
||||
| Build Type | Branch | Status | | Branch | Status |
|
||||
| --- | --- | --- | --- | --- | --- |
|
||||
| Build Type | Branch | Status | | Branch | Status |
|
||||
| --- | --- | --- | --- | --- | --- |
|
||||
| **Linux GPU** | master | [![Build Status](https://dev.azure.com/best-practices/computervision/_apis/build/status/azureml/bp-azureml-unit-test-linux-gpu?branchName=master)](https://dev.azure.com/best-practices/computervision/_build/latest?definitionId=41&branchName=master) | | staging | [![Build Status](https://dev.azure.com/best-practices/computervision/_apis/build/status/azureml/bp-azureml-unit-test-linux-gpu?branchName=staging)](https://dev.azure.com/best-practices/computervision/_build/latest?definitionId=41&branchName=staging)|
|
||||
| **Linux CPU** | master | [![Build Status](https://dev.azure.com/best-practices/computervision/_apis/build/status/azureml/aml-unit-test-linux-cpu?branchName=master)](https://dev.azure.com/best-practices/computervision/_build/latest?definitionId=37&branchName=master) | | staging | [![Build Status](https://dev.azure.com/best-practices/computervision/_apis/build/status/azureml/aml-unit-test-linux-cpu?branchName=staging)](https://dev.azure.com/best-practices/computervision/_build/latest?definitionId=37&branchName=staging)|
|
||||
| **Notebook unit GPU** | master | [![Build Status](https://dev.azure.com/best-practices/computervision/_apis/build/status/azureml/azureml-unit-test-linux-nb-gpu?branchName=master)](https://dev.azure.com/best-practices/computervision/_build/latest?definitionId=42&branchName=master) | | staging | [![Build Status](https://dev.azure.com/best-practices/computervision/_apis/build/status/azureml/azureml-unit-test-linux-nb-gpu?branchName=staging)](https://dev.azure.com/best-practices/computervision/_build/latest?definitionId=42&branchName=staging) |
|
||||
|
@ -73,5 +78,3 @@ provide a set of examples (backed by code) of how to build common AI-oriented wo
|
|||
## Contributing
|
||||
This project welcomes contributions and suggestions. Please see our [contribution guidelines](CONTRIBUTING.md).
|
||||
|
||||
|
||||
|
||||
|
|
SETUP.md
|
@ -3,7 +3,7 @@
|
|||
This document describes how to setup all the dependencies to run the notebooks
|
||||
in this repository.
|
||||
|
||||
Many computer visions scenarios are extremely computationlly heavy. Training a
|
||||
Many computer vision scenarios are extremely computationally heavy. Training a
|
||||
model often requires a machine with a GPU, as training would otherwise be too slow.
|
||||
We recommend using the GPU-enabled [Azure Data Science Virtual Machine (DSVM)](https://azure.microsoft.com/en-us/services/virtual-machines/data-science-virtual-machines/) since it comes prepared with a lot of the prerequisites needed to efficiently do computer vision.
|
||||
|
||||
|
@ -112,10 +112,11 @@ $ssh -L local_port:remote_address:remote_port <username>@<server-ip>
|
|||
|
||||
For example, if I want to run `jupyter notebook --port 8888` on my VM and I
|
||||
wish to view the Jupyter notebooks in my local browser at `localhost:9999`, I
|
||||
would ssh into my VM using the following commend:
|
||||
would ssh into my VM using the following command:
|
||||
|
||||
```
|
||||
$ssh -L 9999:localhost:8888 <username>@<server-ip>
|
||||
```
|
||||
|
||||
This command will allow your local machine's port 9999 to access your remote
|
||||
machine's port 8888.
|
||||
This command will allow your local machine's port `9999` to access your remote
|
||||
machine's port `8888`.
|
||||
|
|
|
@ -7,4 +7,5 @@ Each project should live in its own subdirectory ```/contrib/<project>``` and co
|
|||
|
||||
| Directory | Project description |
|
||||
|---|---|
|
||||
| vm_builder | This script helps users easily create an Ubuntu Data Science Virtual Machine with a GPU with the Computer Vision repo installed and ready to be used. If you find the script to be out-dated or not working, you can create the VM using the Azure portal or the Azure CLI tool with a few more steps. |
|
||||
| [Action recognition](action_recognition) | COMING SOON. Action recognition identifies which actions are performed in video/webcam footage (e.g. "running", "opening a bottle") and their respective start/end times.|
|
||||
| [vm_builder](vm_builder) | This script helps users easily create a GPU-enabled Ubuntu Data Science Virtual Machine with the Computer Vision repo installed and ready to use. If you find the script to be outdated or not working, you can create the VM using the Azure portal or the Azure CLI tool with a few more steps. |
|
||||
|
|
|
@ -0,0 +1,66 @@
|
|||
# Action Recognition
|
||||
|
||||
This is a placeholder. Content will follow soon.
|
||||
|
||||
![](./media/action_recognition.gif)
|
||||
|
||||
*Example of action recognition*
|
||||
|
||||
## Overview
|
||||
|
||||
| Folders | Description |
|
||||
| -------- | ----------- |
|
||||
| [i3d](i3d) | Scripts for fine-tuning a pre-trained Two-Stream Inflated 3D ConvNet (I3D) model on the HMDB-51 dataset |
|
||||
| [video_annotation](video_annotation) | Instructions and helper functions to annotate the start and end position of actions in video footage|
|
||||
|
||||
## Functionality
|
||||
|
||||
In [i3d](i3d) we show how to fine-tune a Two-Stream Inflated 3D ConvNet (I3D) model. This model was introduced in \[[1](https://arxiv.org/pdf/1705.07750.pdf)\] and achieved state-of-the-art results in action classification on the HMDB-51 and UCF-101 datasets. The paper demonstrated the effectiveness of pre-training action recognition models on large datasets - in this case the Kinetics Human Action Video dataset, which consists of 306k examples spanning 400 classes. We provide code for replicating the results of this paper on HMDB-51, using models pre-trained on Kinetics from [https://github.com/piergiaj/pytorch-i3d](https://github.com/piergiaj/pytorch-i3d). Evaluating the model on the test set of the HMDB-51 dataset (split 1) using [i3d/test.py](i3d/test.py) should yield the following results:
|
||||
|
||||
| Model | Paper top-1 accuracy (average over 3 splits) | Our model's top-1 accuracy (split 1 only) |
|
||||
| ------- | -------| ------- |
|
||||
| RGB | 74.8 | 73.7 |
|
||||
| Optical flow | 77.1 | 77.5 |
|
||||
| Two-Stream | 80.7 | 81.2 |
|
||||
|
||||
In order to train an action recognition model for a specific task, annotated training data from the relevant domain is needed. In [video_annotation](video_annotation), we provide tips and examples for how to use a best-in-class video annotation tool ([VGG Image Annotator](http://www.robots.ox.ac.uk/~vgg/software/via/)) to label the start and end positions of actions in videos.
|
||||
|
||||
## State-of-the-art
|
||||
|
||||
In the tables below, we list datasets which are commonly used and also give an overview of the state-of-the-art. Note that the information below is reasonably exhaustive and should cover most major publications until 2018. Expect, however, some level of incompleteness and slight incorrectness (e.g. publication years being off by plus/minus 1 year), since the tables below were mainly compiled to give a high-level picture of where the field is and how it has evolved over the last few years.
|
||||
|
||||
Recommended reading:
|
||||
- As an introduction to action recognition, the blog [Deep Learning for Videos: A 2018 Guide to Action Recognition](http://blog.qure.ai/notes/deep-learning-for-videos-action-recognition-review).
|
||||
- [ActionRecognition.net](http://actionrecognition.net/files/dset.php) for the latest state-of-the-art accuracies on popular research benchmark datasets.
|
||||
- All papers highlighted in yellow in the publications table below.
|
||||
|
||||
Popular datasets:
|
||||
|
||||
| Name | Year | Number of classes | #Clips | Average length per video | Notes |
| ----- | ----- | ----------------- | ------- | ------------------------- | ----------- |
| KTH | 2004 | 6 | 600 | | |
| Weizmann | 2005 | 9 | 81 | | |
| HMDB-51 | 2011 | 51 | 6.8k | | |
| UCF-101 | 2012 | 101 | 13.3k | 7 sec (min: 1 sec, max: 71 sec) | |
| Sports-1M | 2014 | 487 | 1M | | |
| THUMOS14 | 2014 | 101 | 18k | (total: 254h) | Dataset for temporal action |
| ActivityNet | 2015 | 200 | 28.1k | ~1 min 40 sec | |
| Charades | 2016 | 157 | 66.5k from 9848 videos | Each video (not action) is 30 seconds | Daily tasks; classification and temporal localization challenges |
| Youtube-8M | 2016 | 4800 | | | Not an action dataset, but rather a classification one (i.e. which objects occur in each video). Additional videos added in 2018. |
| Kinetics-400 | 2017 | 400 | 306k | 10 sec | |
| Kinetics-600 | 2018 | 600 | 496k | | |
| Something-Something | 2017 | 174 | 110k | 2-6 sec | Low-level actions, e.g. "pushing something left to right". Additional videos added in 2019. |
| AVA | 2018 | 80 | 1.6M in 430 videos | 15 min per video | 1 frame annotated per second with the location of each person and, for each person, one of 80 "atomic" actions; people annotations are combined into tracks. |
| Youtube-8M Segments | 2019 | 1000 | 237k | 5 sec | Used for a localization Kaggle challenge. Appears to focus on objects, not actions. |
|
||||
|
||||
|
||||
|
||||
Popular publications, with recommended papers to read highlighted in yellow:
|
||||
<img align="center" src="./media/publications.png"/>
|
||||
|
||||
|
||||
Most publications focus on accuracy rather than on inference speed. The paper "Representation Flow for Action Recognition" is a noteworthy exception, with this figure:
|
||||
<img align="center" src="./media/inference_speeds.png" width = "500"/>
|
||||
|
||||
\[1\] J. Carreira and A. Zisserman. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset. In CVPR, 2017.
|
|
@ -0,0 +1,7 @@
|
|||
__pycache__/
|
||||
models/__pycache__/
|
||||
log/
|
||||
.vscode/
|
||||
checkpoints/
|
||||
pretrained_models/
|
||||
inference/.ipynb_checkpoints/
|
|
@ -0,0 +1,61 @@
|
|||
## Fine-tuning I3D model on HMDB-51
|
||||
|
||||
In this section we provide code for training a Two-Stream Inflated 3D ConvNet (I3D), introduced in \[[1](https://arxiv.org/pdf/1705.07750.pdf)\]. The code uses the PyTorch models provided in [https://github.com/piergiaj/pytorch-i3d](https://github.com/piergiaj/pytorch-i3d), which have been pre-trained on the Kinetics Human Action Video dataset, and fine-tunes them on the HMDB-51 action recognition dataset. The I3D model consists of two "streams", which are independently trained models: one stream takes the RGB image frames of a video as input, while the other takes pre-computed optical flow as input. At test time, the outputs of the two stream models are averaged to make the final prediction (a minimal sketch of this averaging is shown after the results table below). The model results are as follows:
|
||||
|
||||
| Model | Paper top-1 accuracy (average over 3 splits) | Our model's top-1 accuracy (split 1 only) |
|
||||
| ------- | -------| ------- |
|
||||
| RGB | 74.8 | 73.7 |
|
||||
| Optical flow | 77.1 | 77.5 |
|
||||
| Two-Stream | 80.7 | 81.2 |
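
To make the test-time fusion concrete, here is a minimal sketch of the averaging step. The logit tensors and their shapes are illustrative assumptions; in [test.py](test.py) they are obtained from the RGB and flow data loaders, with each model's raw output first averaged over the temporal dimension.

```
import torch

# Hypothetical per-clip logits from the two independently trained streams,
# each of shape (num_clips, num_classes).
rgb_logits = torch.randn(10, 51)
flow_logits = torch.randn(10, 51)

# Two-stream fusion: average the stream outputs, then take the arg-max class.
combined = torch.mean(torch.stack([rgb_logits, flow_logits]), dim=0)
predicted_classes = combined.argmax(dim=1)
```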
|
||||
|
||||
## Download and pre-process HMDB-51 data
|
||||
|
||||
Download the HMDB-51 video database from [here](http://serre-lab.clps.brown.edu/resource/hmdb-a-large-human-motion-database/). Extract the videos with
|
||||
```
|
||||
mkdir rars && mkdir videos
|
||||
unrar x hmdb51-org.rar rars/
|
||||
for a in $(ls rars); do unrar x "rars/${a}" videos/; done;
|
||||
```
|
||||
|
||||
Use the code provided in [https://github.com/yjxiong/temporal-segment-networks](https://github.com/yjxiong/temporal-segment-networks) to preprocess the raw videos, splitting them into RGB frames and computing optical flow frames:
|
||||
```
|
||||
git clone https://github.com/yjxiong/temporal-segment-networks
|
||||
cd temporal-segment-networks
|
||||
bash scripts/extract_optical_flow.sh /path/to/hmdb51/videos /path/to/rawframes/output
|
||||
```
|
||||
Edit the `_C.DATASET.DIR` option in [default.py](default.py) to point to the rawframes input data directory.
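
Alternatively, because the configuration is a [yacs](https://github.com/rbgirshick/yacs) `CfgNode`, the data directory can be overridden at run time instead of editing [default.py](default.py). A minimal sketch, using the repository's `update_config` helper and a hypothetical data path:

```
from default import _C as config
from default import update_config

# Merge a config file, then override DATASET.DIR without touching default.py.
update_config(
    config,
    options=["DATASET.DIR", "/path/to/rawframes/"],  # hypothetical path
    config_file="config/train_rgb.yaml",
)
print(config.DATASET.DIR)
```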
|
||||
|
||||
## Setup environment
|
||||
Create and activate the conda environment:
|
||||
|
||||
```
|
||||
conda env create -f environment.yaml
|
||||
conda activate i3d
|
||||
```
|
||||
|
||||
## Download pretrained models
|
||||
Download the pretrained model weights (Kinetics pre-trained and HMDB-51 fine-tuned checkpoints):
|
||||
|
||||
```
|
||||
bash download_models.sh
|
||||
```
|
||||
|
||||
## Fine-tune pretrained models on HMDB-51
|
||||
|
||||
Train RGB model
|
||||
```
|
||||
python train.py --cfg config/train_rgb.yaml
|
||||
```
|
||||
|
||||
Train flow model
|
||||
```
|
||||
python train.py --cfg config/train_flow.yaml
|
||||
```
|
||||
|
||||
Evaluate combined model
|
||||
```
|
||||
python test.py
|
||||
```
|
||||
|
||||
\[1\] J. Carreira and A. Zisserman. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset. In CVPR, 2017.
|
|
@ -0,0 +1,4 @@
|
|||
MODEL:
|
||||
NAME: "i3d_flow"
|
||||
TRAIN:
|
||||
MODALITY: "flow"
|
|
@ -0,0 +1,4 @@
|
|||
MODEL:
|
||||
NAME: "i3d_rgb"
|
||||
TRAIN:
|
||||
MODALITY: "RGB"
|
|
@ -0,0 +1,244 @@
|
|||
# Copyright (c) Microsoft Corporation. All rights reserved.
|
||||
# Licensed under the MIT License.
|
||||
|
||||
# Adapted from https://github.com/feiyunzhang/i3d-non-local-pytorch/blob/master/dataset.py
|
||||
|
||||
import torch.utils.data as data
|
||||
import torch
|
||||
|
||||
from PIL import Image
|
||||
import os
|
||||
import os.path
|
||||
import numpy as np
|
||||
from numpy.random import randint
|
||||
from pathlib import Path
|
||||
|
||||
import torchvision
|
||||
from torchvision import datasets, transforms
|
||||
from videotransforms import (
|
||||
GroupRandomCrop, GroupRandomHorizontalFlip,
|
||||
GroupScale, GroupCenterCrop, GroupNormalize, Stack
|
||||
)
|
||||
|
||||
from itertools import cycle
|
||||
|
||||
|
||||
class VideoRecord(object):
|
||||
def __init__(self, row):
|
||||
self._data = row
|
||||
|
||||
@property
|
||||
def path(self):
|
||||
return self._data[0]
|
||||
|
||||
@property
|
||||
def num_frames(self):
|
||||
return int(
|
||||
len([x for x in Path(
|
||||
self._data[0]).glob('img_*')])-1)
|
||||
|
||||
@property
|
||||
def label(self):
|
||||
return int(self._data[1])
|
||||
|
||||
|
||||
class I3DDataSet(data.Dataset):
|
||||
def __init__(self, data_root, split=1, sample_frames=64,
|
||||
modality='RGB', transform=lambda x:x,
|
||||
train_mode=True, sample_frames_at_test=False):
|
||||
|
||||
self.data_root = data_root
|
||||
self.split = split
|
||||
self.sample_frames = sample_frames
|
||||
self.modality = modality
|
||||
self.transform = transform
|
||||
self.train_mode = train_mode
|
||||
self.sample_frames_at_test = sample_frames_at_test
|
||||
|
||||
self._parse_split_files()
|
||||
|
||||
|
||||
def _parse_split_files(self):
|
||||
# class labels assigned by sorting the file names in /data/hmdb51_splits directory
|
||||
file_list = sorted(Path('./data/hmdb51_splits').glob('*'+str(self.split)+'.txt'))
|
||||
video_list = []
|
||||
for class_idx, f in enumerate(file_list):
|
||||
class_name = str(f).strip().split('/')[2][:-16]
|
||||
for line in open(f):
|
||||
tokens = line.strip().split(' ')
|
||||
video_path = self.data_root+class_name+'/'+tokens[0][:-4]
|
||||
record = (video_path, class_idx)
|
||||
# 1 indicates video should be in training set
|
||||
if self.train_mode & (tokens[-1] == '1'):
|
||||
video_list.append(VideoRecord(record))
|
||||
# 2 indicates video should be in test set
|
||||
elif (self.train_mode == False) & (tokens[-1] == '2'):
|
||||
video_list.append(VideoRecord(record))
|
||||
|
||||
self.video_list = video_list
|
||||
|
||||
|
||||
def _load_image(self, directory, idx):
|
||||
if self.modality == 'RGB':
|
||||
img_path = os.path.join(directory, 'img_{:05}.jpg'.format(idx))
|
||||
try:
|
||||
img = Image.open(img_path).convert('RGB')
|
||||
except:
|
||||
print("Couldn't load image:{}".format(img_path))
|
||||
return None
|
||||
return img
|
||||
else:
|
||||
try:
|
||||
img_path = os.path.join(directory, 'flow_x_{:05}.jpg'.format(idx))
|
||||
x_img = Image.open(img_path).convert('L')
|
||||
except:
|
||||
print("Couldn't load image:{}".format(img_path))
|
||||
return None
|
||||
try:
|
||||
img_path = os.path.join(directory, 'flow_y_{:05}.jpg'.format(idx))
|
||||
y_img = Image.open(img_path).convert('L')
|
||||
except:
|
||||
print("Couldn't load image:{}".format(img_path))
|
||||
return None
|
||||
# Combine flow images into single PIL image
|
||||
x_img = np.array(x_img, dtype=np.float32)
|
||||
y_img = np.array(y_img, dtype=np.float32)
|
||||
img = np.asarray([x_img, y_img]).transpose([1, 2, 0])
|
||||
img = Image.fromarray(img.astype('uint8'))
|
||||
return img
|
||||
|
||||
|
||||
def _sample_indices(self, record):
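# Randomly pick a contiguous window of sample_frames frames; shorter videos are looped to the required length via _loop_indices below.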
|
||||
if record.num_frames > self.sample_frames:
|
||||
start_pos = randint(record.num_frames - self.sample_frames + 1)
|
||||
indices = range(start_pos, start_pos + self.sample_frames, 1)
|
||||
else:
|
||||
indices = [x for x in range(record.num_frames)]
|
||||
if len(indices) < self.sample_frames:
|
||||
self._loop_indices(indices)
|
||||
return indices
|
||||
|
||||
|
||||
def _loop_indices(self, indices):
|
||||
indices_cycle = cycle(indices)
|
||||
while len(indices) < self.sample_frames:
|
||||
indices.append(next(indices_cycle))
|
||||
|
||||
|
||||
def __getitem__(self, index):
|
||||
record = self.video_list[index]
|
||||
# Sample frames from the video for training, or if sampling
|
||||
# turned on at test time
|
||||
if self.train_mode or self.sample_frames_at_test:
|
||||
segment_indices = self._sample_indices(record)
|
||||
else:
|
||||
segment_indices = [i for i in range(record.num_frames)]
|
||||
# Image files are 1-indexed
|
||||
segment_indices = [i+1 for i in segment_indices]
|
||||
# Get video frame images
|
||||
images = []
|
||||
for i in segment_indices:
|
||||
seg_img = self._load_image(record.path, i)
|
||||
if seg_img is None:
|
||||
raise ValueError("Couldn't load", record.path, i)
|
||||
images.append(seg_img)
|
||||
# Apply transformations
|
||||
transformed_images = self.transform(images)
|
||||
|
||||
return transformed_images, record.label
|
||||
|
||||
|
||||
def __len__(self):
|
||||
return len(self.video_list)
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
|
||||
input_size = 224
|
||||
resize_small_edge = 256
|
||||
|
||||
train_rgb = I3DDataSet(
|
||||
data_root='/datadir/rawframes/',
|
||||
split=1,
|
||||
sample_frames = 64,
|
||||
modality='RGB',
|
||||
train_mode=True,
|
||||
sample_frames_at_test=False,
|
||||
transform=torchvision.transforms.Compose([
|
||||
GroupScale(resize_small_edge),
|
||||
GroupRandomCrop(input_size),
|
||||
GroupRandomHorizontalFlip(),
|
||||
GroupNormalize(modality="RGB"),
|
||||
Stack(),
|
||||
])
|
||||
)
|
||||
item = train_rgb.__getitem__(10)
|
||||
print("train_rgb:")
|
||||
print(item[0].size())
|
||||
print("max=", item[0].max())
|
||||
print("min=", item[0].min())
|
||||
print("label=",item[1])
|
||||
|
||||
val_rgb = I3DDataSet(
|
||||
data_root='/datadir/rawframes/',
|
||||
split=1,
|
||||
sample_frames = 64,
|
||||
modality='RGB',
|
||||
train_mode=False,
|
||||
sample_frames_at_test=False,
|
||||
transform=torchvision.transforms.Compose([
|
||||
GroupScale(resize_small_edge),
|
||||
GroupCenterCrop(input_size),
|
||||
GroupNormalize(modality="RGB"),
|
||||
Stack(),
|
||||
])
|
||||
)
|
||||
item = val_rgb.__getitem__(10)
|
||||
print("val_rgb:")
|
||||
print(item[0].size())
|
||||
print("max=", item[0].max())
|
||||
print("min=", item[0].min())
|
||||
print("label=",item[1])
|
||||
|
||||
train_flow = I3DDataSet(
|
||||
data_root='/datadir/rawframes/',
|
||||
split=1,
|
||||
sample_frames = 64,
|
||||
modality='flow',
|
||||
train_mode=True,
|
||||
sample_frames_at_test=False,
|
||||
transform=torchvision.transforms.Compose([
|
||||
GroupScale(resize_small_edge),
|
||||
GroupRandomCrop(input_size),
|
||||
GroupRandomHorizontalFlip(),
|
||||
GroupNormalize(modality="flow"),
|
||||
Stack(),
|
||||
])
|
||||
)
|
||||
item = train_flow.__getitem__(100)
|
||||
print("train_flow:")
|
||||
print(item[0].size())
|
||||
print("max=", item[0].max())
|
||||
print("min=", item[0].min())
|
||||
print("label=",item[1])
|
||||
|
||||
val_flow = I3DDataSet(
|
||||
data_root='/datadir/rawframes/',
|
||||
split=1,
|
||||
sample_frames = 64,
|
||||
modality='flow',
|
||||
train_mode=False,
|
||||
sample_frames_at_test=False,
|
||||
transform=torchvision.transforms.Compose([
|
||||
GroupScale(resize_small_edge),
|
||||
GroupCenterCrop(input_size),
|
||||
GroupNormalize(modality="flow"),
|
||||
Stack(),
|
||||
])
|
||||
)
|
||||
item = val_flow.__getitem__(100)
|
||||
print("val_flow:")
|
||||
print(item[0].size())
|
||||
print("max=", item[0].max())
|
||||
print("min=", item[0].min())
|
||||
print("label=",item[1])
|
|
@ -0,0 +1,73 @@
|
|||
# Copyright (c) Microsoft Corporation. All rights reserved.
|
||||
# Licensed under the MIT License.
|
||||
|
||||
from __future__ import absolute_import
|
||||
from __future__ import division
|
||||
from __future__ import print_function
|
||||
|
||||
import os
|
||||
|
||||
from yacs.config import CfgNode as CN
|
||||
|
||||
|
||||
_C = CN()
|
||||
|
||||
_C.LOG_DIR = "log"
|
||||
_C.WORKERS = 16
|
||||
_C.PIN_MEMORY = True
|
||||
_C.SEED = 42
|
||||
|
||||
# Cudnn related params
|
||||
_C.CUDNN = CN()
|
||||
_C.CUDNN.BENCHMARK = True
|
||||
|
||||
# Dataset
|
||||
_C.DATASET = CN()
|
||||
_C.DATASET.SPLIT = 1
|
||||
_C.DATASET.DIR = "/datadir/rawframes/"
|
||||
_C.DATASET.NUM_CLASSES = 51
|
||||
|
||||
# NETWORK
|
||||
_C.MODEL = CN()
|
||||
_C.MODEL.NAME = "i3d_flow"
|
||||
_C.MODEL.PRETRAINED_RGB = "pretrained_models/rgb_imagenet_kinetics.pt"
|
||||
_C.MODEL.PRETRAINED_FLOW = "pretrained_models/flow_imagenet_kinetics.pt"
|
||||
_C.MODEL.CHECKPOINT_DIR = "checkpoints"
|
||||
|
||||
# Train
|
||||
_C.TRAIN = CN()
|
||||
_C.TRAIN.PRINT_FREQ = 50
|
||||
_C.TRAIN.INPUT_SIZE = 224
|
||||
_C.TRAIN.RESIZE_MIN = 256
|
||||
_C.TRAIN.SAMPLE_FRAMES = 64
|
||||
_C.TRAIN.MODALITY = "flow"
|
||||
_C.TRAIN.BATCH_SIZE = 24
|
||||
_C.TRAIN.GRAD_ACCUM_STEPS = 4
|
||||
_C.TRAIN.MAX_EPOCHS = 50
|
||||
|
||||
# Test
|
||||
_C.TEST = CN()
|
||||
_C.TEST.EVAL_FREQ = 5
|
||||
_C.TEST.PRINT_FREQ = 250
|
||||
_C.TEST.BATCH_SIZE = 1
|
||||
_C.TEST.MODALITY = "combined"
|
||||
_C.TEST.MODEL_RGB = "pretrained_models/rgb_hmdb_split1.pt"
|
||||
_C.TEST.MODEL_FLOW = "pretrained_models/flow_hmdb_split1.pt"
|
||||
|
||||
def update_config(cfg, options=None, config_file=None):
|
||||
cfg.defrost()
|
||||
|
||||
if config_file:
|
||||
cfg.merge_from_file(config_file)
|
||||
|
||||
if options:
|
||||
cfg.merge_from_list(options)
|
||||
|
||||
cfg.freeze()
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
import sys
|
||||
|
||||
with open(sys.argv[1], "w") as f:
|
||||
print(_C, file=f)
|
|
@ -0,0 +1,10 @@
|
|||
#!/usr/bin/env bash
|
||||
wget https://har.blob.core.windows.net/i3dmodels/flow_hmdb_split1.pt
|
||||
wget https://har.blob.core.windows.net/i3dmodels/rgb_hmdb_split1.pt
|
||||
wget https://har.blob.core.windows.net/i3dmodels/flow_imagenet_kinetics.pt
|
||||
wget https://har.blob.core.windows.net/i3dmodels/rgb_imagenet_kinetics.pt
|
||||
|
||||
mv flow_hmdb_split1.pt pretrained_models/flow_hmdb_split1.pt
|
||||
mv rgb_hmdb_split1.pt pretrained_models/rgb_hmdb_split1.pt
|
||||
mv flow_imagenet_kinetics.pt pretrained_models/flow_imagenet_kinetics.pt
|
||||
mv rgb_imagenet_kinetics.pt pretrained_models/rgb_imagenet_kinetics.pt
|
|
@ -0,0 +1,20 @@
|
|||
name: i3d
|
||||
dependencies:
|
||||
- python=3.6.2
|
||||
- pandas
|
||||
- numpy
|
||||
- ipykernel
|
||||
- matplotlib
|
||||
- pip:
|
||||
- torch==1.2.0
|
||||
- torchvision
|
||||
- pillow
|
||||
- fire
|
||||
- tensorboardX
|
||||
- tensorboard
|
||||
- yacs
|
||||
- opencv-contrib-python-headless
|
||||
|
||||
channels:
|
||||
- conda-forge
|
||||
- anaconda
|
|
@ -0,0 +1,97 @@
|
|||
# Copyright (c) Microsoft Corporation. All rights reserved.
|
||||
# Licensed under the MIT License.
|
||||
|
||||
from pathlib import Path
|
||||
from PIL import Image
|
||||
import torch
|
||||
import torch.nn as nn
|
||||
import torch.nn.functional as F
|
||||
import torchvision
|
||||
from torchvision import datasets, transforms
|
||||
|
||||
from videotransforms import (
|
||||
GroupScale, GroupCenterCrop, GroupNormalize, Stack
|
||||
)
|
||||
from models.pytorch_i3d import InceptionI3d
|
||||
from dataset import I3DDataSet
|
||||
from default import _C as config  # needed by construct_input below
from test import load_model
|
||||
|
||||
|
||||
def load_image(frame_file):
|
||||
try:
|
||||
img = Image.open(frame_file).convert('RGB')
|
||||
return img
|
||||
except:
|
||||
print("Couldn't load image:{}".format(frame_file))
|
||||
return None
|
||||
|
||||
|
||||
def load_frames(frame_paths):
|
||||
frame_list = []
|
||||
for frame in frame_paths:
|
||||
frame_list.append(load_image(frame))
|
||||
return frame_list
|
||||
|
||||
|
||||
def construct_input(frame_list):
|
||||
|
||||
transform = torchvision.transforms.Compose([
|
||||
GroupScale(config.TRAIN.RESIZE_MIN),
|
||||
GroupCenterCrop(config.TRAIN.INPUT_SIZE),
|
||||
GroupNormalize(modality="RGB"),
|
||||
Stack(),
|
||||
])
|
||||
|
||||
process_data = transform(frame_list)
|
||||
return process_data.unsqueeze(0)
|
||||
|
||||
|
||||
def predict_input(model, input):
|
||||
input = input.cuda(non_blocking=True)
|
||||
output = model(input)
|
||||
output = torch.mean(output, dim=2)
|
||||
return output
|
||||
|
||||
|
||||
def predict_over_video(video_frame_list, window_width=9, stride=1):
|
||||
|
||||
if window_width < 9:
|
||||
raise ValueError("window_width must be 9 or greater")
|
||||
|
||||
print("Loading model...")
|
||||
|
||||
model = load_model(
|
||||
modality="RGB",
|
||||
state_dict_file="pretrained_chkpt/rgb_hmdb_split1.pt"
|
||||
)
|
||||
|
||||
model.eval()
|
||||
|
||||
print("Predicting actions over {0} frames".format(len(video_frame_list)))
|
||||
|
||||
with torch.no_grad():
|
||||
|
||||
window_count = 0
|
||||
|
||||
for i in range(stride+window_width-1, len(video_frame_list), stride):
|
||||
window_frame_list = [video_frame_list[j] for j in range(i-window_width, i)]
|
||||
frames = load_frames(window_frame_list)
|
||||
batch = construct_input(frames)
|
||||
window_predictions = predict_input(model, batch)
|
||||
window_proba = F.softmax(window_predictions, dim=1)
|
||||
window_top_pred = window_proba.max(1)
|
||||
print(("Window:{0} Class pred:{1} Class proba:{2}".format(
|
||||
window_count,
|
||||
window_top_pred.indices.cpu().numpy()[0],
|
||||
window_top_pred.values.cpu().numpy()[0])
|
||||
))
|
||||
window_count += 1
|
||||
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
|
||||
# Provide list of filepaths to video frames
|
||||
frame_paths = []
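# For example (hypothetical directory layout, mirroring the rawframes output used elsewhere in this repo):
# frame_paths = sorted(str(p) for p in Path('/datadir/rawframes/brush_hair/clip01').glob('img_*.jpg'))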
|
||||
|
||||
predict_over_video(frame_paths, window_width=64, stride=32)
|
|
@ -0,0 +1,40 @@
|
|||
# Copyright (c) Microsoft Corporation. All rights reserved.
|
||||
# Licensed under the MIT License.
|
||||
|
||||
# From https://github.com/feiyunzhang/i3d-non-local-pytorch/blob/master/main.py
|
||||
|
||||
import torch
|
||||
|
||||
class AverageMeter(object):
|
||||
"""Computes and stores the average and current value"""
|
||||
def __init__(self):
|
||||
self.reset()
|
||||
|
||||
def reset(self):
|
||||
self.val = 0
|
||||
self.avg = 0
|
||||
self.sum = 0
|
||||
self.count = 0
|
||||
|
||||
def update(self, val, n=1):
|
||||
self.val = val
|
||||
self.sum += val * n
|
||||
self.count += n
|
||||
self.avg = self.sum / self.count
|
||||
|
||||
|
||||
def accuracy(output, target, topk=(1,)):
|
||||
"""Computes the accuracy over the k top predictions for the specified values of k"""
|
||||
with torch.no_grad():
|
||||
maxk = max(topk)
|
||||
batch_size = target.size(0)
|
||||
|
||||
_, pred = output.topk(maxk, 1, True, True)
|
||||
pred = pred.t()
|
||||
correct = pred.eq(target.view(1, -1).expand_as(pred))
|
||||
|
||||
res = []
|
||||
for k in topk:
|
||||
correct_k = correct[:k].view(-1).float().sum(0, keepdim=True)
|
||||
res.append(correct_k.mul_(100.0 / batch_size))
|
||||
return res
|
|
@ -0,0 +1,338 @@
|
|||
import torch
|
||||
import torch.nn as nn
|
||||
import torch.nn.functional as F
|
||||
from torch.autograd import Variable
|
||||
|
||||
import numpy as np
|
||||
|
||||
import os
|
||||
import sys
|
||||
from collections import OrderedDict
|
||||
|
||||
|
||||
class MaxPool3dSamePadding(nn.MaxPool3d):
|
||||
|
||||
def compute_pad(self, dim, s):
|
||||
if s % self.stride[dim] == 0:
|
||||
return max(self.kernel_size[dim] - self.stride[dim], 0)
|
||||
else:
|
||||
return max(self.kernel_size[dim] - (s % self.stride[dim]), 0)
|
||||
|
||||
def forward(self, x):
|
||||
# compute 'same' padding
|
||||
(batch, channel, t, h, w) = x.size()
|
||||
#print t,h,w
|
||||
out_t = np.ceil(float(t) / float(self.stride[0]))
|
||||
out_h = np.ceil(float(h) / float(self.stride[1]))
|
||||
out_w = np.ceil(float(w) / float(self.stride[2]))
|
||||
#print out_t, out_h, out_w
|
||||
pad_t = self.compute_pad(0, t)
|
||||
pad_h = self.compute_pad(1, h)
|
||||
pad_w = self.compute_pad(2, w)
|
||||
#print pad_t, pad_h, pad_w
|
||||
|
||||
pad_t_f = pad_t // 2
|
||||
pad_t_b = pad_t - pad_t_f
|
||||
pad_h_f = pad_h // 2
|
||||
pad_h_b = pad_h - pad_h_f
|
||||
pad_w_f = pad_w // 2
|
||||
pad_w_b = pad_w - pad_w_f
|
||||
|
||||
pad = (pad_w_f, pad_w_b, pad_h_f, pad_h_b, pad_t_f, pad_t_b)
|
||||
#print x.size()
|
||||
#print pad
|
||||
x = F.pad(x, pad)
|
||||
return super(MaxPool3dSamePadding, self).forward(x)
|
||||
|
||||
|
||||
class Unit3D(nn.Module):
|
||||
|
||||
def __init__(self, in_channels,
|
||||
output_channels,
|
||||
kernel_shape=(1, 1, 1),
|
||||
stride=(1, 1, 1),
|
||||
padding=0,
|
||||
activation_fn=F.relu,
|
||||
use_batch_norm=True,
|
||||
use_bias=False,
|
||||
name='unit_3d'):
|
||||
|
||||
"""Initializes Unit3D module."""
|
||||
super(Unit3D, self).__init__()
|
||||
|
||||
self._output_channels = output_channels
|
||||
self._kernel_shape = kernel_shape
|
||||
self._stride = stride
|
||||
self._use_batch_norm = use_batch_norm
|
||||
self._activation_fn = activation_fn
|
||||
self._use_bias = use_bias
|
||||
self.name = name
|
||||
self.padding = padding
|
||||
|
||||
self.conv3d = nn.Conv3d(in_channels=in_channels,
|
||||
out_channels=self._output_channels,
|
||||
kernel_size=self._kernel_shape,
|
||||
stride=self._stride,
|
||||
padding=0, # we always want padding to be 0 here. We will dynamically pad based on input size in forward function
|
||||
bias=self._use_bias)
|
||||
|
||||
if self._use_batch_norm:
|
||||
self.bn = nn.BatchNorm3d(self._output_channels, eps=0.001, momentum=0.01)
|
||||
|
||||
def compute_pad(self, dim, s):
|
||||
if s % self._stride[dim] == 0:
|
||||
return max(self._kernel_shape[dim] - self._stride[dim], 0)
|
||||
else:
|
||||
return max(self._kernel_shape[dim] - (s % self._stride[dim]), 0)
|
||||
|
||||
|
||||
def forward(self, x):
|
||||
# compute 'same' padding
|
||||
(batch, channel, t, h, w) = x.size()
|
||||
#print t,h,w
|
||||
out_t = np.ceil(float(t) / float(self._stride[0]))
|
||||
out_h = np.ceil(float(h) / float(self._stride[1]))
|
||||
out_w = np.ceil(float(w) / float(self._stride[2]))
|
||||
#print out_t, out_h, out_w
|
||||
pad_t = self.compute_pad(0, t)
|
||||
pad_h = self.compute_pad(1, h)
|
||||
pad_w = self.compute_pad(2, w)
|
||||
#print pad_t, pad_h, pad_w
|
||||
|
||||
pad_t_f = pad_t // 2
|
||||
pad_t_b = pad_t - pad_t_f
|
||||
pad_h_f = pad_h // 2
|
||||
pad_h_b = pad_h - pad_h_f
|
||||
pad_w_f = pad_w // 2
|
||||
pad_w_b = pad_w - pad_w_f
|
||||
|
||||
pad = (pad_w_f, pad_w_b, pad_h_f, pad_h_b, pad_t_f, pad_t_b)
|
||||
#print x.size()
|
||||
#print pad
|
||||
x = F.pad(x, pad)
|
||||
#print x.size()
|
||||
|
||||
x = self.conv3d(x)
|
||||
if self._use_batch_norm:
|
||||
x = self.bn(x)
|
||||
if self._activation_fn is not None:
|
||||
x = self._activation_fn(x)
|
||||
return x
|
||||
|
||||
|
||||
|
||||
class InceptionModule(nn.Module):
|
||||
def __init__(self, in_channels, out_channels, name):
|
||||
super(InceptionModule, self).__init__()
|
||||
|
||||
self.b0 = Unit3D(in_channels=in_channels, output_channels=out_channels[0], kernel_shape=[1, 1, 1], padding=0,
|
||||
name=name+'/Branch_0/Conv3d_0a_1x1')
|
||||
self.b1a = Unit3D(in_channels=in_channels, output_channels=out_channels[1], kernel_shape=[1, 1, 1], padding=0,
|
||||
name=name+'/Branch_1/Conv3d_0a_1x1')
|
||||
self.b1b = Unit3D(in_channels=out_channels[1], output_channels=out_channels[2], kernel_shape=[3, 3, 3],
|
||||
name=name+'/Branch_1/Conv3d_0b_3x3')
|
||||
self.b2a = Unit3D(in_channels=in_channels, output_channels=out_channels[3], kernel_shape=[1, 1, 1], padding=0,
|
||||
name=name+'/Branch_2/Conv3d_0a_1x1')
|
||||
self.b2b = Unit3D(in_channels=out_channels[3], output_channels=out_channels[4], kernel_shape=[3, 3, 3],
|
||||
name=name+'/Branch_2/Conv3d_0b_3x3')
|
||||
self.b3a = MaxPool3dSamePadding(kernel_size=[3, 3, 3],
|
||||
stride=(1, 1, 1), padding=0)
|
||||
self.b3b = Unit3D(in_channels=in_channels, output_channels=out_channels[5], kernel_shape=[1, 1, 1], padding=0,
|
||||
name=name+'/Branch_3/Conv3d_0b_1x1')
|
||||
self.name = name
|
||||
|
||||
def forward(self, x):
|
||||
b0 = self.b0(x)
|
||||
b1 = self.b1b(self.b1a(x))
|
||||
b2 = self.b2b(self.b2a(x))
|
||||
b3 = self.b3b(self.b3a(x))
|
||||
return torch.cat([b0,b1,b2,b3], dim=1)
|
||||
|
||||
|
||||
class InceptionI3d(nn.Module):
|
||||
"""Inception-v1 I3D architecture.
|
||||
The model is introduced in:
|
||||
Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset
|
||||
Joao Carreira, Andrew Zisserman
|
||||
https://arxiv.org/pdf/1705.07750v1.pdf.
|
||||
See also the Inception architecture, introduced in:
|
||||
Going deeper with convolutions
|
||||
Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed,
|
||||
Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, Andrew Rabinovich.
|
||||
http://arxiv.org/pdf/1409.4842v1.pdf.
|
||||
"""
|
||||
|
||||
# Endpoints of the model in order. During construction, all the endpoints up
|
||||
# to a designated `final_endpoint` are returned in a dictionary as the
|
||||
# second return value.
|
||||
VALID_ENDPOINTS = (
|
||||
'Conv3d_1a_7x7',
|
||||
'MaxPool3d_2a_3x3',
|
||||
'Conv3d_2b_1x1',
|
||||
'Conv3d_2c_3x3',
|
||||
'MaxPool3d_3a_3x3',
|
||||
'Mixed_3b',
|
||||
'Mixed_3c',
|
||||
'MaxPool3d_4a_3x3',
|
||||
'Mixed_4b',
|
||||
'Mixed_4c',
|
||||
'Mixed_4d',
|
||||
'Mixed_4e',
|
||||
'Mixed_4f',
|
||||
'MaxPool3d_5a_2x2',
|
||||
'Mixed_5b',
|
||||
'Mixed_5c',
|
||||
'Logits',
|
||||
'Predictions',
|
||||
)
|
||||
|
||||
def __init__(self, num_classes=400, spatial_squeeze=True,
|
||||
final_endpoint='Logits', name='inception_i3d', in_channels=3, dropout_keep_prob=0.5):
|
||||
"""Initializes I3D model instance.
|
||||
Args:
|
||||
num_classes: The number of outputs in the logit layer (default 400, which
|
||||
matches the Kinetics dataset).
|
||||
spatial_squeeze: Whether to squeeze the spatial dimensions for the logits
|
||||
before returning (default True).
|
||||
final_endpoint: The model contains many possible endpoints.
|
||||
`final_endpoint` specifies the last endpoint for the model to be built
|
||||
up to. In addition to the output at `final_endpoint`, all the outputs
|
||||
at endpoints up to `final_endpoint` will also be returned, in a
|
||||
dictionary. `final_endpoint` must be one of
|
||||
InceptionI3d.VALID_ENDPOINTS (default 'Logits').
|
||||
name: A string (optional). The name of this module.
|
||||
Raises:
|
||||
ValueError: if `final_endpoint` is not recognized.
|
||||
"""
|
||||
|
||||
if final_endpoint not in self.VALID_ENDPOINTS:
|
||||
raise ValueError('Unknown final endpoint %s' % final_endpoint)
|
||||
|
||||
super(InceptionI3d, self).__init__()
|
||||
self._num_classes = num_classes
|
||||
self._spatial_squeeze = spatial_squeeze
|
||||
self._final_endpoint = final_endpoint
|
||||
self.logits = None
|
||||
|
||||
if self._final_endpoint not in self.VALID_ENDPOINTS:
|
||||
raise ValueError('Unknown final endpoint %s' % self._final_endpoint)
|
||||
|
||||
self.end_points = {}
|
||||
end_point = 'Conv3d_1a_7x7'
|
||||
self.end_points[end_point] = Unit3D(in_channels=in_channels, output_channels=64, kernel_shape=[7, 7, 7],
|
||||
stride=(2, 2, 2), padding=(3,3,3), name=name+end_point)
|
||||
if self._final_endpoint == end_point: return
|
||||
|
||||
end_point = 'MaxPool3d_2a_3x3'
|
||||
self.end_points[end_point] = MaxPool3dSamePadding(kernel_size=[1, 3, 3], stride=(1, 2, 2),
|
||||
padding=0)
|
||||
if self._final_endpoint == end_point: return
|
||||
|
||||
end_point = 'Conv3d_2b_1x1'
|
||||
self.end_points[end_point] = Unit3D(in_channels=64, output_channels=64, kernel_shape=[1, 1, 1], padding=0,
|
||||
name=name+end_point)
|
||||
if self._final_endpoint == end_point: return
|
||||
|
||||
end_point = 'Conv3d_2c_3x3'
|
||||
self.end_points[end_point] = Unit3D(in_channels=64, output_channels=192, kernel_shape=[3, 3, 3], padding=1,
|
||||
name=name+end_point)
|
||||
if self._final_endpoint == end_point: return
|
||||
|
||||
end_point = 'MaxPool3d_3a_3x3'
|
||||
self.end_points[end_point] = MaxPool3dSamePadding(kernel_size=[1, 3, 3], stride=(1, 2, 2),
|
||||
padding=0)
|
||||
if self._final_endpoint == end_point: return
|
||||
|
||||
end_point = 'Mixed_3b'
|
||||
self.end_points[end_point] = InceptionModule(192, [64,96,128,16,32,32], name+end_point)
|
||||
if self._final_endpoint == end_point: return
|
||||
|
||||
end_point = 'Mixed_3c'
|
||||
self.end_points[end_point] = InceptionModule(256, [128,128,192,32,96,64], name+end_point)
|
||||
if self._final_endpoint == end_point: return
|
||||
|
||||
end_point = 'MaxPool3d_4a_3x3'
|
||||
self.end_points[end_point] = MaxPool3dSamePadding(kernel_size=[3, 3, 3], stride=(2, 2, 2),
|
||||
padding=0)
|
||||
if self._final_endpoint == end_point: return
|
||||
|
||||
end_point = 'Mixed_4b'
|
||||
self.end_points[end_point] = InceptionModule(128+192+96+64, [192,96,208,16,48,64], name+end_point)
|
||||
if self._final_endpoint == end_point: return
|
||||
|
||||
end_point = 'Mixed_4c'
|
||||
self.end_points[end_point] = InceptionModule(192+208+48+64, [160,112,224,24,64,64], name+end_point)
|
||||
if self._final_endpoint == end_point: return
|
||||
|
||||
end_point = 'Mixed_4d'
|
||||
self.end_points[end_point] = InceptionModule(160+224+64+64, [128,128,256,24,64,64], name+end_point)
|
||||
if self._final_endpoint == end_point: return
|
||||
|
||||
end_point = 'Mixed_4e'
|
||||
self.end_points[end_point] = InceptionModule(128+256+64+64, [112,144,288,32,64,64], name+end_point)
|
||||
if self._final_endpoint == end_point: return
|
||||
|
||||
end_point = 'Mixed_4f'
|
||||
self.end_points[end_point] = InceptionModule(112+288+64+64, [256,160,320,32,128,128], name+end_point)
|
||||
if self._final_endpoint == end_point: return
|
||||
|
||||
end_point = 'MaxPool3d_5a_2x2'
|
||||
self.end_points[end_point] = MaxPool3dSamePadding(kernel_size=[2, 2, 2], stride=(2, 2, 2),
|
||||
padding=0)
|
||||
if self._final_endpoint == end_point: return
|
||||
|
||||
end_point = 'Mixed_5b'
|
||||
self.end_points[end_point] = InceptionModule(256+320+128+128, [256,160,320,32,128,128], name+end_point)
|
||||
if self._final_endpoint == end_point: return
|
||||
|
||||
end_point = 'Mixed_5c'
|
||||
self.end_points[end_point] = InceptionModule(256+320+128+128, [384,192,384,48,128,128], name+end_point)
|
||||
if self._final_endpoint == end_point: return
|
||||
|
||||
end_point = 'Logits'
|
||||
self.avg_pool = nn.AvgPool3d(kernel_size=[2, 7, 7],
|
||||
stride=(1, 1, 1))
|
||||
self.dropout = nn.Dropout(dropout_keep_prob)
|
||||
self.logits = Unit3D(in_channels=384+384+128+128, output_channels=self._num_classes,
|
||||
kernel_shape=[1, 1, 1],
|
||||
padding=0,
|
||||
activation_fn=None,
|
||||
use_batch_norm=False,
|
||||
use_bias=True,
|
||||
name='logits')
|
||||
|
||||
self.build()
|
||||
|
||||
|
||||
def replace_logits(self, num_classes):
|
||||
self._num_classes = num_classes
|
||||
self.logits = Unit3D(in_channels=384+384+128+128, output_channels=self._num_classes,
|
||||
kernel_shape=[1, 1, 1],
|
||||
padding=0,
|
||||
activation_fn=None,
|
||||
use_batch_norm=False,
|
||||
use_bias=True,
|
||||
name='logits')
|
||||
|
||||
|
||||
def build(self):
|
||||
for k in self.end_points.keys():
|
||||
self.add_module(k, self.end_points[k])
|
||||
|
||||
def forward(self, x):
|
||||
for end_point in self.VALID_ENDPOINTS:
|
||||
if end_point in self.end_points:
|
||||
x = self._modules[end_point](x) # use _modules to work with dataparallel
|
||||
|
||||
x = self.logits(self.dropout(self.avg_pool(x)))
|
||||
if self._spatial_squeeze:
|
||||
logits = x.squeeze(3).squeeze(3)
|
||||
# logits is batch X time X classes, which is what we want to work with
|
||||
return logits
|
||||
|
||||
|
||||
def extract_features(self, x):
|
||||
for end_point in self.VALID_ENDPOINTS:
|
||||
if end_point in self.end_points:
|
||||
x = self._modules[end_point](x)
|
||||
return self.avg_pool(x)
|
|
@ -0,0 +1,170 @@
|
|||
# Copyright (c) Microsoft Corporation. All rights reserved.
|
||||
# Licensed under the MIT License.
|
||||
|
||||
import os
|
||||
import time
|
||||
import sys
|
||||
import numpy as np
|
||||
|
||||
import fire
|
||||
import torch
|
||||
import torch.nn as nn
|
||||
import torch.nn.functional as F
|
||||
import torchvision
|
||||
from torchvision import datasets, transforms
|
||||
|
||||
from videotransforms import (
|
||||
GroupScale, GroupCenterCrop, GroupNormalize, Stack
|
||||
)
|
||||
|
||||
from models.pytorch_i3d import InceptionI3d
|
||||
|
||||
from metrics import accuracy, AverageMeter
|
||||
|
||||
from dataset import I3DDataSet
|
||||
from default import _C as config
|
||||
from default import update_config
|
||||
|
||||
# to work with vscode debugger https://github.com/joblib/joblib/issues/864
|
||||
import multiprocessing
|
||||
multiprocessing.set_start_method('spawn', True)
|
||||
|
||||
|
||||
def load_model(modality, state_dict_file):
|
||||
|
||||
channels = 3 if modality == "RGB" else 2
|
||||
model = InceptionI3d(config.DATASET.NUM_CLASSES, in_channels=channels)
|
||||
state_dict = torch.load(state_dict_file)
|
||||
model.load_state_dict(state_dict)
|
||||
model = model.cuda()
|
||||
|
||||
return model
|
||||
|
||||
|
||||
def test(model, test_loader, modality):
|
||||
|
||||
model.eval()
|
||||
|
||||
target_list = []
|
||||
predictions_list = []
|
||||
with torch.no_grad():
|
||||
end = time.time()
|
||||
for step, (input, target) in enumerate(test_loader):
|
||||
target_list.append(target)
|
||||
input = input.cuda(non_blocking=True)
|
||||
|
||||
# compute output
|
||||
output = model(input)
|
||||
output = torch.mean(output, dim=2)
|
||||
predictions_list.append(output)
|
||||
|
||||
if step % config.TEST.PRINT_FREQ == 0:
|
||||
print(('Step: [{0}/{1}]'.format(step, len(test_loader))))
|
||||
|
||||
targets = torch.cat(target_list)
|
||||
predictions = torch.cat(predictions_list)
|
||||
return targets, predictions
|
||||
|
||||
|
||||
def run(*options, cfg=None):
|
||||
|
||||
update_config(config, options=options, config_file=cfg)
|
||||
|
||||
torch.backends.cudnn.benchmark = config.CUDNN.BENCHMARK
|
||||
|
||||
if torch.cuda.is_available():
|
||||
torch.cuda.manual_seed_all(config.SEED)
|
||||
np.random.seed(seed=config.SEED)
|
||||
|
||||
# Setup Augmentation/Transformation pipeline
|
||||
input_size = config.TRAIN.INPUT_SIZE
|
||||
resize_range_min = config.TRAIN.RESIZE_MIN
|
||||
|
||||
# Data-parallel
|
||||
devices_lst = list(range(torch.cuda.device_count()))
|
||||
print("Devices {}".format(devices_lst))
|
||||
|
||||
if (config.TEST.MODALITY == "RGB") or (config.TEST.MODALITY == "combined"):
|
||||
|
||||
rgb_loader = torch.utils.data.DataLoader(
|
||||
I3DDataSet(
|
||||
data_root=config.DATASET.DIR,
|
||||
split=config.DATASET.SPLIT,
|
||||
modality="RGB",
|
||||
train_mode=False,
|
||||
sample_frames_at_test=False,
|
||||
transform=torchvision.transforms.Compose([
|
||||
GroupScale(config.TRAIN.RESIZE_MIN),
|
||||
GroupCenterCrop(config.TRAIN.INPUT_SIZE),
|
||||
GroupNormalize(modality="RGB"),
|
||||
Stack(),
|
||||
])
|
||||
),
|
||||
batch_size=config.TEST.BATCH_SIZE,
|
||||
shuffle=False,
|
||||
num_workers=config.WORKERS,
|
||||
pin_memory=config.PIN_MEMORY
|
||||
)
|
||||
|
||||
rgb_model_file = config.TEST.MODEL_RGB
|
||||
if not os.path.exists(rgb_model_file):
|
||||
raise FileNotFoundError(rgb_model_file, " does not exist")
|
||||
rgb_model = load_model(modality="RGB", state_dict_file=rgb_model_file)
|
||||
|
||||
print("scoring with rgb model")
|
||||
targets, rgb_predictions = test(rgb_model, rgb_loader, "RGB")
|
||||
|
||||
del rgb_model
|
||||
|
||||
targets = targets.cuda(non_blocking=True)
|
||||
rgb_top1_accuracy = accuracy(rgb_predictions, targets, topk=(1, ))
|
||||
print("rgb top1 accuracy: ", rgb_top1_accuracy[0].cpu().numpy().tolist())
|
||||
|
||||
if (config.TEST.MODALITY == "flow") or (config.TEST.MODALITY == "combined"):
|
||||
|
||||
flow_loader = torch.utils.data.DataLoader(
|
||||
I3DDataSet(
|
||||
data_root=config.DATASET.DIR,
|
||||
split=config.DATASET.SPLIT,
|
||||
modality="flow",
|
||||
train_mode=False,
|
||||
sample_frames_at_test=False,
|
||||
transform=torchvision.transforms.Compose([
|
||||
GroupScale(config.TRAIN.RESIZE_MIN),
|
||||
GroupCenterCrop(config.TRAIN.INPUT_SIZE),
|
||||
GroupNormalize(modality="flow"),
|
||||
Stack(),
|
||||
])
|
||||
),
|
||||
batch_size=config.TEST.BATCH_SIZE,
|
||||
shuffle=False,
|
||||
num_workers=config.WORKERS,
|
||||
pin_memory=config.PIN_MEMORY
|
||||
)
|
||||
|
||||
flow_model_file = config.TEST.MODEL_FLOW
|
||||
if not os.path.exists(flow_model_file):
|
||||
raise FileNotFoundError(flow_model_file, " does not exist")
|
||||
flow_model = load_model(modality="flow", state_dict_file=flow_model_file)
|
||||
|
||||
print("scoring with flow model")
|
||||
targets, flow_predictions = test(flow_model, flow_loader, "flow")
|
||||
|
||||
del flow_model
|
||||
|
||||
targets = targets.cuda(non_blocking=True)
|
||||
flow_top1_accuracy = accuracy(flow_predictions, targets, topk=(1, ))
|
||||
print("flow top1 accuracy: ", flow_top1_accuracy[0].cpu().numpy().tolist())
|
||||
|
||||
if config.TEST.MODALITY == "combined":
|
||||
predictions = torch.stack([rgb_predictions, flow_predictions])
|
||||
predictions_mean = torch.mean(predictions, dim=0)
|
||||
top1accuracy = accuracy(predictions_mean, targets, topk=(1, ))
|
||||
print("combined top1 accuracy: ", top1accuracy[0].cpu().numpy().tolist())
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
fire.Fire(run)
|
|
@@ -0,0 +1,278 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.

import os
import time
import sys
import numpy as np
import fire
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.optim import lr_scheduler
from torch.autograd import Variable
import torchvision
from torchvision import datasets, transforms
from tensorboardX import SummaryWriter
from default import _C as config
from default import update_config

from videotransforms import (
    GroupRandomCrop, GroupRandomHorizontalFlip,
    GroupScale, GroupCenterCrop, GroupNormalize, Stack
)
from models.pytorch_i3d import InceptionI3d
from metrics import accuracy, AverageMeter
from dataset import I3DDataSet


# to work with vscode debugger https://github.com/joblib/joblib/issues/864
import multiprocessing
multiprocessing.set_start_method('spawn', True)


def train(train_loader, model, criterion, optimizer, epoch, writer=None):
    batch_time = AverageMeter()
    data_time = AverageMeter()
    losses = AverageMeter()
    top1 = AverageMeter()
    top5 = AverageMeter()

    # switch to train mode
    model.train()

    end = time.time()
    for step, (input, target) in enumerate(train_loader):
        # measure data loading time
        data_time.update(time.time() - end)

        input = input.cuda(non_blocking=True)
        target = target.cuda(non_blocking=True)

        # compute output
        output = model(input)
        output = torch.mean(output, dim=2)
        loss = criterion(output, target)

        # measure accuracy and record loss
        prec1, prec5 = accuracy(output, target, topk=(1,5))
        losses.update(loss.item(), input.size(0))
        top1.update(prec1[0], input.size(0))
        top5.update(prec5[0], input.size(0))

        loss = loss / config.TRAIN.GRAD_ACCUM_STEPS

        loss.backward()

        if step % config.TRAIN.GRAD_ACCUM_STEPS == 0:
            optimizer.step()
            optimizer.zero_grad()

        # measure elapsed time
        batch_time.update(time.time() - end)
        end = time.time()

        if step % config.TRAIN.PRINT_FREQ == 0:
            print(('Epoch: [{0}][{1}/{2}], lr: {lr:.5f}\t'
                   'Time {batch_time.val:.3f} ({batch_time.avg:.3f})\t'
                   'Data {data_time.val:.3f} ({data_time.avg:.3f})\t'
                   'Loss {loss.val:.4f} ({loss.avg:.4f})\t'
                   'Prec@1 {top1.val:.3f} ({top1.avg:.3f})\t'
                   'Prec@5 {top5.val:.3f} ({top5.avg:.3f})'.format(
                       epoch, step, len(train_loader), batch_time=batch_time,
                       data_time=data_time, loss=losses, top1=top1, top5=top5, lr=optimizer.param_groups[-1]['lr'])))

    if writer:
        writer.add_scalar('train/loss', losses.avg, epoch+1)
        writer.add_scalar('train/top1', top1.avg, epoch+1)
        writer.add_scalar('train/top5', top5.avg, epoch+1)


def validate(val_loader, model, criterion, epoch, writer=None):
    batch_time = AverageMeter()
    losses = AverageMeter()
    top1 = AverageMeter()
    top5 = AverageMeter()

    # switch to evaluate mode
    model.eval()

    with torch.no_grad():
        end = time.time()
        for step, (input, target) in enumerate(val_loader):
            input = input.cuda(non_blocking=True)
            target = target.cuda(non_blocking=True)

            # compute output
            output = model(input)
            output = torch.mean(output, dim=2)
            loss = criterion(output, target)

            # measure accuracy and record loss
            prec1, prec5 = accuracy(output, target, topk=(1,5))

            losses.update(loss.item(), input.size(0))
            top1.update(prec1[0], input.size(0))
            top5.update(prec5[0], input.size(0))

            # measure elapsed time
            batch_time.update(time.time() - end)
            end = time.time()

            if step % config.TEST.PRINT_FREQ == 0:
                print(('Test: [{0}/{1}]\t'
                       'Time {batch_time.val:.3f} ({batch_time.avg:.3f})\t'
                       'Loss {loss.val:.4f} ({loss.avg:.4f})\t'
                       'Prec@1 {top1.val:.3f} ({top1.avg:.3f})\t'
                       'Prec@5 {top5.val:.3f} ({top5.avg:.3f})'.format(
                           step, len(val_loader), batch_time=batch_time, loss=losses,
                           top1=top1, top5=top5)))

    print(('Testing Results: Prec@1 {top1.avg:.3f} Prec@5 {top5.avg:.3f} Loss {loss.avg:.5f}'
           .format(top1=top1, top5=top5, loss=losses)))

    if writer:
        writer.add_scalar('val/loss', losses.avg, epoch+1)
        writer.add_scalar('val/top1', top1.avg, epoch+1)
        writer.add_scalar('val/top5', top5.avg, epoch+1)

    return losses.avg


def run(*options, cfg=None):
    """Run training and validation of model

    Notes:
        Options can be passed in via the options argument and loaded from the cfg file
        Options loaded from default.py will be overridden by options loaded from cfg file
        Options passed in through the options argument will override options loaded from cfg file

    Args:
        *options (str, int, optional): Options used to override what is loaded from the config.
            To see what options are available consult default.py
        cfg (str, optional): Location of config file to load. Defaults to None.
    """
    update_config(config, options=options, config_file=cfg)

    print("Training ", config.TRAIN.MODALITY, " model.")
    print("Batch size:", config.TRAIN.BATCH_SIZE, " Gradient accumulation steps:", config.TRAIN.GRAD_ACCUM_STEPS)

    torch.backends.cudnn.benchmark = config.CUDNN.BENCHMARK

    torch.manual_seed(config.SEED)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(config.SEED)
    np.random.seed(seed=config.SEED)

    # Log to tensorboard
    writer = SummaryWriter(log_dir=config.LOG_DIR)

    # Setup dataloaders
    train_loader = torch.utils.data.DataLoader(
        I3DDataSet(
            data_root=config.DATASET.DIR,
            split=config.DATASET.SPLIT,
            sample_frames=config.TRAIN.SAMPLE_FRAMES,
            modality=config.TRAIN.MODALITY,
            transform=torchvision.transforms.Compose([
                GroupScale(config.TRAIN.RESIZE_MIN),
                GroupRandomCrop(config.TRAIN.INPUT_SIZE),
                GroupRandomHorizontalFlip(),
                GroupNormalize(modality=config.TRAIN.MODALITY),
                Stack(),
            ])
        ),
        batch_size=config.TRAIN.BATCH_SIZE,
        shuffle=True,
        num_workers=config.WORKERS,
        pin_memory=config.PIN_MEMORY
    )

    val_loader = torch.utils.data.DataLoader(
        I3DDataSet(
            data_root=config.DATASET.DIR,
            split=config.DATASET.SPLIT,
            modality=config.TRAIN.MODALITY,
            train_mode=False,
            transform=torchvision.transforms.Compose([
                GroupScale(config.TRAIN.RESIZE_MIN),
                GroupCenterCrop(config.TRAIN.INPUT_SIZE),
                GroupNormalize(modality=config.TRAIN.MODALITY),
                Stack(),
            ]),
        ),
        batch_size=config.TEST.BATCH_SIZE,
        shuffle=False,
        num_workers=config.WORKERS,
        pin_memory=config.PIN_MEMORY
    )

    # Setup model
    if config.TRAIN.MODALITY == "RGB":
        channels = 3
        checkpoint = config.MODEL.PRETRAINED_RGB
    elif config.TRAIN.MODALITY == "flow":
        channels = 2
        checkpoint = config.MODEL.PRETRAINED_FLOW
    else:
        raise ValueError("Modality must be RGB or flow")

    i3d_model = InceptionI3d(400, in_channels=channels)
    i3d_model.load_state_dict(torch.load(checkpoint))

    # Replace final FC layer to match dataset
    i3d_model.replace_logits(config.DATASET.NUM_CLASSES)

    criterion = torch.nn.CrossEntropyLoss().cuda()

    optimizer = optim.SGD(
        i3d_model.parameters(),
        lr=0.1,
        momentum=0.9,
        weight_decay=0.0000001
    )

    i3d_model = i3d_model.cuda()

    scheduler = optim.lr_scheduler.ReduceLROnPlateau(
        optimizer,
        factor=0.1,
        patience=2,
        verbose=True,
        threshold=1e-4,
        min_lr=1e-4
    )

    # Data-parallel
    devices_lst = list(range(torch.cuda.device_count()))
    print("Devices {}".format(devices_lst))
    if len(devices_lst) > 1:
        i3d_model = torch.nn.DataParallel(i3d_model)

    if not os.path.exists(config.MODEL.CHECKPOINT_DIR):
        os.makedirs(config.MODEL.CHECKPOINT_DIR)

    for epoch in range(config.TRAIN.MAX_EPOCHS):

        train(train_loader,
              i3d_model,
              criterion,
              optimizer,
              epoch,
              writer
        )

        if (epoch + 1) % config.TEST.EVAL_FREQ == 0 or epoch == config.TRAIN.MAX_EPOCHS - 1:
            val_loss = validate(val_loader, i3d_model, criterion, epoch, writer)
            scheduler.step(val_loss)
            torch.save(
                i3d_model.module.state_dict(),
                config.MODEL.CHECKPOINT_DIR+'/'+config.MODEL.NAME+'_split'+str(config.DATASET.SPLIT)+'_epoch'+str(epoch).zfill(3)+'.pt'
            )

    writer.close()


if __name__ == "__main__":
    fire.Fire(run)
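In `train()` above, the loss is divided by `config.TRAIN.GRAD_ACCUM_STEPS` and the optimizer only steps every few batches; this is gradient accumulation, which emulates a larger effective batch size when GPU memory limits the per-batch size. A minimal sketch of the general pattern, with a made-up model and random data (not the repository's actual training loop):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)                       # stand-in model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()
grad_accum_steps = 4                           # same role as config.TRAIN.GRAD_ACCUM_STEPS

optimizer.zero_grad()
for step in range(16):
    inputs = torch.randn(8, 10)                # one small batch
    targets = torch.randint(0, 2, (8,))
    loss = criterion(model(inputs), targets)

    # scale so the accumulated gradient matches one large-batch update
    (loss / grad_accum_steps).backward()

    if (step + 1) % grad_accum_steps == 0:     # update once every grad_accum_steps batches
        optimizer.step()
        optimizer.zero_grad()
```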
@@ -0,0 +1,98 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.

# Adapted from https://github.com/feiyunzhang/i3d-non-local-pytorch/blob/master/transforms.py

import torchvision
import random
from PIL import Image, ImageOps
import numpy as np
import numbers
import math
import torch


class GroupScale(object):

    def __init__(self, size, interpolation=Image.BILINEAR):
        self.worker = torchvision.transforms.Resize(size, interpolation)

    def __call__(self, img_group):
        return [self.worker(img) for img in img_group]


class GroupRandomCrop(object):
    def __init__(self, size):
        if isinstance(size, numbers.Number):
            self.size = (int(size), int(size))
        else:
            self.size = size

    def __call__(self, img_group):

        w, h = img_group[0].size
        th, tw = self.size

        out_images = list()

        x1 = random.randint(0, w - tw)
        y1 = random.randint(0, h - th)

        for img in img_group:
            assert(img.size[0] == w and img.size[1] == h)
            if w == tw and h == th:
                out_images.append(img)
            else:
                out_images.append(img.crop((x1, y1, x1 + tw, y1 + th)))

        return out_images


class GroupCenterCrop(object):
    def __init__(self, size):
        self.worker = torchvision.transforms.CenterCrop(size)

    def __call__(self, img_group):
        cropped_imgs = [self.worker(img) for img in img_group]
        return cropped_imgs


class GroupRandomHorizontalFlip(object):

    def __call__(self, img_group):
        v = random.random()
        if v < 0.5:
            ret = [img.transpose(Image.FLIP_LEFT_RIGHT) for img in img_group]
            return ret
        else:
            return img_group


class GroupNormalize(object):

    def __init__(self, modality, means=[0.485, 0.456, 0.406], stds=[0.229, 0.224, 0.225]):
        self.modality = modality
        self.means = means
        self.stds = stds
        self.tensor_worker = torchvision.transforms.ToTensor()
        self.norm_worker = torchvision.transforms.Normalize(mean=means, std=stds)

    def __call__(self, img_group):
        if self.modality == "RGB":
            # Convert images to tensors in range [0, 1]
            img_tensors = [self.tensor_worker(img) for img in img_group]
            # Normalize to imagenet means and stds
            img_tensors = [self.norm_worker(img) for img in img_tensors]
        else:
            # Convert images to numpy arrays
            img_arrays = [np.asarray(img).transpose([2, 0, 1]) for img in img_group]
            # Scale to [-1, 1] and convert to tensor
            img_tensors = [torch.from_numpy((img / 255.) * 2 - 1) for img in img_arrays]
        return img_tensors


class Stack(object):

    def __call__(self, img_tensors):
        # Stack tensors and permute from D x C x H x W to C x D x H x W
        return torch.stack(img_tensors, dim=0).permute(1, 0, 2, 3).float()
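The group transforms above operate on a *list* of PIL frames (one clip) rather than a single image, and `Stack` turns that list into the `C x D x H x W` tensor the I3D model consumes. A small usage sketch with dummy RGB frames; the module name `videotransforms` matches the import used by the training script above, and the sizes are illustrative:

```python
import numpy as np
import torchvision
from PIL import Image

from videotransforms import GroupScale, GroupCenterCrop, GroupNormalize, Stack

# Eight dummy RGB frames standing in for a sampled video clip
frames = [Image.fromarray(np.zeros((300, 400, 3), dtype=np.uint8)) for _ in range(8)]

transform = torchvision.transforms.Compose([
    GroupScale(256),             # resize the short side of every frame
    GroupCenterCrop(224),        # same center crop for the whole group
    GroupNormalize(modality="RGB"),
    Stack(),                     # list of C x H x W tensors -> C x D x H x W
])

clip = transform(frames)
print(clip.shape)                # torch.Size([3, 8, 224, 224])
```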
@@ -0,0 +1,53 @@
# Video Annotation Summary For Action Recognition

To create a training or evaluation set for action recognition, the ground-truth start/end position of each action in the videos needs to be annotated. We looked into various tools for this, and the tool we liked most (by far) is the [VGG Image Annotator (VIA)](http://www.robots.ox.ac.uk/~vgg/software/via/) written by the VGG group at Oxford.


## Instructions For Using VIA Tool

Below are a few tips and steps on how to use the VIA tool. A fully functioning live demo of the tool can be found [here](http://www.robots.ox.ac.uk/~vgg/software/via/demo/via_video_annotator.html).

![](./media/fig3.png)
<p align="center">Screenshot of VIA Tool</p>

How to use the tool for action recognition:
- Step 1: Download the zip file from the link [here](http://www.robots.ox.ac.uk/~vgg/software/via/downloads/via-1.0.6.zip).
- Step 2: Unzip the tool and open *via_video_annotator.html* to start the annotation tool. *Note: support for some browsers seems not fully stable - we found Chrome to work best.*
- Step 3: Import the video file(s) from local disk using ![](./media/fig4.png) or from a url using ![](./media/fig5.png).
- Step 4: Use ![](./media/fig1.png) to create a new attribute for action annotation. Select *Temporal Segment in Video or Audio* for *Anchor*. To see the created attribute, click ![](./media/fig1.png) again.
- Step 5: Update the *Timeline List* with the actions you want to track. Separate different actions as in e.g. "1. eat, 2. drink" to get two tracks, one for *eat* and one for *drink*. Click *update* to see the updated tracks.
- Step 6: Click on a track to add segment annotations for that action. Use key `a` to add a temporal segment at the current time and `Shift + a` to update the edge of the temporal segment to the current time.
- Step 7: Export the annotations using ![](./media/fig2.png). Select *Only Temporal Segments as CSV* if you only have temporal segment annotations.

## Scripts for use with the VIA Tool

The VIA tool outputs annotations as a csv file. Often, however, we need each annotated action to be written as its own clip into a separate file. For this, we provide some utility functions (in this folder) which help with the following; a short usage sketch follows the list:
- Extraction of each action as a "positive" clip, and of "negative" clips defined as video segments where no action-of-interest occurs.
- Conversion of the video clips to a format which the VIA tool knows how to read.
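As a rough sketch of how these utilities fit together (mirroring the driver script later in this change): the VIA export is read into a dataframe, a clip file name and action label are derived per row, and each positive segment is cut out with ffmpeg. The csv name and directories below are placeholders, and `video_annotation_utils` is the utility module shipped alongside these scripts.

```python
import pandas as pd

from video_annotation_utils import (
    create_clip_file_name,
    get_clip_action_label,
    extract_clip,
)

# Placeholder paths - point these at your own VIA export and video folder
annotation_df = pd.read_csv("via_export.csv", skiprows=1)
annotation_df = annotation_df.loc[annotation_df["metadata"] != "{}"]

# Derive output clip names and action labels from the VIA rows
annotation_df["clip_file_name"] = annotation_df.apply(
    lambda x: create_clip_file_name(x, clip_file_format="mp4"), axis=1
)
annotation_df["clip_action_label"] = annotation_df.apply(get_clip_action_label, axis=1)

# Write one positive clip per annotated segment (requires ffmpeg on the PATH)
annotation_df.apply(lambda x: extract_clip(x, "videos", "clips"), axis=1)
```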

## Annotation Tools Comparison

Below is a list of alternative UIs for annotating actions; in our opinion, however, the VIA tool is by far the best performer. We distinguish between:
- Fixed-length clip annotation: the UI splits the video into fixed-length clips, and the user then annotates the clips.
- Segmentation annotation: the user annotates the exact start and end position of each action directly. This is more time-consuming than fixed-length clip annotation, but comes with higher localization accuracy.

See also the [HACS Dataset web page](http://hacs.csail.mit.edu/index.html#explore) for some examples showing these two types of annotations.


| Tool Name | Annotation Type |Pros|Cons|Whether Open Source|
| ----------- | ----------- |----------- |----------- |----------- |
| [MuViLab](https://github.com/ale152/muvilab) | Fixed-length clip annotation | <ul><li> Accelerates clip annotation by displaying many clips at the same time</li><br><li> Especially helpful when the actions are sparse</li></ul>| <ul><li> Not useful when the actions are very short (e.g. a second)</li></ul>|Open source on Github|
| [VIA (VGG Image Annotator)](http://www.robots.ox.ac.uk/~vgg/software/via/) | Segmentation annotation|<ul><li> Light-weight, no prerequisite besides downloading a zip file</li> <br> <li> Actively developed Gitlab project </li><br> <li> Support for: annotating video with high precision (milliseconds and frames), previewing the annotated clips, exporting start and end times of the actions to csv, annotating multiple actions in different tracks on the same video </li><br> <li> Easy to ramp up and use</li></ul>|<ul><li> Code can be unstable, e.g. sometimes the tool becomes unresponsive.</li></ul>|Open source on Gitlab|
|[ANVIL](http://www.anvil-software.org/#)|Segmentation annotation|<ul> <li> Support for high-precision annotation, exporting the start and end times.</li></ul>| <ul><li> Heavier prerequisite with Java required </li><br> <li> Harder to ramp up compared to VIA, with lots of specifications, etc. </li><br> <li> Java-related issues can make the tool difficult to run. </li></ul>|Not open source, but free to download|
|[Action Annotation Tool](https://github.com/devyhia/action-annotation)| Segmentation annotation|<ul><li> Adds labels to key frames in video</li> <br> <li> Supports high precision to milliseconds</li></ul>|<ul><li> Much less convenient compared to VIA or ANVIL</li> <br> <li> Not in active development</li></ul>|Open source on Github|


## References
- [Deep Learning for Videos: A 2018 Guide to Action Recognition.](http://blog.qure.ai/notes/deep-learning-for-videos-action-recognition-review#targetText=Action%20recognition%20by%20dense%20trajectories,Trajectories%20by%20Wang%20et%20al)
- [Zhao, H., et al. "HACS: Human action clips and segments dataset for recognition and temporal localization." arXiv preprint arXiv:1712.09374 (2019).](https://arxiv.org/abs/1712.09374)
- [Kay, Will, et al. "The Kinetics human action video dataset." arXiv preprint arXiv:1705.06950 (2017).](https://arxiv.org/abs/1705.06950)
- [Abhishek Dutta and Andrew Zisserman. 2019. The VIA Annotation Software for Images, Audio and Video. In Proceedings of the 27th ACM International Conference on Multimedia (MM '19), October 21-25, 2019, Nice, France. ACM, New York, NY, USA, 4 pages. https://doi.org/10.1145/3343031.3350535.](https://arxiv.org/abs/1904.10699)
@@ -0,0 +1,148 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.

# prerequisites:
# (1) download and extract ffmpeg: https://github.com/adaptlearning/adapt_authoring/wiki/Installing-FFmpeg
# (2) make sure ffmpeg is on your system's PATH environment variable
# the script depends on the following fixed property of the csv:
# skiprows=1

import argparse
import ast
import os
import sys

import pandas as pd

sys.path.append("lib")
from video_annotation_utils import (
    create_clip_file_name,
    get_clip_action_label,
    extract_clip,
    extract_negative_samples_per_file,
)


def main(
    annotation_filepath,
    has_header,
    video_dir,
    clip_dir,
    label_filepath,
    clip_format,
    clip_margin,
    clip_length,
):
    # set pandas display
    pd.set_option("display.max_columns", 500)
    pd.set_option("display.width", 1000)

    if has_header:
        skiprows = 1
    else:
        skiprows = 0

    # read in the start time and end time of the clips while removing the records with no label
    video_info_df = pd.read_csv(annotation_filepath, skiprows=skiprows)
    video_info_df = video_info_df.loc[video_info_df["metadata"] != "{}"]

    # create clip file name and label
    video_info_df["clip_file_name"] = video_info_df.apply(
        lambda x: create_clip_file_name(x, clip_file_format=clip_format),
        axis=1,
    )
    video_info_df["clip_action_label"] = video_info_df.apply(
        lambda x: get_clip_action_label(x), axis=1
    )

    # remove the clips with action label '_DEFAULT'
    video_info_df = video_info_df.loc[
        video_info_df["clip_action_label"] != "_DEFAULT"
    ]

    # extract the positive clips
    video_info_df.apply(lambda x: extract_clip(x, video_dir, clip_dir), axis=1)

    # write the labels
    video_info_df[["clip_file_name", "clip_action_label"]].to_csv(
        label_filepath, index=False
    )

    # Extract negative samples
    # add column with file name
    video_info_df["video_file"] = video_info_df.apply(
        lambda x: ast.literal_eval(x.file_list)[0], axis=1
    )
    negative_clip_dir = os.path.join(clip_dir, "negative_samples")
    video_file_list = list(video_info_df["video_file"].unique())
    negative_sample_info_df = pd.DataFrame()
    for video_file in video_file_list:
        res_df = extract_negative_samples_per_file(
            video_file,
            video_dir,
            video_info_df,
            negative_clip_dir,
            clip_format,
            ignore_clip_length=clip_margin,
            clip_length=clip_length,
            skip_clip_length=clip_margin,
        )

        negative_sample_info_df = negative_sample_info_df.append(res_df)

    negative_sample_info_df.to_csv(
        os.path.join(negative_clip_dir, "negative_clip_info.csv"), index=False
    )


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "-A", "--annotation_filepath", help="CSV filepath from the annotator"
    )
    parser.add_argument(
        "-H",
        "--has_header",
        help="Set if the annotation file has a header",
        action="store_true",
    )
    parser.add_argument("-I", "--input_dir", help="Input video dir")
    parser.add_argument(
        "-O",
        "--output_dir",
        help="Output dir where the extracted clips will be stored",
        default="./outputs",
    )
    parser.add_argument(
        "-L",
        "--label_filepath",
        help="Path where the label csv will be stored",
        default="./outputs/labels.csv",
    )
    parser.add_argument("-F", "--clip_format", default="mp4")
    parser.add_argument(
        "-M",
        "--clip_margin",
        type=float,
        help="The length around the positive samples to be ignored for negative sampling",
        default=3.0,
    )
    parser.add_argument(
        "-T",
        "--clip_length",
        type=float,
        help="The length of negative samples to extract",
        default=2.0,
    )
    args = parser.parse_args()

    main(
        annotation_filepath=args.annotation_filepath,
        has_header=args.has_header,
        video_dir=args.input_dir,
        clip_dir=args.output_dir,
        label_filepath=args.label_filepath,
        clip_format=args.clip_format,
        clip_margin=args.clip_margin,
        clip_length=args.clip_length,
    )
@@ -0,0 +1,433 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.

import ast
import os
import subprocess

import numpy as np
import pandas as pd


# transform the encoded video:
def video_format_conversion(video_path, output_path, h264_format=False):
    """
    Encode video in a different format.

    :param video_path: str.
        Path to input video.
    :param output_path: str.
        Path where converted video will be written to.
    :param h264_format: boolean.
        Set to true to save time if input is in h264_format.
    :return: None.
    """

    if not h264_format:
        subprocess.run(
            [
                "ffmpeg",
                "-i",
                video_path,
                "-c",
                "copy",
                "-map",
                "0",
                output_path,
            ]
        )
    else:
        subprocess.run(
            ["ffmpeg", "-i", video_path, "-vcodec", "libx264", output_path]
        )


def create_clip_file_name(row, clip_file_format="mp4"):
    """
    Create the output clip file name.

    :param row: pandas.Series.
        One row of the video annotation output from the VIA tool.
        This function requires the output from the VIA tool to contain a column '# CSV_HEADER = metadata_id'.
    :param clip_file_format: str.
        The format of the output clip file.
    :return: str.
        The output clip file name.
    """
    video_file = ast.literal_eval(row.file_list)[0]
    clip_id = row["# CSV_HEADER = metadata_id"]
    clip_file = "{}_{}.{}".format(video_file, clip_id, clip_file_format)
    return clip_file


def get_clip_action_label(row):
    """
    Get the action label of a positive clip.
    This function requires the output from the VIA tool to contain a column 'metadata'.

    :param row: pandas.Series.
        One row of the video annotation output.
    :return: str.
    """
    label_dict = ast.literal_eval(row.metadata)
    track_key = list(label_dict.keys())[0]
    return label_dict[track_key]


def _extract_clip_ffmpeg(
    start_time, duration, video_path, clip_path, ffmpeg_path=None
):
    """
    Use ffmpeg to extract a clip from the video based on the start time and duration of the clip.

    :param start_time: float.
        The start time of the clip.
    :param duration: float.
        The duration of the clip.
    :param video_path: str.
        The path of the input video.
    :param clip_path: str.
        The path of the output clip.
    :param ffmpeg_path: str.
        The path of ffmpeg. Optional; use it when ffmpeg has not been added to the PATH
        environment variable.
    :return: None.
    """

    subprocess.run(
        [
            os.path.join(ffmpeg_path, "ffmpeg")
            if ffmpeg_path is not None
            else "ffmpeg",
            "-ss",
            str(start_time),
            "-i",
            video_path,
            "-t",
            str(duration),
            clip_path,
            "-codec",
            "copy",
            "-y",
        ]
    )


def extract_clip(row, video_dir, clip_dir, ffmpeg_path=None):
    """
    Extract the positive clip based on a row of the output annotation file.

    :param row: pandas.Series.
        One row of the video annotation output.
    :param video_dir: str.
        The directory of the input videos.
    :param clip_dir: str.
        The directory of the output positive clips.
    :param ffmpeg_path: str.
        The path of ffmpeg. Optional; use it when ffmpeg has not been added to the PATH
        environment variable.
    :return: None.
    """

    if not os.path.exists(clip_dir):
        os.makedirs(clip_dir)

    # there are two different outputs of the csv from VIA showing the annotation start and end:
    # (1) in two columns: temporal_segment_start and temporal_segment_end
    # (2) in one column: temporal_coordinates
    if "temporal_segment_start" in row.index:
        start_time = row.temporal_segment_start
        if "temporal_segment_end" not in row.index:
            raise ValueError(
                "There is no column named 'temporal_segment_end'. Cannot get the full details "
                "of the action temporal intervals."
            )
        end_time = row.temporal_segment_end
    elif "temporal_coordinates" in row.index:
        start_time, end_time = ast.literal_eval(row.temporal_coordinates)
    else:
        raise Exception("There is no temporal information in the csv.")

    clip_sub_dir = os.path.join(clip_dir, row.clip_action_label)
    if not os.path.exists(clip_sub_dir):
        os.makedirs(clip_sub_dir)

    duration = end_time - start_time
    video_file = ast.literal_eval(row.file_list)[0]
    video_path = os.path.join(video_dir, video_file)
    clip_file = row.clip_file_name
    clip_path = os.path.join(clip_sub_dir, clip_file)

    if not os.path.exists(video_path):
        raise ValueError(
            "The video path '{}' is not valid.".format(video_path)
        )

    # ffmpeg -ss 9.222 -i youtube.mp4 -t 0.688 tmp.mp4 -codec copy -y
    _extract_clip_ffmpeg(
        start_time, duration, video_path, clip_path, ffmpeg_path
    )


def get_video_length(video_file_path):
    """
    Get the video length in seconds.

    :param video_file_path: str.
        The path of the video file.
    :return: float.
        The video duration in seconds, as reported by ffprobe. Raises RuntimeError if
        ffprobe reports an error.
    """
    cmd_list = [
        "ffprobe",
        "-v",
        "error",
        "-show_entries",
        "format=duration",
        "-of",
        "default=noprint_wrappers=1:nokey=1",
        video_file_path,
    ]
    result = subprocess.run(
        cmd_list, stdout=subprocess.PIPE, stderr=subprocess.PIPE
    )
    if len(result.stderr) > 0:
        raise RuntimeError(result.stderr)

    return float(result.stdout)


def _merge_temporal_interval(temporal_interval_list):
    """
    Merge the temporal intervals in the input temporal interval list. This is for situations
    where different actions have overlapping temporal intervals, e.g. if the input temporal
    interval list is [(1.0, 3.0), (2.0, 4.0)], then [(1.0, 4.0)] will be returned.

    :param temporal_interval_list: list of tuples.
        List of tuples with (temporal interval start time, temporal interval end time).
    :return: list of tuples.
        The merged temporal interval list.
    """
    # sort by the temporal interval start
    temporal_interval_list_sorted = sorted(
        temporal_interval_list, key=lambda x: x[0]
    )
    i = 0

    while i < len(temporal_interval_list_sorted) - 1:
        a1, b1 = temporal_interval_list_sorted[i]
        a2, b2 = temporal_interval_list_sorted[i + 1]
        if a2 <= b1:
            del temporal_interval_list_sorted[i]
            temporal_interval_list_sorted[i] = [a1, max(b1, b2)]
        else:
            i += 1
    return temporal_interval_list_sorted


def _split_interval(
    interval,
    left_ignore_clip_length,
    right_ignore_clip_length,
    clip_length,
    skip_clip_length=0,
):
    """
    Split the negative sample interval into the sub-intervals which will serve as the start and end of
    the negative sample clips.

    :param interval: tuple of (float, float).
        Tuple of start and end of the negative sample interval.
    :param left_ignore_clip_length: float.
        The clip length to ignore at the left/start of the interval. This is used to avoid creating
        negative sample clips with edges too close to positive samples. The same applies to right_ignore_clip_length.
    :param right_ignore_clip_length: float.
        The clip length to ignore at the right/end of the interval.
    :param clip_length: float.
        The clip length of the created negative clips.
    :param skip_clip_length: float.
        The skipped video length between two negative samples; this can be used to reduce the
        number of negative samples.
    :return: list of tuples.
        List of start and end times of the negative clips.
    """
    left, right = interval
    if (left_ignore_clip_length + right_ignore_clip_length) >= (right - left):
        return []
    new_left = left + left_ignore_clip_length
    new_right = right - right_ignore_clip_length

    if new_right - new_left < clip_length:
        return []

    interval_start_list = np.arange(
        new_left, new_right, clip_length + skip_clip_length
    )
    interval_end_list = interval_start_list + clip_length

    if interval_end_list[-1] > new_right:
        interval_start_list = interval_start_list[:-1]
        interval_end_list = interval_end_list[:-1]

    res = list(zip(list(interval_start_list), list(interval_end_list)))
    return res


def _split_interval_list(
    interval_list,
    left_ignore_clip_length,
    right_ignore_clip_length,
    clip_length,
    skip_clip_length=0,
):
    """
    Taking the interval list of the eligible negative sample time intervals, return the list of the
    start times and end times of the negative clips.

    :param interval_list: list of tuples.
        List of tuples containing the start time and end time of the eligible negative
        sample time intervals.
    :param left_ignore_clip_length: float.
        See _split_interval.
    :param right_ignore_clip_length: float.
        See _split_interval.
    :param clip_length: float.
        See _split_interval.
    :param skip_clip_length: float.
        See _split_interval.
    :return: list of tuples.
        List of start and end times of the negative clips.
    """
    interval_res = []
    for i in range(len(interval_list)):
        interval_res += _split_interval(
            interval_list[i],
            left_ignore_clip_length=left_ignore_clip_length,
            right_ignore_clip_length=right_ignore_clip_length,
            clip_length=clip_length,
            skip_clip_length=skip_clip_length,
        )
    return interval_res


def extract_negative_samples_per_file(
    video_file,
    video_dir,
    video_info_df,
    negative_clip_dir,
    clip_file_format,
    ignore_clip_length,
    clip_length,
    ffmpeg_path=None,
    skip_clip_length=0,
):
    """
    Extract the negative samples for a single video file.

    :param video_file: str.
        The name of the input video file.
    :param video_dir: str.
        The directory of the input video.
    :param video_info_df: pandas.DataFrame.
        The data frame which contains the video annotation output.
    :param negative_clip_dir: str.
        The directory of the output negative clips.
    :param clip_file_format: str.
        The format of the output negative clips.
    :param ignore_clip_length: float.
        The clip length to ignore around positive samples. This is used to avoid creating
        negative sample clips with edges too close to positive samples.
    :param clip_length: float.
        The clip length of the created negative clips.
    :param ffmpeg_path: str.
        The path of ffmpeg. Optional; use it when ffmpeg has not been added to the PATH
        environment variable.
    :param skip_clip_length: float.
        The skipped video length between two negative samples; this can be used to reduce the
        number of negative samples.
    :return: pandas.DataFrame.
        The data frame which contains start and end times of the negative clips.
    """

    # get the length of the video
    video_file_path = os.path.join(video_dir, video_file)
    video_duration = get_video_length(video_file_path)

    # get the action intervals
    if "temporal_coordinates" in video_info_df.columns:
        temporal_interval_series = video_info_df.loc[
            video_info_df["video_file"] == video_file, "temporal_coordinates"
        ]
        temporal_interval_list = temporal_interval_series.apply(
            lambda x: ast.literal_eval(x)
        ).tolist()
    elif "temporal_segment_start" in video_info_df.columns:
        video_start_list = video_info_df.loc[
            video_info_df["video_file"] == video_file, "temporal_segment_start"
        ].to_list()
        video_end_list = video_info_df.loc[
            video_info_df["video_file"] == video_file, "temporal_segment_end"
        ].to_list()
        temporal_interval_list = list(zip(video_start_list, video_end_list))
    else:
        raise Exception("There is no temporal information in the csv.")

    if not all(
        len(temporal_interval) % 2 == 0
        for temporal_interval in temporal_interval_list
    ):
        raise ValueError(
            "There is at least one time interval "
            "in {} having only one end point.".format(
                str(temporal_interval_list)
            )
        )

    temporal_interval_list = _merge_temporal_interval(temporal_interval_list)
    negative_sample_interval_list = (
        [0.0]
        + [t for interval in temporal_interval_list for t in interval]
        + [video_duration]
    )

    negative_sample_interval_list = [
        [
            negative_sample_interval_list[2 * i],
            negative_sample_interval_list[2 * i + 1],
        ]
        for i in range(len(negative_sample_interval_list) // 2)
    ]

    clip_interval_list = _split_interval_list(
        negative_sample_interval_list,
        left_ignore_clip_length=ignore_clip_length,
        right_ignore_clip_length=ignore_clip_length,
        clip_length=clip_length,
        skip_clip_length=skip_clip_length,
    )

    if not os.path.exists(negative_clip_dir):
        os.makedirs(negative_clip_dir)

    negative_clip_file_list = []
    for i, clip_interval in enumerate(clip_interval_list):
        start_time = clip_interval[0]
        duration = clip_interval[1] - clip_interval[0]
        negative_clip_file = "{}_{}.{}".format(video_file, i, clip_file_format)
        negative_clip_file_list.append(negative_clip_file)
        negative_clip_path = os.path.join(
            negative_clip_dir, negative_clip_file
        )
        _extract_clip_ffmpeg(
            start_time,
            duration,
            video_file_path,
            negative_clip_path,
            ffmpeg_path,
        )

    return pd.DataFrame(
        {
            "negative_clip_file_name": negative_clip_file_list,
            "clip_duration": clip_interval_list,
            "video_file": video_file,
        }
    )
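To make the negative-sampling logic in `extract_negative_samples_per_file` easier to follow, here is a small worked example with made-up times: positive segments are merged, their complement over the video duration gives candidate negative intervals, and each candidate is chopped into fixed-length clips. The private helpers are used directly here for illustration, just as the unit tests below do.

```python
from video_annotation_utils import _merge_temporal_interval, _split_interval_list

video_duration = 18.0                     # hypothetical video length in seconds
positives = [(2.0, 5.0), (4.0, 10.0)]     # two overlapping annotated actions

merged = _merge_temporal_interval(positives)   # -> [[2.0, 10.0]]

# Complement of the positive segments over [0, video_duration]
boundaries = [0.0] + [t for seg in merged for t in seg] + [video_duration]
negatives = [boundaries[2 * i: 2 * i + 2] for i in range(len(boundaries) // 2)]
# -> [[0.0, 2.0], [10.0, 18.0]]

clips = _split_interval_list(
    negatives,
    left_ignore_clip_length=1.0,   # stay 1s away from the positive segments
    right_ignore_clip_length=1.0,
    clip_length=2.0,
    skip_clip_length=0,
)
# -> [(11.0, 13.0), (13.0, 15.0), (15.0, 17.0)]
```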
@@ -0,0 +1,152 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.

import ast
import math
import os
import sys

sys.path.append("../")

import pandas as pd
import pytest

from video_annotation.video_annotation_utils import (
    # Usually we don't test private functions.
    # But as clip-extraction results are tricky to test, we test some of the private functions here.
    _merge_temporal_interval,
    _split_interval_list,
    create_clip_file_name,
    extract_clip,
    extract_negative_samples_per_file,
    get_clip_action_label,
    get_video_length,
)


VIDEO_DIR = os.path.join("tests", "data")
SAMPLE_VIDEO1_FILE = "3173 1-7 Cold 2019-08-19_13_56_14_787.mp4"
SAMPLE_VIDEO1_PATH = os.path.join(VIDEO_DIR, SAMPLE_VIDEO1_FILE)
SAMPLE_ANNOTATION_FILE = "Unnamed-VIA Project19Sep2019_18h42m15s_export.csv"
SAMPLE_ANNOTATION_PATH = os.path.join(VIDEO_DIR, SAMPLE_ANNOTATION_FILE)
FRAME_PER_SECOND = 30


@pytest.fixture
def annotation_df():
    video_info_df = pd.read_csv(SAMPLE_ANNOTATION_PATH, skiprows=1)
    return video_info_df.loc[video_info_df["metadata"] != "{}"]


def test_create_clip_file_name(annotation_df):
    row1 = annotation_df.iloc[0]
    file1 = create_clip_file_name(row1, clip_file_format="mp4")
    assert file1 == "3173 1-7 Cold 2019-08-19_13_56_14_787.mp4_1_zCXg2CQ5.mp4"
    file2 = create_clip_file_name(row1, clip_file_format="avi")
    assert file2 == "3173 1-7 Cold 2019-08-19_13_56_14_787.mp4_1_zCXg2CQ5.avi"


def test_get_clip_action_label(annotation_df):
    row1 = annotation_df.iloc[0]
    assert get_clip_action_label(row1) == "1.action_1"


def test_extract_clip(annotation_df, tmp_path):
    row1 = annotation_df.iloc[0].copy()
    row1["clip_action_label"] = get_clip_action_label(row1)
    row1["clip_file_name"] = create_clip_file_name(
        row1, clip_file_format="mp4"
    )
    extract_clip(
        row=row1, video_dir=VIDEO_DIR, clip_dir=tmp_path, ffmpeg_path=None
    )
    output_clip_path = os.path.join(
        tmp_path, row1["clip_action_label"], row1["clip_file_name"]
    )

    # Test that the extracted positive clip length is the same as the annotated segment length
    assert (
        abs(
            get_video_length(output_clip_path)
            - (row1.temporal_segment_end - row1.temporal_segment_start)
        )
        <= 1 / FRAME_PER_SECOND
    )


def test_extract_negative_samples_per_file(annotation_df, tmp_path):
    """TODO This function should test two things which are missing now:
    1. assert that the extracted negative samples do not overlap with any positive samples
    2. assert that the number of extracted negative samples is correct
    """
    video_df = annotation_df.copy()
    video_df["video_file"] = video_df.apply(
        lambda x: ast.literal_eval(x.file_list)[0], axis=1
    )
    clip_length = 2
    extract_negative_samples_per_file(
        video_file=SAMPLE_VIDEO1_FILE,
        video_dir=VIDEO_DIR,
        video_info_df=video_df,
        negative_clip_dir=tmp_path,
        clip_file_format="mp4",
        ignore_clip_length=0,
        clip_length=clip_length,
        ffmpeg_path=None,
        skip_clip_length=0,
    )
    for i in range(4):
        negative_clip_length = get_video_length(
            os.path.join(tmp_path, "{}_{}.mp4".format(SAMPLE_VIDEO1_FILE, i))
        )
        assert abs(negative_clip_length - clip_length) <= 1 / FRAME_PER_SECOND


def test_get_video_length():
    assert get_video_length(SAMPLE_VIDEO1_PATH) == 18.719


def test_merge_temporal_interval():
    interval_list1 = [(1, 2.5), (1.5, 2), (0.5, 1.5)]
    merged_interval_list1 = _merge_temporal_interval(interval_list1)
    assert merged_interval_list1 == [[0.5, 2.5]]

    interval_list2 = [(-1.1, 0), (0, 1.2), (4.5, 7), (6.8, 8.5)]
    merged_interval_list2 = _merge_temporal_interval(interval_list2)
    assert merged_interval_list2 == [[-1.1, 1.2], [4.5, 8.5]]


def _float_tuple_close(input_tuple1, input_tuple2):
    return all(
        math.isclose(input_tuple1[i], input_tuple2[i])
        for i in range(len(input_tuple1))
    )


def _float_tuple_list_close(input_tuple_list1, input_tuple_list2):
    return all(
        _float_tuple_close(input_tuple_list1[i], input_tuple_list2[i])
        for i in range(len(input_tuple_list1))
    )


def test_split_interval_list():
    interval_list1 = [(0.5, 3.0), (5.0, 9.0)]
    res1 = _split_interval_list(
        interval_list1,
        left_ignore_clip_length=0.3,
        right_ignore_clip_length=0.5,
        clip_length=0.6,
        skip_clip_length=0.1,
    )
    assert _float_tuple_list_close(
        res1,
        [
            (0.8, 1.4),
            (1.5, 2.1),
            (5.3, 5.9),
            (6.0, 6.6),
            (6.7, 7.3),
            (7.4, 8.0),
        ],
    )
@@ -0,0 +1,47 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.

"""
ffmpeg video conversion from 'asf' to 'mp4' and 'avi' without video quality loss;
referenced stackoverflow answer:
https://stackoverflow.com/questions/15049829/remux-to-mkv-but-add-all-streams-using-ffmpeg/15052662#15052662
"""

import argparse
import os
import sys

sys.path.append("lib")
from video_annotation_utils import video_format_conversion


def main(video_dir, output_dir):
    for output_format in ["mp4", "avi"]:
        output_sub_dir = os.path.join(output_dir, output_format)
        if not os.path.exists(output_sub_dir):
            os.makedirs(output_sub_dir)

        # get all the files in the directory
        for video_file in os.listdir(video_dir):
            if video_file[-3:] == "asf":
                video_path = os.path.join(video_dir, video_file)
                output_file_name = video_file[:-4] + ".{}".format(
                    output_format
                )
                output_path = os.path.join(output_sub_dir, output_file_name)
                video_format_conversion(
                    video_path, output_path, h264_format=True
                )


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("-i", "--input_dir", help="Input video dir")
    parser.add_argument(
        "-o",
        "--output_dir",
        help="Output dir where the converted videos will be stored",
        default="./outputs",
    )
    args = parser.parse_args()

    main(args.input_dir, args.output_dir)
@@ -0,0 +1,62 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.

ARG ENV="cpu"
ARG HOME="/root"

FROM ubuntu:18.04 AS cpu

ARG HOME
ENV HOME="${HOME}"
WORKDIR ${HOME}

# Install base dependencies
RUN apt-get update && \
    apt-get install -y curl git build-essential

# Install Anaconda
ARG ANACONDA="https://repo.continuum.io/miniconda/Miniconda3-4.6.14-Linux-x86_64.sh"
RUN curl ${ANACONDA} -o anaconda.sh && \
    /bin/bash anaconda.sh -b -p conda && \
    rm anaconda.sh
ENV PATH="${HOME}/conda/envs/cv/bin:${PATH}"

# Clone Computer Vision repo
ARG BRANCH="master"
RUN git clone --depth 1 --single-branch -b ${BRANCH} https://github.com/microsoft/computervision

# Setup Jupyter notebook configuration
ENV NOTEBOOK_CONFIG="${HOME}/.jupyter/jupyter_notebook_config.py"
RUN mkdir ${HOME}/.jupyter && \
    echo "c.NotebookApp.token = ''" >> ${NOTEBOOK_CONFIG} && \
    echo "c.NotebookApp.ip = '0.0.0.0'" >> ${NOTEBOOK_CONFIG} && \
    echo "c.NotebookApp.allow_root = True" >> ${NOTEBOOK_CONFIG} && \
    echo "c.NotebookApp.open_browser = False" >> ${NOTEBOOK_CONFIG} && \
    echo "c.MultiKernelManager.default_kernel_name = 'python3'" >> ${NOTEBOOK_CONFIG}


# GPU Stage
FROM nvidia/cuda:9.0-base AS gpu

ARG HOME
WORKDIR ${HOME}

COPY --from=cpu ${HOME} .

ENV PATH="${HOME}/conda/envs/cv/bin:${PATH}"


# Final Stage
FROM $ENV AS final

# Install Conda dependencies
RUN conda env create -f computervision/environment.yml && \
    conda clean -fay && \
    python -m ipykernel install --user --name 'cv' --display-name 'python3'

ARG HOME
WORKDIR ${HOME}/computervision

EXPOSE 8888
CMD ["jupyter", "notebook"]
@@ -0,0 +1,60 @@
Docker Support
==============
The Dockerfile in this directory will build Docker images with all the dependencies and code needed to run the example notebooks or unit tests included in this repository.

Multiple environments are supported by using [multistage builds](https://docs.docker.com/develop/develop-images/multistage-build/). In order to efficiently build the Docker images in this way, [Docker BuildKit](https://docs.docker.com/develop/develop-images/build_enhancements/) is necessary.
The following examples show how to build and run the Docker image for CPU and GPU environments. Note that on some platforms one needs to manually set the `DOCKER_BUILDKIT` environment variable to make sure the build runs correctly. On a Windows machine, for example, this can be done with the following PowerShell command before building the image:
```
$env:DOCKER_BUILDKIT=1
```

Once the container is running you can access Jupyter notebooks at http://localhost:8888.

Building and Running with Docker
--------------------------------

<details>
<summary><strong><em>CPU environment</em></strong></summary>

```
DOCKER_BUILDKIT=1 docker build -t computervision:cpu --build-arg ENV="cpu" .
docker run -p 8888:8888 -d computervision:cpu
```

</details>

<details>
<summary><strong><em>GPU environment</em></strong></summary>

```
DOCKER_BUILDKIT=1 docker build -t computervision:gpu --build-arg ENV="gpu" .
docker run --runtime=nvidia -p 8888:8888 -d computervision:gpu
```

</details>

Build Arguments
---------------

There are several build arguments which can change how the image is built. Similar to the `ENV` build argument, these are specified during the docker build command.

Build Arg|Description|
---------|-----------|
ENV|Environment to use, options: cpu, gpu|
BRANCH|Git branch of the repo to use (defaults to `master`)|
ANACONDA|Anaconda installation script (defaults to miniconda3 4.6.14)|

Example using the staging branch:

```
DOCKER_BUILDKIT=1 docker build -t computervision:cpu --build-arg ENV="cpu" --build-arg BRANCH="staging" .
```

In order to see detailed progress with BuildKit you can provide the flag ```--progress=plain``` during the build command.

Running tests with docker
-------------------------

```
docker run -it computervision:cpu pytest tests/unit
```
@@ -20,7 +20,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Open-source annotation tools for object detection and for image segmentation exist. When there is only one object per image, labeling is done using separate folders for each image class. However we have not found a good tool for image classification when it's possible to have multiple objects in a single image.\n",
"Open-source annotation tools for object detection and for image segmentation exist, however they are less common for image classification. When there is only one object per image, labeling can be done by moving images manually into separate folders, one for each image class. This strategy, however, is manual and does not work when a single image can contain multiple different objects. For such cases, either this notebook can be used, or e.g. this cloud-based [labeling tool](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-label-images).\n",
"\n",
"This notebook provides a simple UI to assist in labeling images. Each image can be annotated with one or more classes or be marked as \"Exclude\" to indicate that the image should not be used for model training or evaluation. "
]
@@ -5,6 +5,7 @@
This document tries to answer frequent questions related to object detection. For generic Machine Learning questions, such as "How many training examples do I need?" or "How to monitor GPU usage during training?" see also the image classification [FAQ](https://github.com/microsoft/ComputerVision/blob/master/classification/FAQ.md).

* General
  * [Why Torchvision?](#why-torchvision)

* Data
  * [How to annotate images?](#how-to-annotate-images)

@@ -18,6 +19,10 @@ This document tries to answer frequent questions related to object detection. Fo

## General

### Why Torchvision?

Torchvision has a large, active user base and hence its object detection implementation is easy to use, well tested, and uses state-of-the-art technology which has proven itself in the community. For these reasons we decided to use Torchvision as our object detection library. For advanced users who want to experiment with the latest cutting-edge technology, we recommend starting with our Torchvision notebooks and then also looking into more research-oriented implementations such as the [mmdetection](https://github.com/open-mmlab/mmdetection) repository.

## Data

### How to annotate images?
Binary file added: scenarios/detection/media/figures.pptx (191 KiB)
@@ -21,3 +21,19 @@ jobs:
    displayName: Add conda to PATH

  - template: templates/unit-test-steps-not-linuxgpu.yml # Template reference

  - script: |
      call conda env remove -n cv -y
      rmdir /s /q C:\Anaconda\envs\cv

    workingDirectory: tests
    displayName: 'Conda remove'
    continueOnError: true
    condition: succeededOrFailed()
    timeoutInMinutes: 10

  - script: |
      del /q /S %LOCALAPPDATA%\Temp\*
      for /d %%i in (%LOCALAPPDATA%\Temp\*) do @rmdir /s /q "%%i"
    displayName: 'Remove Temp Files'
    condition: succeededOrFailed()
@@ -16,4 +16,20 @@ jobs:
    name: cvbpwinpool

  steps:
  - template: templates/unit-test-steps-not-linuxgpu.yml # Template reference
  - template: templates/unit-test-steps-not-linuxgpu.yml # Template reference

  - script: |
      call conda env remove -n cv -y
      rmdir /s /q C:\Anaconda\envs\cv

    workingDirectory: tests
    displayName: 'Conda remove'
    continueOnError: true
    condition: succeededOrFailed()
    timeoutInMinutes: 10

  - script: |
      del /q /S %LOCALAPPDATA%\Temp\*
      for /d %%i in (%LOCALAPPDATA%\Temp\*) do @rmdir /s /q "%%i"
    displayName: 'Remove Temp Files'
    condition: succeededOrFailed()
@@ -1,44 +0,0 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.

# More info on scheduling: https://docs.microsoft.com/en-us/azure/devops/pipelines/build/triggers?view=azure-devops&tabs=yaml#scheduled-triggers
# Implementing the scheduler from the dashboard
# Uncomment in case this should be done from the yml instead
# schedules:
# - cron: "56 22 * * *"
#   displayName: Daily track of metrics
#   branches:
#     include:
#     - master
#   always: true


# no PR builds
pr: none

# no CI trigger
trigger: none

jobs:
- job: Repometrics
  pool:
    vmImage: 'ubuntu-16.04'

  steps:
  - task: UsePythonVersion@0
    inputs:
      versionSpec: '3.6'
      architecture: 'x64'

  - script: |
      cp tools/repo_metrics/config_template.py tools/repo_metrics/config.py
      sed -i 's#<GITHUB_TOKEN>#$(github_token)#' tools/repo_metrics/config.py
      sed -i "s#<CONNECTION_STRING>#`echo '$(cosmosdb_connectionstring)' | sed 's@&@\\\\&@g'`#" tools/repo_metrics/config.py
    displayName: Configure CosmosDB Connection

  - script: |
      python -m pip install 'python-dateutil>=2.8.0' 'pymongo>=3.8.0' 'gitpython>2.1.11' 'requests>=2.21.0'
      python tools/repo_metrics/track_metrics.py --github_repo 'https://github.com/microsoft/ComputerVision' --save_to_database
    displayName: Python script to record stats
@@ -139,6 +139,7 @@ def detection_notebooks():
        "11": os.path.join(
            folder_notebooks, "11_exploring_hyperparameters_on_azureml.ipynb"
        ),
        "12": os.path.join(folder_notebooks, "12_hard_negative_sampling.ipynb"),
    }
    return paths
@@ -26,3 +26,20 @@ def test_01_notebook_run(detection_notebooks):
    assert len(nb_output.scraps["training_losses"].data) == epochs
    assert nb_output.scraps["training_losses"].data[-1] < 0.5
    assert nb_output.scraps["training_average_precision"].data[-1] > 0.5

@pytest.mark.notebooks
@pytest.mark.linuxgpu
def test_12_notebook_run(detection_notebooks):
    notebook_path = detection_notebooks["12"]
    pm.execute_notebook(
        notebook_path,
        OUTPUT_NOTEBOOK,
        parameters=dict(PM_VERSION=pm.__version__, EPOCHS=3),
        kernel_name=KERNEL_NAME,
    )

    nb_output = sb.read_notebook(OUTPUT_NOTEBOOK)
    assert nb_output.scraps["valid_accs"].data[-1] > 0.5
    assert len(nb_output.scraps["valid_accs"].data) == 1
    assert len(nb_output.scraps["hard_im_scores"].data) == 10
@@ -22,7 +22,7 @@ def test_01_notebook_run(similarity_notebooks):
    )

    nb_output = sb.read_notebook(OUTPUT_NOTEBOOK)
    assert nb_output.scraps["median_rank"].data <= 10
    assert nb_output.scraps["median_rank"].data <= 15


@pytest.mark.notebooks
@@ -86,7 +86,7 @@ def test_detection_dataset_init_basic(tiny_od_data_path, od_data_path_labels):
    """ Tests that initialization of the Detection Dataset works. """
    data = DetectionDataset(tiny_od_data_path)
    validate_detection_dataset(data, od_data_path_labels)
    assert len(data.test_ds) == 20
    assert len(data.test_ds) == 19
    assert len(data.train_ds) == 20


@@ -96,7 +96,7 @@ def test_detection_dataset_init_train_pct(
    """ Tests initialization with train_pct. """
    data = DetectionDataset(tiny_od_data_path, train_pct=0.75)
    validate_detection_dataset(data, od_data_path_labels)
    assert len(data.test_ds) == 10
    assert len(data.test_ds) == 9
    assert len(data.train_ds) == 30


@@ -105,6 +105,11 @@ def test_detection_dataset_show_ims(basic_detection_dataset):
    basic_detection_dataset.show_ims()


def test_detection_dataset_show_im_transformations(basic_detection_dataset):
    # simply test that this is error free for now
    basic_detection_dataset.show_im_transformations()


def test_detection_dataset_init_anno_im_dirs(
    func_tiny_od_data_path, od_data_path_labels
):
@@ -5,6 +5,9 @@ from torchvision.models.detection.faster_rcnn import FasterRCNN
from collections.abc import Iterable
import numpy as np
import pytest
import shutil
from pathlib import Path
from typing import Union

from utils_cv.detection.bbox import DetectionBbox
from utils_cv.detection.model import (

@@ -128,3 +131,77 @@ def test_detection_dataset_predict_dl(
):
    """ Simply test that `predict_dl` works. """
    od_detection_learner.predict_dl(od_detection_dataset.test_dl)


def validate_saved_model(name: str, path: str) -> bool:
    """ Tests that saved model is there """
    assert (Path(path)).exists()
    assert (Path(path) / name).exists()
    assert (Path(path) / name / "meta.json").exists()
    assert (Path(path) / name / "model.pt").exists()


@pytest.mark.gpu
def test_detection_save_model(od_detection_learner, tiny_od_data_path):
    """ Test that save function works. """

    # test without path (default to using data_path()/models)
    model_name = "my_test_model"
    od_detection_learner.save(model_name)
    validate_saved_model(model_name, Path(tiny_od_data_path) / "models")

    # test with path
    od_detection_learner.save(
        model_name, str(Path(tiny_od_data_path) / "layer")
    )
    validate_saved_model(model_name, str(Path(tiny_od_data_path) / "layer"))

    # test with overwrite
    with pytest.raises(Exception):
        od_detection_learner.save(model_name, overwrite=False)


@pytest.mark.gpu
@pytest.fixture(scope="session")
def saved_model(od_detection_learner, tiny_od_data_path) -> Union[str, Path]:
    """ A saved model that the loading tests can reuse. """
    model_name = "test_fixture_model"
    od_detection_learner.save(model_name)
    assert (Path(tiny_od_data_path) / "models" / model_name).exists()
    return model_name, Path(tiny_od_data_path) / "models"


@pytest.mark.gpu
def test_detection_load_model(
    od_detection_learner, tiny_od_data_path, saved_model
):
    """ Test that load function works. """

    # test basic loading
    name, path = saved_model
    od_detection_learner.load(name=name)
    od_detection_learner.load(name=name, path=path)
    assert od_detection_learner.labels is not None

    # do not specify name or path, it should quietly exit
    with pytest.raises(SystemExit) as pytest_wrapped_e:
        od_detection_learner.load()
    assert pytest_wrapped_e.type == SystemExit

    # test if no model files exist
    shutil.rmtree(path / name)
    shutil.rmtree(Path(tiny_od_data_path) / "models")

    with pytest.raises(Exception):
        od_detection_learner.load()

    # test if only one model file exists in the `data_path`
    od_detection_learner.save("test_fixture_model")
    od_detection_learner.load()


@pytest.mark.gpu
def test_detection_init_from_saved_model(saved_model):
    """ Test that we can create a detection learner from a saved model. """
    name, path = saved_model
    DetectionLearner.from_saved_model(name, path)

@@ -29,8 +29,8 @@ def test_00_notebook_run(detection_notebooks):
    assert len(nb_output.scraps["detection_bounding_box"].data) > 0


@pytest.mark.notebooks
@pytest.mark.gpu
@pytest.mark.notebooks
def test_01_notebook_run(detection_notebooks, tiny_od_data_path):
    notebook_path = detection_notebooks["01"]
    pm.execute_notebook(

@@ -48,3 +48,25 @@ def test_01_notebook_run(detection_notebooks, tiny_od_data_path):
    nb_output = sb.read_notebook(OUTPUT_NOTEBOOK)
    assert len(nb_output.scraps["training_losses"].data) > 0
    assert len(nb_output.scraps["training_average_precision"].data) > 0


@pytest.mark.gpu
@pytest.mark.notebooks
def test_12_notebook_run(detection_notebooks, tiny_od_data_path):
    notebook_path = detection_notebooks["12"]
    pm.execute_notebook(
        notebook_path,
        OUTPUT_NOTEBOOK,
        parameters=dict(
            PM_VERSION=pm.__version__,
            DATA_PATH=tiny_od_data_path,
            EPOCHS=1,
            IM_SIZE=100,
        ),
        kernel_name=KERNEL_NAME,
    )

    nb_output = sb.read_notebook(OUTPUT_NOTEBOOK)
    assert len(nb_output.scraps["valid_accs"].data) == 1
    assert len(nb_output.scraps["hard_im_scores"].data) == 10

@@ -2,6 +2,7 @@
# Licensed under the MIT License.

import os
import copy
import math
from pathlib import Path
from random import randrange

@@ -9,6 +10,7 @@ from typing import List, Tuple, Union

import torch
from torch.utils.data import Dataset, Subset, DataLoader
from torchvision.transforms import ColorJitter
import xml.etree.ElementTree as ET
from PIL import Image


@@ -19,6 +21,26 @@ from .references.transforms import RandomHorizontalFlip, Compose, ToTensor
from utils_cv.common.gpu import db_num_workers


class ColorJitterTransform(object):
    """ Wrapper for torchvision's ColorJitter to make sure the 'target'
    object is passed along """

    def __init__(self, brightness, contrast, saturation, hue):
        self.brightness = brightness
        self.contrast = contrast
        self.saturation = saturation
        self.hue = hue

    def __call__(self, im, target):
        im = ColorJitter(
            brightness=self.brightness,
            contrast=self.contrast,
            saturation=self.saturation,
            hue=self.hue,
        )(im)
        return im, target


def get_transform(train: bool) -> List[object]:
    """ Gets the basic transformations to apply to images.

@@ -33,10 +55,22 @@ def get_transform(train: bool) -> List[object]:
        A list of transforms to apply.
    """
    transforms = []

    # transformations to apply before image is turned into a tensor
    if train:
        transforms.append(
            ColorJitterTransform(
                brightness=0.2, contrast=0.2, saturation=0.4, hue=0.05
            )
        )

    # transform im to tensor
    transforms.append(ToTensor())

    # transformations to apply after image is turned into a tensor
    if train:
        transforms.append(RandomHorizontalFlip(0.5))
        # TODO we can add more 'default' transformations here

    return Compose(transforms)

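Every transform in this pipeline takes and returns an (image, target) pair, which is why ColorJitterTransform wraps torchvision's ColorJitter rather than using it directly. A minimal usage sketch, assuming the module path utils_cv.detection.data and a target dict in the torchvision detection format (the image path is hypothetical):

# Hedged sketch: apply the composed training transforms to one sample.
import torch
from PIL import Image
from utils_cv.detection.data import get_transform  # assumed module path

im = Image.open("images/sample.jpg")  # hypothetical image
target = {
    "boxes": torch.tensor([[10.0, 20.0, 110.0, 220.0]]),  # xmin, ymin, xmax, ymax
    "labels": torch.tensor([1]),
}

tfms = get_transform(train=True)      # Compose of the transforms built above
im_tensor, target = tfms(im, target)  # im_tensor is now a CxHxW float tensor
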
@@ -59,7 +93,7 @@ def parse_pascal_voc_anno(

    # get image path from annotation. Note that the path field might not be set.
    anno_dir = os.path.dirname(anno_path)
    if root.find("path"):
    if root.find("path") is not None:
        im_path = os.path.realpath(
            os.path.join(anno_dir, root.find("path").text)
        )

@@ -73,10 +107,10 @@ def parse_pascal_voc_anno(
    for obj in objs:
        label = obj.find("name").text
        bnd_box = obj.find("bndbox")
        left = int(bnd_box.find('xmin').text)
        top = int(bnd_box.find('ymin').text)
        right = int(bnd_box.find('xmax').text)
        bottom = int(bnd_box.find('ymax').text)
        left = int(bnd_box.find("xmin").text)
        top = int(bnd_box.find("ymin").text)
        right = int(bnd_box.find("xmax").text)
        bottom = int(bnd_box.find("ymax").text)

        # Set mapping of label name to label index
        if labels is None:

@@ -99,7 +133,7 @@ def parse_pascal_voc_anno(
class DetectionDataset:
    """ An object detection dataset.

    The dunder methods __init__, __getitem__, and __len__ were inspired from code found here:
    The implementation of the dunder methods __init__, __getitem__, and __len__ was inspired by code found here:
    https://pytorch.org/tutorials/intermediate/torchvision_tutorial.html#writing-a-custom-dataset-for-pennfudan
    """

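The fields read above map one-to-one onto a Pascal VOC .xml file. A minimal sketch of such an annotation and of parsing it, assuming the module path utils_cv.detection.data and that the returned boxes expose left/top/right/bottom as used in the plotting code (file names and label are hypothetical):

# Hedged sketch: write a tiny Pascal VOC annotation and parse it.
import os
from utils_cv.detection.data import parse_pascal_voc_anno  # assumed module path

voc_xml = """<annotation>
    <path>../images/sheep_001.jpg</path>
    <object>
        <name>sheep</name>
        <bndbox>
            <xmin>34</xmin><ymin>58</ymin><xmax>212</xmax><ymax>190</ymax>
        </bndbox>
    </object>
</annotation>"""

os.makedirs("annotations", exist_ok=True)
with open("annotations/sheep_001.xml", "w") as f:
    f.write(voc_xml)

anno_bboxes, im_path = parse_pascal_voc_anno("annotations/sheep_001.xml")
print(im_path)                                   # resolved from the <path> field
print(anno_bboxes[0].left, anno_bboxes[0].top)   # 34 58
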
@@ -107,10 +141,12 @@ class DetectionDataset:
        self,
        root: Union[str, Path],
        batch_size: int = 2,
        transforms: object = get_transform(train=True),
        train_transforms: object = get_transform(train=True),
        test_transforms: object = get_transform(train=False),
        train_pct: float = 0.5,
        anno_dir: str = "annotations",
        im_dir: str = "images",
        allow_negatives: bool = False,
    ):
        """ initialize dataset

@@ -123,19 +159,22 @@ class DetectionDataset:
            root: the root path of the dataset containing the image and
                annotation folders
            batch_size: batch size for dataloaders
            transforms: the transformations to apply
            train_transforms: the transformations to apply to the train set
            test_transforms: the transformations to apply to the test set
            train_pct: the ratio of training to testing data
            anno_dir: the name of the annotation subfolder under the root directory
            im_dir: the name of the image subfolder under the root directory. If set to 'None' then infers image location from annotation .xml files
            allow_negatives: if false (default) then will throw an error if no annotation .xml file can be found for a given image. Otherwise use the image as a negative, i.e. assume that the image does not contain any of the objects of interest.
        """

        self.root = Path(root)
        # TODO think about how transforms are working...
        self.transforms = transforms
        self.train_transforms = train_transforms
        self.test_transforms = test_transforms
        self.im_dir = im_dir
        self.anno_dir = anno_dir
        self.batch_size = batch_size
        self.train_pct = train_pct
        self.allow_negatives = allow_negatives

        # read annotations
        self._read_annos()

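With this signature the dataset carries separate train and test transform pipelines and can tolerate images without annotation files. A minimal construction sketch, assuming the module path utils_cv.detection.data and a hypothetical root folder with images/ and annotations/ subfolders:

# Hedged usage sketch for the updated constructor.
from utils_cv.detection.data import DetectionDataset  # assumed module path

data = DetectionDataset(
    "odFridgeObjects",      # hypothetical root containing images/ and annotations/
    batch_size=2,
    train_pct=0.75,         # 75% of the images go to train_ds
    allow_negatives=True,   # images without an .xml file become negatives
)
print(len(data.train_ds), len(data.test_ds))
data.train_ds.dataset.show_im_transformations()   # transforms live on the subsets
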
@@ -146,21 +185,7 @@ class DetectionDataset:
        )

        # create training and validation data loaders
        self.train_dl = DataLoader(
            self.train_ds,
            batch_size=self.batch_size,
            shuffle=True,
            num_workers=db_num_workers(),
            collate_fn=collate_fn,
        )

        self.test_dl = DataLoader(
            self.test_ds,
            batch_size=self.batch_size,
            shuffle=False,
            num_workers=db_num_workers(),
            collate_fn=collate_fn,
        )
        self.init_data_loaders()

    def _read_annos(self) -> List[str]:
        """ Parses all Pascal VOC formatted annotation files to extract all

@@ -182,20 +207,33 @@ class DetectionDataset:
                os.path.splitext(s)[0] + ".xml" for s in im_filenames
            ]

        # Parse all annotations
        # Read all annotations
        self.im_paths = []
        self.anno_paths = []
        self.anno_bboxes = []
        for anno_idx, anno_filename in enumerate(anno_filenames):
            anno_path = self.root / self.anno_dir / str(anno_filename)
            assert os.path.exists(
                anno_path
            ), f"Cannot find annotation file: {anno_path}"
            anno_bboxes, im_path = parse_pascal_voc_anno(anno_path)

            # TODO For now, ignore all images without a single bounding box in it, otherwise throws error during training.
            # Parse annotation file if present
            if os.path.exists(anno_path):
                anno_bboxes, im_path = parse_pascal_voc_anno(anno_path)
            else:
                if not self.allow_negatives:
                    raise FileNotFoundError(anno_path)
                anno_bboxes = []
                im_path = im_paths[anno_idx]

            # Torchvision needs at least one ground truth bounding box per image. Hence for images without a single
            # annotated object, adding a tiny bounding box with "background" label 0.
            if len(anno_bboxes) == 0:
                continue
                anno_bboxes = [
                    AnnotationBbox.from_array(
                        [1, 1, 5, 5],
                        label_name=None,
                        label_idx=0,
                        im_path=im_path,
                    )
                ]

            if self.im_dir is None:
                self.im_paths.append(im_path)

@@ -209,15 +247,21 @@ class DetectionDataset:
        labels = []
        for anno_bboxes in self.anno_bboxes:
            for anno_bbox in anno_bboxes:
                labels.append(anno_bbox.label_name)
                if anno_bbox.label_name is not None:
                    labels.append(anno_bbox.label_name)
        self.labels = list(set(labels))

        # Set for each bounding box label name also what its integer representation is
        for anno_bboxes in self.anno_bboxes:
            for anno_bbox in anno_bboxes:
                anno_bbox.label_idx = (
                    self.labels.index(anno_bbox.label_name) + 1
                )
                if (
                    anno_bbox.label_name is None
                ):  # background rectangle is assigned id 0 by design
                    anno_bbox.label_idx = 0
                else:
                    anno_bbox.label_idx = (
                        self.labels.index(anno_bbox.label_name) + 1
                    )

    def split_train_test(
        self, train_pct: float = 0.8

@@ -231,20 +275,69 @@ class DetectionDataset:
        Return
            A training and testing dataset in that order
        """
        # TODO Is it possible to make these lines in split_train_test() a bit
        # more intuitive?

        test_num = math.floor(len(self) * (1 - train_pct))
        indices = torch.randperm(len(self)).tolist()

        self.transforms = get_transform(train=True)
        train = Subset(self, indices[test_num:])
        train = copy.deepcopy(Subset(self, indices[test_num:]))
        train.dataset.transforms = self.train_transforms

        self.transforms = get_transform(train=False)
        test = Subset(self, indices[: test_num + 1])
        test = copy.deepcopy(Subset(self, indices[:test_num]))
        test.dataset.transforms = self.test_transforms

        return train, test

    def init_data_loaders(self):
        """ Create training and validation data loaders """
        self.train_dl = DataLoader(
            self.train_ds,
            batch_size=self.batch_size,
            shuffle=True,
            num_workers=db_num_workers(),
            collate_fn=collate_fn,
        )

        self.test_dl = DataLoader(
            self.test_ds,
            batch_size=self.batch_size,
            shuffle=False,
            num_workers=db_num_workers(),
            collate_fn=collate_fn,
        )

    def add_images(
        self,
        im_paths: List[str],
        anno_bboxes: List[AnnotationBbox],
        target: str = "train",
    ):
        """ Add new images to either the training or test set.

        Args:
            im_paths: paths to the images.
            anno_bboxes: ground truth boxes for each image.
            target: specify if images are to be added to the training or test set. Valid options: "train" or "test".

        Raises:
            Exception if `target` variable is neither 'train' nor 'test'
        """
        assert len(im_paths) == len(anno_bboxes)
        for im_path, anno_bbox in zip(im_paths, anno_bboxes):
            self.im_paths.append(im_path)
            self.anno_bboxes.append(anno_bbox)
            if target.lower() == "train":
                self.train_ds.dataset.im_paths.append(im_path)
                self.train_ds.dataset.anno_bboxes.append(anno_bbox)
                self.train_ds.indices.append(len(self.im_paths) - 1)
            elif target.lower() == "test":
                self.test_ds.dataset.im_paths.append(im_path)
                self.test_ds.dataset.anno_bboxes.append(anno_bbox)
                self.test_ds.indices.append(len(self.im_paths) - 1)
            else:
                raise Exception(f"Target {target} unknown.")

        # Re-initialize the data loaders
        self.init_data_loaders()

    def show_ims(self, rows: int = 1, cols: int = 3) -> None:
        """ Show a set of images.

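add_images appends to the underlying dataset and to the chosen Subset's indices, then rebuilds both data loaders, so newly mined images take effect immediately. A minimal sketch of adding two negative images to the training set, assuming `data` is the DetectionDataset from the sketch above and that each entry of anno_bboxes is the list of boxes for one image, mirroring how _read_annos stores them (the paths are hypothetical):

# Hedged sketch: add two negative images, reusing the tiny "background" box
# (label_idx=0) that _read_annos assigns to images without annotated objects.
from utils_cv.detection.bbox import AnnotationBbox  # assumed import path

neg_im_paths = ["images/neg_001.jpg", "images/neg_002.jpg"]
neg_anno_bboxes = [
    [AnnotationBbox.from_array([1, 1, 5, 5], label_name=None, label_idx=0, im_path=p)]
    for p in neg_im_paths
]

data.add_images(neg_im_paths, neg_anno_bboxes, target="train")
print(len(data.train_ds))   # grew by two; train_dl/test_dl were re-created
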
@@ -256,6 +349,43 @@ class DetectionDataset:
        """
        plot_grid(display_bboxes, self._get_random_anno, rows=rows, cols=cols)

    def show_im_transformations(
        self, idx: int = None, rows: int = 1, cols: int = 3
    ) -> None:
        """ Show a set of images after transformations have been applied.

        Args:
            idx: the index of the image to show the transformations for.
            rows: number of rows to display
            cols: number of cols to display, NOTE: use 3 for best looking grid

        Returns None but displays a grid of randomly applied transformations.
        """
        if not hasattr(self, "transforms"):
            print(
                (
                    "Transformations are not applied to the base dataset object.\n"
                    "Call this function on either the train_ds or test_ds instead:\n\n"
                    " my_detection_data.train_ds.dataset.show_im_transformations()"
                )
            )
        else:
            if idx is None:
                idx = randrange(len(self.anno_paths))

            def plotter(im, ax):
                ax.set_xticks([])
                ax.set_yticks([])
                ax.imshow(im)

            def im_gen() -> torch.Tensor:
                return self[idx][0].permute(1, 2, 0)

            plot_grid(plotter, im_gen, rows=rows, cols=cols)

            print(f"Transformations applied on {self.im_paths[idx]}:")
            [print(transform) for transform in self.transforms.transforms]

    def _get_random_anno(
        self
    ) -> Tuple[List[AnnotationBbox], Union[str, Path]]:

@@ -1,8 +1,11 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.

import os
from typing import List, Tuple, Union, Generator, Optional
from pathlib import Path
import json
import shutil

from PIL import Image
import numpy as np

@@ -130,13 +133,16 @@ def get_pretrained_fasterrcnn(
    return model


def _calculate_ap(e: CocoEvaluator) -> float:
def _calculate_ap(
    e: CocoEvaluator, iou_threshold_idx: Union[int, slice] = slice(0, None)
) -> float:
    """ Calculate the Average Precision (AP) by averaging all iou
    thresholds across all labels.

    coco_eval.eval['precision'] is a 5-dimensional array. Each dimension
    represents the following:
    1. [T] 10 evenly distributed thresholds for IoU, from 0.5 to 0.95.
    1. [T] 10 evenly distributed thresholds for IoU, from 0.5 to 0.95. By
       default, we use slice(0, None) which is the average from 0.5 to 0.95.
    2. [R] 101 recall thresholds, from 0 to 100
    3. [K] label, set to slice(0, None) to get precision over all the labels in
       the dataset. Then take the mean over all labels.

@@ -147,7 +153,13 @@ def _calculate_ap(e: CocoEvaluator) -> float:
    Therefore, coco_eval.eval['precision'][0, :, 0, 0, 2] represents the value
    of 101 precisions corresponding to 101 recalls from 0 to 100 when IoU=0.5.
    """
    precision_settings = (slice(0, None), slice(0, None), slice(0, None), 0, 2)
    precision_settings = (
        iou_threshold_idx,
        slice(0, None),
        slice(0, None),
        0,
        2,
    )
    coco_eval = e.coco_eval["bbox"].eval["precision"]
    return np.mean(np.mean(coco_eval[precision_settings]))

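The indexing is easiest to see on a dummy precision array: axis 0 is the IoU threshold, so index 0 picks IoU=0.5 while slice(0, None) averages IoU 0.5:0.95. A small sketch with made-up numbers:

# Hedged sketch of the precision-array slicing with random values.
# Shape [T, R, K, A, M]: 10 IoU thresholds, 101 recalls, K labels, areas, max dets.
import numpy as np

precision = np.random.rand(10, 101, 3, 4, 3)

ap_all = np.mean(precision[slice(0, None), :, :, 0, 2])   # average over IoU 0.5:0.95
ap_50 = np.mean(precision[0, :, :, 0, 2])                 # iou_threshold_idx=0 -> IoU=0.5
print(ap_all, ap_50)
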
@@ -155,15 +167,46 @@ def _calculate_ap(e: CocoEvaluator) -> float:
class DetectionLearner:
    """ Detection Learner for Object Detection"""

    def __init__(self, dataset: Dataset, model: nn.Module = None):
        """ Initialize learner object. """
    def __init__(
        self,
        dataset: Dataset = None,
        model: nn.Module = None,
        im_size: int = None,
    ):
        """ Initialize learner object.

        You can only specify an image size `im_size` if `model` is not given.

        Args:
            dataset: the dataset. This class will infer labels if dataset is present.
            model: the nn.Module you wish to use
            im_size: image size for your model
        """
        # if model is None, dataset must not be
        if not model:
            assert dataset is not None

        # not allowed to specify im size if you're providing a model
        if model:
            assert im_size is None

        # if im_size is not specified, use 500
        if im_size is None:
            im_size = 500

        self.device = torch_device()
        self.model = model
        self.dataset = dataset
        self.im_size = im_size

        # setup model, default to fasterrcnn
        if self.model is None:
            self.model = get_pretrained_fasterrcnn(len(dataset.labels) + 1)
            self.model = get_pretrained_fasterrcnn(
                len(self.dataset.labels) + 1,
                min_size=self.im_size,
                max_size=self.im_size,
            )

        self.model.to(self.device)

    def __getattr__(self, attr):

@@ -175,6 +218,12 @@ class DetectionLearner:
            )
        )

    def add_labels(self, labels: List[str]):
        """ Add labels to this detector. This class does not expect a label
        '__background__' in the first element of the label list. Make sure it is
        omitted before adding it. """
        self.labels = labels

    def fit(
        self,
        epochs: int,

@@ -205,6 +254,7 @@ class DetectionLearner:
        # store data in these arrays to plot later
        self.losses = []
        self.ap = []
        self.ap_iou_point_5 = []

        # main training loop
        self.epochs = epochs

@@ -227,6 +277,7 @@ class DetectionLearner:
            # evaluate
            e = self.evaluate(dl=self.dataset.test_dl)
            self.ap.append(_calculate_ap(e))
            self.ap_iou_point_5.append(_calculate_ap(e, iou_threshold_idx=0))

    def plot_precision_loss_curves(
        self, figsize: Tuple[int, int] = (10, 5)

@@ -285,7 +336,8 @@ class DetectionLearner:
        model = self.model.eval()  # eval mode
        with torch.no_grad():
            pred = model([im])
        labels = self.dataset.labels

        labels = self.dataset.labels if self.dataset else self.labels
        det_bboxes = _get_det_bboxes(pred, labels=labels)

        # limit to threshold if threshold is set

@@ -345,3 +397,163 @@ class DetectionLearner:
                {"idx": im_idx, "det_bboxes": det_bboxes}
            )
            yield det_bbox_batch

    def save(
        self, name: str, path: str = None, overwrite: bool = True
    ) -> None:
        """ Saves the model

        Save your model in the following format:
        /data_path()
        +-- <name>
        |   +-- meta.json
        |   +-- model.pt

        The meta.json will contain information like the labels and the im_size
        The model.pt will contain the weights of the model

        Args:
            name: the name you wish to save your model under
            path: optional path to save your model to, will use `data_path`
                otherwise
            overwrite: overwrite existing models

        Raise:
            Exception if model file already exists but overwrite is set to
            false

        Returns None
        """
        if path is None:
            path = Path(self.dataset.root) / "models"

        # make dir if not exist
        if not Path(path).exists():
            os.mkdir(path)

        # make dir to contain all model/meta files
        model_path = Path(path) / name
        if model_path.exists():
            if overwrite:
                shutil.rmtree(str(model_path))
            else:
                raise Exception(
                    f"Model of {name} already exists in {path}. Set `overwrite=True` or use another name"
                )
        os.mkdir(model_path)

        # set names
        pt_path = model_path / f"model.pt"
        meta_path = model_path / f"meta.json"

        # save pt
        torch.save(self.model.state_dict(), pt_path)

        # save meta file
        meta_data = {"labels": self.dataset.labels, "im_size": self.im_size}
        with open(meta_path, "w") as meta_file:
            json.dump(meta_data, meta_file)

        print(f"Model is saved to {model_path}")

    def load(self, name: str = None, path: str = None) -> None:
        """ Loads a model.

        Loads a model that is saved in the format that is outputted in the
        `save` function.

        Args:
            name: The name of the model you wish to load. If no name is
                specified, the function will still look for a model under the path
                specified by `data_path`. If multiple models are available in that
                path, it will require you to pass in a name to specify which one to
                use.
            path: Pass in a path if the model is not located in the
                `data_path`. Otherwise it will assume that it is.

        Raise:
            Exception if passed in name/path is invalid and doesn't exist
        """

        # set path
        if not path:
            if self.dataset:
                path = Path(self.dataset.root) / "models"
            else:
                raise Exception("Specify a `path` parameter")

        # if name is given..
        if name:
            model_path = path / name

            pt_path = model_path / "model.pt"
            if not pt_path.exists():
                raise Exception(
                    f"No model file named model.pt exists in {model_path}"
                )

            meta_path = model_path / "meta.json"
            if not meta_path.exists():
                raise Exception(
                    f"No model file named meta.json exists in {model_path}"
                )

        # if no name is given, we assume there is only one model, otherwise we
        # throw an error
        else:
            models = [f.path for f in os.scandir(path) if f.is_dir()]

            if len(models) == 0:
                raise Exception(f"No model found in {path}.")
            elif len(models) > 1:
                print(
                    f"Multiple models were found in {path}. Please specify which you wish to use in the `name` argument."
                )
                for model in models:
                    print(model)
                exit()
            else:
                pt_path = Path(models[0]) / "model.pt"
                meta_path = Path(models[0]) / "meta.json"

        # load into model
        self.model.load_state_dict(
            torch.load(pt_path, map_location=torch_device())
        )

        # load meta info
        with open(meta_path, "r") as meta_file:
            meta_data = json.load(meta_file)
            self.labels = meta_data["labels"]

    @classmethod
    def from_saved_model(cls, name: str, path: str) -> "DetectionLearner":
        """ Create an instance of the DetectionLearner from a saved model.

        This function expects the format that is outputted in the `save`
        function.

        Args:
            name: the name of the model you wish to load
            path: the path to get your model from

        Returns:
            A DetectionLearner object that can run inference.
        """
        path = Path(path)

        meta_path = path / name / "meta.json"
        assert meta_path.exists()

        im_size, labels = None, None
        with open(meta_path) as json_file:
            meta_data = json.load(json_file)
            im_size = meta_data["im_size"]
            labels = meta_data["labels"]

        model = get_pretrained_fasterrcnn(
            len(labels) + 1, min_size=im_size, max_size=im_size
        )
        detection_learner = DetectionLearner(model=model)
        detection_learner.load(name=name, path=path)
        return detection_learner

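Together, save/load/from_saved_model give a simple round trip: persist the weights plus the labels and im_size in meta.json, then rebuild a learner for inference. A minimal sketch, assuming `learner` is an already-trained DetectionLearner and using a hypothetical model name:

# Hedged round-trip sketch.
from pathlib import Path

learner.save("my_test_model")   # writes <root>/models/my_test_model/{model.pt, meta.json}

restored = DetectionLearner.from_saved_model(
    "my_test_model", Path(learner.dataset.root) / "models"
)
print(restored.labels)          # labels restored from meta.json, ready for inference
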
@@ -56,7 +56,12 @@ def plot_boxes(
    """
    if len(bboxes) > 0:
        draw = ImageDraw.Draw(im)
        font = get_font(size=plot_settings.text_size)

        for bbox in bboxes:
            # do not draw background bounding boxes
            if bbox.label_idx == 0:
                continue

            box = [(bbox.left, bbox.top), (bbox.right, bbox.bottom)]


@@ -67,9 +72,6 @@ def plot_boxes(
                width=plot_settings.rect_th,
            )

            # gets font
            font = get_font(size=plot_settings.text_size)

            # write prediction class
            draw.text(
                (bbox.left, bbox.top),