Video Dataset / Model refactor + framework for action tests (#477)

* Removed submodules.

* Add back submodules using https://

* Update FAQ.md

* added object detection readme

* fixes to ic (#437)

* fixes to ic

* Matplotlib bug fix

* Matplotlib matrix plot bug fix
* Fix 01 notebook heatmap

* revert env

* revert env yml

* remove matplotlib

* simplified plotting functions

* fixed most tests

* fixed test

* fixed unit test

* small text edits to the 02 notebook

* added function description

* tiny cleanup on notebook

* Move r2p1d from contrib to scenarios.

* Update .gitignore.

* Add README.md

* Remove the folder /scenarios/action_recognition/data/samples; update notebook to use web url for sample data.

* Move data split files to data/misc; update notebook accordingly.

* Add pretrained keypoint model (#453)

* Add pretrained keypoint model

* Fix bugs in tests

* Add 03 notebook in conftest.py

* Minor revision

* Reformat code using black

* if folder exists, remove (#448)

* Update data path.

* Add mask annotation tool (#447)

* Add mask annotation tool

* Update mask annotation explanation and add conversion scripts

* Add screenshots of Labelbox annotation

* Rearrange screenshots

* Move conversion script into functions in data.py

* Point out annotation conversion scripts clearly in notebook

* Refine annotation conversion scripts

* Fix bugs

* Add tests for labelbox format conversion methods

* Update README.md

* Add keypoint detection with tuned model (#454)

* Add keypoint detection with tuned model

* Add tests

* Minor revision

* Update tests

* Fix bugs in tests

* Use GPU device if available

* Update tests

* Fix bug: 'not idx' will be 'True' if 'idx=0'

* Fix bugs

* Move toy keypoint meta into notebook

* Fix bugs

* Fix bugs

* Fix bugs in notebook

* Add descriptions for keypoint metadata

* Raise exception when RandomHorizontalFlip is used without specifying hflip_inds

* Add NOTICE file.

* Add keypoint detection model tuning with top and bottom keypoints (#456)

* Add keypoint detection model tuning with top and bottom keypoints

* Fix undefined unzip_url

* Resolved undefined od_urls

* Plot keypoints as round dots to make them noticeable (#458)

* Plot keypoints as dots

* Change variable naming

* Add annotation tool to scenarios.

* Resolve test machine failure (#460)

This is because the latest PyTorch (version 1.3) from conda is built against
CUDA 10.1, while the test machine has CUDA 10.0.

* Remove unused imports in 02_mask_rcnn.ipynb (#463)

* Remove unused imports in 02_mask_rcnn.ipynb

* Add missing imports

* Simplify binary_mask() (#464)

* remove conflict code (#471)

* Update README.md (#472)

* unit test for action rec

* reformat files

* added 01/02 notebooks

* fix all unit tests + abstract out commons from action rec

* dataset

* test data

* black reformat

* refactor action rec

* ignore /data

* notebook update

* update gitignore

* manage transforms better

* tfms_config defaults

* video dataset refactor + black

* notebook update with video dataset refactored out

* Refactor model/dataset

* clean up

* refactor + beautification

* re-run 02 notebook

* make tests work locally

* pr fixes

* PR fixes

* pr fixes

* pr fixes

* update env

* pr fix

* move decord to pip

* added ref to config

* update pr fix

* flake8 + pr bug

Co-authored-by: Lixun Zhang <lixun.zhang@microsoft.com>
Co-authored-by: Lixun <lixzhang@users.noreply.github.com>
Co-authored-by: PatrickBue <pabuehle@microsoft.com>
Co-authored-by: Simon Zhao <43029286+simonzhaoms@users.noreply.github.com>
Co-authored-by: Miguel González-Fierro <3491412+miguelgfierro@users.noreply.github.com>
JS 2020-03-26 13:06:55 -04:00, committed via GitHub
Parent 198b985581
Commit 8221c1659e
No key found matching this signature
GPG key ID: 4AEE18F83AFDEB23
28 changed files: 3123 additions and 1000 deletions

.gitignore (vendored)

@ -116,7 +116,6 @@ output.ipynb
# don't save any data
classification/data/*
/data/*
!/data/misc
!contrib/action_recognition/r2p1d/**
!contrib/crowd_counting/crowdcounting/data/
!scenarios/action_recognition/data


@ -1,4 +1,3 @@
#
# To create the conda environment:
# $ conda env create -f environment.yml
#
@ -36,8 +35,10 @@ dependencies:
- pre-commit>=1.14.4
- pyyaml>=5.1.2
- requests>=2.22.0
- einops==0.1.0
- cytoolz
- pip:
- decord==0.3.5
- nvidia-ml-py3
- nteract-scrapbook
- azureml-sdk[notebooks,contrib]>=1.0.30

(File diff hidden because one or more lines are too long.)

(File diff hidden because one or more lines are too long.)

(File diff hidden because one or more lines are too long.)


@ -1,316 +0,0 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Video Dataset Transformation \n",
"\n",
"In this notebook, we show examples of video dataset transformation"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%load_ext autoreload\n",
"%autoreload 2"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import sys\n",
"sys.path.append(\"../../\")\n",
"import os\n",
"import time\n",
"import decord\n",
"import matplotlib.pyplot as plt\n",
"import numpy as np\n",
"from sklearn.metrics import accuracy_score\n",
"import torch\n",
"import torch.cuda as cuda\n",
"import torch.nn as nn\n",
"import torchvision\n",
"import urllib.request\n",
"import shutil\n",
"\n",
"from utils_cv.action_recognition.data import show_batch, VideoDataset\n",
"from utils_cv.action_recognition.model import DEFAULT_MEAN, DEFAULT_STD\n",
"from utils_cv.action_recognition import system_info\n",
"from utils_cv.action_recognition.functional_video import denormalize\n",
"from utils_cv.action_recognition.transforms_video import (\n",
" CenterCropVideo, \n",
" NormalizeVideo,\n",
" RandomCropVideo,\n",
" RandomHorizontalFlipVideo,\n",
" RandomResizedCropVideo,\n",
" ResizeVideo,\n",
" ToTensorVideo,\n",
")\n",
"\n",
"system_info()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def show_clip(clip, size_factor=600):\n",
" \"\"\"Show frames in a clip\"\"\"\n",
" if isinstance(clip, torch.Tensor):\n",
" # Convert [C, T, H, W] tensor to [T, H, W, C] numpy array \n",
" clip = np.moveaxis(clip.numpy(), 0, -1)\n",
" \n",
" figsize = np.array([clip[0].shape[1]*len(clip), clip[0].shape[0]]) / size_factor\n",
" plt.tight_layout()\n",
" fig, axs = plt.subplots(1, len(clip), figsize=figsize)\n",
" for i, f in enumerate(clip):\n",
" axs[i].axis(\"off\")\n",
" axs[i].imshow(f)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Prepare a Sample Video\n",
"A sample video path:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"url = \"https://cvbp.blob.core.windows.net/public/datasets/action_recognition/drinking.mp4\"\n",
"VIDEO_PATH = os.path.join(\"../../data/drinking.mp4\")\n",
"# Download the file from `url` and save it locally under `file_name`:\n",
"with urllib.request.urlopen(url) as response, open(VIDEO_PATH, 'wb') as out_file:\n",
" shutil.copyfileobj(response, out_file)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"video_reader = decord.VideoReader(VIDEO_PATH)\n",
"video_length = len(video_reader)\n",
"print(\"Video length = {} frames\".format(video_length))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We use three frames (the first, middle, and the last) to quickly visualize video transformations."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"clip = [\n",
" video_reader[0].asnumpy(),\n",
" video_reader[video_length//2].asnumpy(),\n",
" video_reader[video_length-1].asnumpy(),\n",
"]\n",
"show_clip(clip)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# [T, H, W, C] numpy array to [C, T, H, W] tensor\n",
"t_clip = ToTensorVideo()(torch.from_numpy(np.array(clip)))\n",
"t_clip.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Video Transformations\n",
"\n",
"Resizing with the original ratio"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"show_clip(ResizeVideo(size=800)(t_clip))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Resizing"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"show_clip(ResizeVideo(size=800, keep_ratio=False)(t_clip))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Center cropping"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"show_clip(CenterCropVideo(size=800)(t_clip))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Random cropping"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"random_crop = RandomCropVideo(size=800)\n",
"show_clip(random_crop(t_clip))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"show_clip(random_crop(t_clip))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Random resized cropping"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"random_resized_crop = RandomResizedCropVideo(size=800)\n",
"show_clip(random_resized_crop(t_clip))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"show_clip(random_resized_crop(t_clip))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Normalizing (and denormalizing to verify)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"norm_t_clip = NormalizeVideo(mean=DEFAULT_MEAN, std=DEFAULT_STD)(t_clip)\n",
"show_clip(norm_t_clip)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"show_clip(denormalize(norm_t_clip, mean=DEFAULT_MEAN, std=DEFAULT_STD))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Horizontal flipping"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"show_clip(RandomHorizontalFlipVideo(p=.5)(t_clip))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "r2p1d",
"language": "python",
"name": "r2p1d"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}

(File diff hidden because one or more lines are too long.)


@ -73,6 +73,18 @@ def path_detection_notebooks():
)
def path_action_recognition_notebooks():
""" Returns the path of the action recognition notebooks folder. """
return os.path.abspath(
os.path.join(
os.path.dirname(__file__),
os.path.pardir,
"scenarios",
"action_recognition",
)
)
# ----- Module fixtures ----------------------------------------------------------
@ -82,39 +94,33 @@ def classification_notebooks():
# Path for the notebooks
paths = {
"00_webcam": os.path.join(folder_notebooks, "00_webcam.ipynb"),
"01_training_introduction": os.path.join(
folder_notebooks, "01_training_introduction.ipynb"
),
"02_multilabel_classification": os.path.join(
"00": os.path.join(folder_notebooks, "00_webcam.ipynb"),
"01": os.path.join(folder_notebooks, "01_training_introduction.ipynb"),
"02": os.path.join(
folder_notebooks, "02_multilabel_classification.ipynb"
),
"03_training_accuracy_vs_speed": os.path.join(
"03": os.path.join(
folder_notebooks, "03_training_accuracy_vs_speed.ipynb"
),
"10_image_annotation": os.path.join(
folder_notebooks, "10_image_annotation.ipynb"
),
"11_exploring_hyperparameters": os.path.join(
"10": os.path.join(folder_notebooks, "10_image_annotation.ipynb"),
"11": os.path.join(
folder_notebooks, "11_exploring_hyperparameters.ipynb"
),
"12_hard_negative_sampling": os.path.join(
"12": os.path.join(
folder_notebooks, "12_hard_negative_sampling.ipynb"
),
"20_azure_workspace_setup": os.path.join(
folder_notebooks, "20_azure_workspace_setup.ipynb"
),
"21_deployment_on_azure_container_instances": os.path.join(
"20": os.path.join(folder_notebooks, "20_azure_workspace_setup.ipynb"),
"21": os.path.join(
folder_notebooks,
"21_deployment_on_azure_container_instances.ipynb",
),
"22_deployment_on_azure_kubernetes_service": os.path.join(
"22": os.path.join(
folder_notebooks, "22_deployment_on_azure_kubernetes_service.ipynb"
),
"23_aci_aks_web_service_testing": os.path.join(
"23": os.path.join(
folder_notebooks, "23_aci_aks_web_service_testing.ipynb"
),
"24_exploring_hyperparameters_on_azureml": os.path.join(
"24": os.path.join(
folder_notebooks, "24_exploring_hyperparameters_on_azureml.ipynb"
),
}
@ -164,6 +170,20 @@ def detection_notebooks():
return paths
@pytest.fixture(scope="module")
def action_recognition_notebooks():
folder_notebooks = path_action_recognition_notebooks()
# Path for the notebooks
paths = {
"00": os.path.join(folder_notebooks, "00_webcam.ipynb"),
"01": os.path.join(folder_notebooks, "01_training_introduction.ipynb"),
"02": os.path.join(folder_notebooks, "02_training_hmbd.ipynb"),
"10": os.path.join(folder_notebooks, "10_video_transformation.ipynb"),
}
return paths
# ----- Function fixtures ----------------------------------------------------------
@ -378,7 +398,7 @@ def od_cup_path(tmp_session) -> str:
@pytest.fixture(scope="session")
def od_cup_mask_path(tmp_session) -> str:
""" Returns the path to the downloaded cup image. """
""" Returns the path to the downloaded cup mask image. """
im_url = (
"https://cvbp.blob.core.windows.net/public/images/cvbp_cup_mask.png"
)
@ -687,6 +707,22 @@ def od_detections(od_detection_dataset):
return learner.predict_dl(od_detection_dataset.test_dl, threshold=0)
# ------|-- Action Recognition ------------------------------------------------
@pytest.fixture(scope="session")
def ar_path(tmp_session) -> str:
""" Returns the path to the downloaded action recognition sample video. """
VID_URL = "https://cvbp.blob.core.windows.net/public/datasets/action_recognition/drinking.mp4"
vid_path = os.path.join(tmp_session, "drinking.mp4")
urllib.request.urlretrieve(VID_URL, vid_path)
return vid_path
# TODO
# ----- AML Settings ----------------------------------------------------------
@pytest.fixture(scope="session")
def coco_sample_path(tmpdir_factory) -> str:
""" Returns the path to a coco-formatted annotation. """
@ -695,9 +731,6 @@ def coco_sample_path(tmpdir_factory) -> str:
return path
# ----- AML Settings ----------------------------------------------------------
# TODO i can't find where this function is being used
def pytest_addoption(parser):
parser.addoption(
@ -767,3 +800,4 @@ def tiny_ic_databunch_valid_features(tiny_ic_databunch):
tiny_ic_databunch, DatasetType.Valid, learn, embedding_layer
)
return features
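For context, a minimal sketch of how a test could consume the new ar_path fixture above (the test name and assertion are hypothetical, not part of this PR):

import decord

def test_ar_path_video_readable(ar_path):
    # ar_path is the session-scoped fixture that downloads drinking.mp4
    reader = decord.VideoReader(ar_path)
    # the sample clip should contain at least one decodable frame
    assert len(reader) > 0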


@ -13,7 +13,7 @@ OUTPUT_NOTEBOOK = "output.ipynb"
@pytest.mark.notebooks
@pytest.mark.linuxgpu
def test_01_notebook_run(classification_notebooks):
notebook_path = classification_notebooks["01_training_introduction"]
notebook_path = classification_notebooks["01"]
pm.execute_notebook(
notebook_path,
OUTPUT_NOTEBOOK,
@ -30,7 +30,7 @@ def test_01_notebook_run(classification_notebooks):
@pytest.mark.notebooks
@pytest.mark.linuxgpu
def test_02_notebook_run(classification_notebooks):
notebook_path = classification_notebooks["02_multilabel_classification"]
notebook_path = classification_notebooks["02"]
pm.execute_notebook(
notebook_path,
OUTPUT_NOTEBOOK,
@ -48,7 +48,7 @@ def test_02_notebook_run(classification_notebooks):
@pytest.mark.notebooks
@pytest.mark.linuxgpu
def test_03_notebook_run(classification_notebooks):
notebook_path = classification_notebooks["03_training_accuracy_vs_speed"]
notebook_path = classification_notebooks["03"]
pm.execute_notebook(
notebook_path,
OUTPUT_NOTEBOOK,
@ -65,7 +65,7 @@ def test_03_notebook_run(classification_notebooks):
@pytest.mark.notebooks
@pytest.mark.linuxgpu
def test_11_notebook_run(classification_notebooks, tiny_ic_data_path):
notebook_path = classification_notebooks["11_exploring_hyperparameters"]
notebook_path = classification_notebooks["11"]
pm.execute_notebook(
notebook_path,
OUTPUT_NOTEBOOK,
@ -91,7 +91,7 @@ def test_11_notebook_run(classification_notebooks, tiny_ic_data_path):
@pytest.mark.notebooks
@pytest.mark.linuxgpu
def test_12_notebook_run(classification_notebooks):
notebook_path = classification_notebooks["12_hard_negative_sampling"]
notebook_path = classification_notebooks["12"]
pm.execute_notebook(
notebook_path,
OUTPUT_NOTEBOOK,


@ -23,7 +23,7 @@ def test_ic_20_notebook_run(
workspace_name,
workspace_region,
):
notebook_path = classification_notebooks["20_azure_workspace_setup"]
notebook_path = classification_notebooks["20"]
pm.execute_notebook(
notebook_path,
OUTPUT_NOTEBOOK,
@ -46,9 +46,7 @@ def test_ic_21_notebook_run(
workspace_name,
workspace_region,
):
notebook_path = classification_notebooks[
"21_deployment_on_azure_container_instances"
]
notebook_path = classification_notebooks["21"]
pm.execute_notebook(
notebook_path,
OUTPUT_NOTEBOOK,
@ -71,9 +69,7 @@ def test_ic_22_notebook_run(
workspace_name,
workspace_region,
):
notebook_path = classification_notebooks[
"22_deployment_on_azure_kubernetes_service"
]
notebook_path = classification_notebooks["22"]
pm.execute_notebook(
notebook_path,
OUTPUT_NOTEBOOK,
@ -96,7 +92,7 @@ def test_ic_23_notebook_run(
workspace_name,
workspace_region,
):
notebook_path = classification_notebooks["23_aci_aks_web_service_testing"]
notebook_path = classification_notebooks["23"]
pm.execute_notebook(
notebook_path,
OUTPUT_NOTEBOOK,
@ -119,9 +115,7 @@ def test_ic_24_notebook_run(
workspace_name,
workspace_region,
):
notebook_path = classification_notebooks[
"24_exploring_hyperparameters_on_azureml"
]
notebook_path = classification_notebooks["24"]
pm.execute_notebook(
notebook_path,
OUTPUT_NOTEBOOK,
@ -180,7 +174,7 @@ def test_od_20_notebook_run(
workspace_name,
workspace_region,
):
notebook_path = detection_notebooks["20_deployment_on_kubernetes"]
notebook_path = detection_notebooks["20"]
pm.execute_notebook(
notebook_path,
OUTPUT_NOTEBOOK,


@ -0,0 +1,23 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.
import os
from utils_cv.action_recognition.data import (
_DatasetSpec,
Urls,
)
from utils_cv.common.data import data_path
def test__DatasetSpec_kinetics():
""" Tests that _DatasetSpec initializes with the Kinetics-400 classes """
kinetics = _DatasetSpec(Urls.kinetics_label_map, 400)
kinetics.class_names
assert os.path.exists(str(data_path() / "label_map.txt"))
def test__DatasetSpec_hmdb():
""" Tests that _DatasetSpec initializes with the HMDB51 classes """
hmdb51 = _DatasetSpec(Urls.hmdb51_label_map, 51)
hmdb51.class_names
assert os.path.exists(str(data_path() / "label_map.txt"))


@ -0,0 +1,70 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.
# This test is based on the test suite implemented for Recommenders project
# https://github.com/Microsoft/Recommenders/tree/master/tests
import os
import papermill as pm
import pytest
import scrapbook as sb
# Unless manually modified, python3 should be
# the name of the current jupyter kernel
# that runs on the activated conda environment
KERNEL_NAME = "python3"
OUTPUT_NOTEBOOK = "output.ipynb"
@pytest.mark.notebooks
def test_00_notebook_run(action_recognition_notebooks):
notebook_path = action_recognition_notebooks["00"]
pm.execute_notebook(
notebook_path,
OUTPUT_NOTEBOOK,
parameters=dict(PM_VERSION=pm.__version__),
kernel_name=KERNEL_NAME,
)
nb_output = sb.read_notebook(OUTPUT_NOTEBOOK)
# TODO add some asserts like below
# assert nb_output.scraps["predicted_label"].data == "coffee_mug"
# assert nb_output.scraps["predicted_confidence"].data > 0.5
@pytest.mark.notebooks
def test_01_notebook_run(action_recognition_notebooks):
# TODO - this notebook relies on downloading hmdb51, so pass for now
pass
# notebook_path = action_recognition_notebooks["01"]
# pm.execute_notebook(
# notebook_path,
# OUTPUT_NOTEBOOK,
# parameters=dict(PM_VERSION=pm.__version__),
# kernel_name=KERNEL_NAME,
# )
# nb_output = sb.read_notebook(OUTPUT_NOTEBOOK)
# TODO add some asserts like below
# assert len(nb_output.scraps["training_accuracies"].data) == 1
@pytest.mark.notebooks
def test_02_notebook_run(action_recognition_notebooks):
pass
@pytest.mark.notebooks
def test_10_notebook_run(action_recognition_notebooks):
notebook_path = action_recognition_notebooks["10"]
pm.execute_notebook(
notebook_path,
OUTPUT_NOTEBOOK,
parameters=dict(PM_VERSION=pm.__version__),
kernel_name=KERNEL_NAME,
)
nb_output = sb.read_notebook(OUTPUT_NOTEBOOK)
# TODO add some asserts like below
# assert len(nb_output.scraps["training_accuracies"].data) == 1
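For reference, a hedged sketch of the scrapbook pattern the TODOs above refer to: the notebook glues a value and the test reads it back (the scrap name vid_pred_accuracy is hypothetical):

# inside a notebook cell (hypothetical scrap name):
#     sb.glue("vid_pred_accuracy", accuracy)
# inside the test, after pm.execute_notebook(...):
nb_output = sb.read_notebook(OUTPUT_NOTEBOOK)
assert nb_output.scraps["vid_pred_accuracy"].data > 0.0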


@ -18,7 +18,7 @@ OUTPUT_NOTEBOOK = "output.ipynb"
@pytest.mark.notebooks
def test_00_notebook_run(classification_notebooks):
notebook_path = classification_notebooks["00_webcam"]
notebook_path = classification_notebooks["00"]
pm.execute_notebook(
notebook_path,
OUTPUT_NOTEBOOK,
@ -33,7 +33,7 @@ def test_00_notebook_run(classification_notebooks):
@pytest.mark.notebooks
def test_01_notebook_run(classification_notebooks, tiny_ic_data_path):
notebook_path = classification_notebooks["01_training_introduction"]
notebook_path = classification_notebooks["01"]
pm.execute_notebook(
notebook_path,
OUTPUT_NOTEBOOK,
@ -52,7 +52,7 @@ def test_01_notebook_run(classification_notebooks, tiny_ic_data_path):
@pytest.mark.notebooks
def test_02_notebook_run(classification_notebooks, multilabel_ic_data_path):
notebook_path = classification_notebooks["02_multilabel_classification"]
notebook_path = classification_notebooks["02"]
pm.execute_notebook(
notebook_path,
OUTPUT_NOTEBOOK,
@ -71,7 +71,7 @@ def test_02_notebook_run(classification_notebooks, multilabel_ic_data_path):
@pytest.mark.notebooks
def test_03_notebook_run(classification_notebooks, tiny_ic_data_path):
notebook_path = classification_notebooks["03_training_accuracy_vs_speed"]
notebook_path = classification_notebooks["03"]
pm.execute_notebook(
notebook_path,
OUTPUT_NOTEBOOK,
@ -93,7 +93,7 @@ def test_03_notebook_run(classification_notebooks, tiny_ic_data_path):
@pytest.mark.notebooks
def test_10_notebook_run(classification_notebooks, tiny_ic_data_path):
notebook_path = classification_notebooks["10_image_annotation"]
notebook_path = classification_notebooks["10"]
pm.execute_notebook(
notebook_path,
OUTPUT_NOTEBOOK,
@ -110,7 +110,7 @@ def test_10_notebook_run(classification_notebooks, tiny_ic_data_path):
@pytest.mark.notebooks
def test_11_notebook_run(classification_notebooks, tiny_ic_data_path):
notebook_path = classification_notebooks["11_exploring_hyperparameters"]
notebook_path = classification_notebooks["11"]
pm.execute_notebook(
notebook_path,
OUTPUT_NOTEBOOK,
@ -131,7 +131,7 @@ def test_11_notebook_run(classification_notebooks, tiny_ic_data_path):
@pytest.mark.notebooks
def test_12_notebook_run(classification_notebooks, tiny_ic_data_path):
notebook_path = classification_notebooks["12_hard_negative_sampling"]
notebook_path = classification_notebooks["12"]
pm.execute_notebook(
notebook_path,
OUTPUT_NOTEBOOK,


@ -8,6 +8,7 @@ from utils_cv.common.gpu import (
is_linux,
is_windows,
which_processor,
system_info,
)
@ -39,3 +40,7 @@ def test_db_num_workers():
else:
assert db_num_workers() == 16
assert db_num_workers(non_windows_num_workers=7) == 7
def test_system_info():
system_info()


@ -2,12 +2,13 @@
# Licensed under the MIT License.
import os
import pytest
from pathlib import Path
from PIL import ImageFont
from fastai.vision import ImageList
from utils_cv.common.gpu import db_num_workers
from utils_cv.common.misc import copy_files, set_random_seed, get_font
from utils_cv.common.misc import copy_files, set_random_seed, get_font, Config
def test_set_random_seed(tiny_ic_data_path):
@ -75,3 +76,21 @@ def test_get_font():
type(font) == ImageFont.FreeTypeFont
or type(font) == ImageFont.ImageFont
)
def test_Config():
# test dictionary wrapper to make sure keys can be accessed as attributes
cfg = Config({"lr": 0.01, "momentum": 0.95})
assert cfg.lr == 0.01 and cfg.momentum == 0.95
cfg = Config(lr=0.01, momentum=0.95)
assert cfg.lr == 0.01 and cfg.momentum == 0.95
cfg = Config({"lr": 0.01}, momentum=0.95)
assert cfg.lr == 0.01 and cfg.momentum == 0.95
cfg_wrapper = Config(cfg, epochs=3)
assert (
cfg_wrapper.lr == 0.01
and cfg_wrapper.momentum == 0.95
and cfg_wrapper.epochs == 3
)
with pytest.raises(ValueError):
Config(3)


@ -1 +0,0 @@
from .common import Config, system_info


@ -1,51 +0,0 @@
# Copyright (c) Microsoft
# Licensed under the MIT License.
import sys
import torch
import torch.cuda as cuda
import torchvision
class Config(object):
def __init__(self, config=None, **extras):
"""Dictionary wrapper to access keys as attributes.
Args:
config (dict or Config): Configurations
extras (kwargs): Extra configurations
Examples:
>>> cfg = Config({'lr': 0.01}, momentum=0.95)
or
>>> cfg = Config({'lr': 0.01, 'momentum': 0.95})
then, use as follows:
>>> print(cfg.lr, cfg.momentum)
"""
if config is not None:
if isinstance(config, dict):
for k in config:
setattr(self, k, config[k])
elif isinstance(config, self.__class__):
self.__dict__ = config.__dict__.copy()
else:
raise ValueError("Unknown config")
for k, v in extras.items():
setattr(self, k, v)
def get(self, key, default):
return getattr(self, key, default)
def system_info():
print(sys.version, "\n")
print("PyTorch {}".format(torch.__version__), "\n")
print("Torch-vision {}".format(torchvision.__version__), "\n")
print("Available devices:")
if cuda.is_available():
for i in range(cuda.device_count()):
print("{}: {}".format(i, cuda.get_device_name(i)))
else:
print("CPUs")


@ -3,40 +3,35 @@
import os
from pathlib import Path
from typing import Union, List
from urllib.request import urlretrieve
import warnings
import decord
from einops.layers.torch import Rearrange
import matplotlib.pyplot as plt
import numpy as np
from numpy.random import randint
import torch
from torch.utils.data import Dataset
from torchvision.transforms import Compose
from . import transforms_video as transforms
from .functional_video import denormalize
DEFAULT_MEAN = (0.43216, 0.394666, 0.37645)
DEFAULT_STD = (0.22803, 0.22145, 0.216989)
from ..common.data import data_path
class _DatasetSpec:
def __init__(self, label_url, root, num_classes):
""" Properties of a Video Dataset. """
def __init__(
self,
label_url: str,
num_classes: int,
data_path: Union[Path, str] = data_path(),
) -> None:
self.label_url = label_url
self.root = root
self.num_classes = num_classes
self.data_path = data_path
self._class_names = None
@property
def class_names(self):
def class_names(self) -> List[str]:
if self._class_names is None:
label_filepath = os.path.join(self.root, "label_map.txt")
label_filepath = os.path.join(self.data_path, "label_map.txt")
if not os.path.isfile(label_filepath):
os.makedirs(self.root, exist_ok=True)
urlretrieve(self.label_url, label_filepath)
os.makedirs(self.data_path, exist_ok=True)
else:
os.remove(label_filepath)
urlretrieve(self.label_url, label_filepath)
with open(label_filepath) as f:
self._class_names = [l.strip() for l in f]
assert len(self._class_names) == self.num_classes
@ -44,259 +39,15 @@ class _DatasetSpec:
return self._class_names
class Urls:
kinetics_label_map = "https://github.com/microsoft/ComputerVision/files/3746975/kinetics400_lable_map.txt"
hmdb51_label_map = "https://github.com/microsoft/ComputerVision/files/3746963/hmdb51_label_map.txt"
KINETICS = _DatasetSpec(
"https://github.com/microsoft/ComputerVision/files/3746975/kinetics400_lable_map.txt",
os.path.join("data", "kinetics400"),
400
Urls.kinetics_label_map, 400, os.path.join("data", "kinetics400"),
)
HMDB51 = _DatasetSpec(
"https://github.com/microsoft/ComputerVision/files/3746963/hmdb51_label_map.txt",
os.path.join("data", "hmdb51"),
51
Urls.hmdb51_label_map, 51, os.path.join("data", "hmdb51"),
)
class VideoRecord(object):
def __init__(self, row):
self._data = row
self._num_frames = -1
@property
def path(self):
return self._data[0]
@property
def num_frames(self):
if self._num_frames == -1:
self._num_frames = int(len([x for x in Path(self._data[0]).glob('img_*')]) - 1)
return self._num_frames
@property
def label(self):
return int(self._data[1])
class VideoDataset(Dataset):
"""
Args:
split_file (str): Annotation file containing video filenames and labels.
video_dir (str): Videos directory.
num_segments (int): Number of clips to sample from each video.
sample_length (int): Number of consecutive frames to sample from a video (i.e. clip length).
sample_step (int): Sampling step.
input_size (int or tuple): Model input image size.
im_scale (int or tuple): Resize target size.
resize_keep_ratio (bool): If True, keep the original ratio when resizing.
mean (tuple): Normalization mean.
std (tuple): Normalization std.
random_shift (bool): Random temporal shift when sample a clip.
temporal_jitter (bool): Randomly skip frames when sampling each frames.
flip_ratio (float): Horizontal flip ratio.
random_crop (bool): If False, do center-crop.
random_crop_scales (tuple): Range of size of the origin size random cropped.
video_ext (str): Video file extension.
warning (bool): On or off warning.
"""
def __init__(
self,
split_file,
video_dir,
num_segments=1,
sample_length=8,
sample_step=1,
input_size=112,
im_scale=128,
resize_keep_ratio=True,
mean=DEFAULT_MEAN,
std=DEFAULT_STD,
random_shift=False,
temporal_jitter=False,
flip_ratio=0.5,
random_crop=False,
random_crop_scales=(0.6, 1.0),
video_ext="mp4",
warning=False,
):
# TODO maybe check wrong arguments to early failure
assert sample_step > 0
assert num_segments > 0
self.video_dir = video_dir
self.video_records = [
VideoRecord(x.strip().split(" ")) for x in open(split_file)
]
self.num_segments = num_segments
self.sample_length = sample_length
self.sample_step = sample_step
self.presample_length = sample_length * sample_step
# Temporal noise
self.random_shift = random_shift
self.temporal_jitter = temporal_jitter
# Video transforms
# 1. resize
trfms = [
transforms.ToTensorVideo(),
transforms.ResizeVideo(im_scale, resize_keep_ratio),
]
# 2. crop
if random_crop:
if random_crop_scales is not None:
crop = transforms.RandomResizedCropVideo(input_size, random_crop_scales)
else:
crop = transforms.RandomCropVideo(input_size)
else:
crop = transforms.CenterCropVideo(input_size)
trfms.append(crop)
# 3. flip
trfms.append(transforms.RandomHorizontalFlipVideo(flip_ratio))
# 4. normalize
trfms.append(transforms.NormalizeVideo(mean, std))
self.transforms = Compose(trfms)
self.video_ext = video_ext
self.warning = warning
def __len__(self):
return len(self.video_records)
def _sample_indices(self, record):
"""
Args:
record (VideoRecord): A video record.
Return:
list: Segment offsets (start indices)
"""
if record.num_frames > self.presample_length:
if self.random_shift:
# Random sample
offsets = np.sort(
randint(
record.num_frames - self.presample_length + 1,
size=self.num_segments,
)
)
else:
# Uniform sample
distance = (record.num_frames - self.presample_length + 1) / self.num_segments
offsets = np.array(
[int(distance / 2.0 + distance * x) for x in range(self.num_segments)]
)
else:
if self.warning:
warnings.warn(
"num_segments and/or sample_length > num_frames in {}".format(
record.path
)
)
offsets = np.zeros((self.num_segments,), dtype=int)
return offsets
def _get_frames(self, video_reader, offset):
clip = list()
# decord.seek() seems to have a bug. use seek_accurate().
video_reader.seek_accurate(offset)
# first frame
clip.append(video_reader.next().asnumpy())
# remaining frames
try:
if self.temporal_jitter:
for i in range(self.sample_length - 1):
step = randint(self.sample_step + 1)
if step == 0:
clip.append(clip[-1].copy())
else:
if step > 1:
video_reader.skip_frames(step - 1)
cur_frame = video_reader.next().asnumpy()
if len(cur_frame.shape) != 3:
# maybe end of the video
break
clip.append(cur_frame)
else:
for i in range(self.sample_length - 1):
if self.sample_step > 1:
video_reader.skip_frames(self.sample_step - 1)
cur_frame = video_reader.next().asnumpy()
if len(cur_frame.shape) != 3:
# maybe end of the video
break
clip.append(cur_frame)
except StopIteration:
pass
# if clip needs more frames, simply duplicate the last frame in the clip.
while len(clip) < self.sample_length:
clip.append(clip[-1].copy())
return clip
def __getitem__(self, idx):
"""
Return:
clips (torch.tensor), label (int)
"""
record = self.video_records[idx]
video_reader = decord.VideoReader(
"{}.{}".format(os.path.join(self.video_dir, record.path), self.video_ext),
# TODO try to add `ctx=decord.ndarray.gpu(0) or .cuda(0)`
)
record._num_frames = len(video_reader)
offsets = self._sample_indices(record)
clips = np.array([self._get_frames(video_reader, o) for o in offsets])
if self.num_segments == 1:
# [T, H, W, C] -> [C, T, H, W]
return self.transforms(torch.from_numpy(clips[0])), record.label
else:
# [S, T, H, W, C] -> [S, C, T, H, W]
return (
torch.stack([
self.transforms(torch.from_numpy(c)) for c in clips
]),
record.label
)
def show_batch(batch, sample_length, mean=DEFAULT_MEAN, std=DEFAULT_STD):
"""
Args:
batch (list[torch.tensor]): List of sample (clip) tensors
sample_length (int): Number of frames to show for each sample
mean (tuple): Normalization mean
std (tuple): Normalization std-dev
"""
batch_size = len(batch)
plt.tight_layout()
fig, axs = plt.subplots(
batch_size,
sample_length,
figsize=(4 * sample_length, 3 * batch_size)
)
for i, ax in enumerate(axs):
if batch_size == 1:
clip = batch[0]
else:
clip = batch[i]
clip = Rearrange("c t h w -> t c h w")(clip)
if not isinstance(ax, np.ndarray):
ax = [ax]
for j, a in enumerate(ax):
a.axis("off")
a.imshow(
np.moveaxis(
denormalize(
clip[j],
mean,
std,
).numpy(),
0,
-1,
)
)


@ -0,0 +1,498 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.
import os
import copy
from pathlib import Path
import warnings
from typing import Callable, Tuple, Union, List
import decord
from einops.layers.torch import Rearrange
import matplotlib.pyplot as plt
import numpy as np
from numpy.random import randint
import torch
from torch.utils.data import Dataset, Subset, DataLoader
from torchvision.transforms import Compose
from .references import transforms_video as transforms
from .references.functional_video import denormalize
from ..common.misc import Config
from ..common.gpu import num_devices
Trans = Callable[[object, dict], Tuple[object, dict]]
DEFAULT_MEAN = (0.43216, 0.394666, 0.37645)
DEFAULT_STD = (0.22803, 0.22145, 0.216989)
class VideoRecord(object):
"""
This class is used for parsing split-files where each row contains a path
and a label:
Ex:
```
path/to/my/clip.mp4 3
path/to/another/clip.mp4 32
```
"""
def __init__(self, data: List[str]):
""" Initializes a VideoRecord
Args:
data: a list whose first element is the path and whose second element is
the label
"""
self._data = data
self._num_frames = None
@property
def path(self) -> str:
return self._data[0]
@property
def num_frames(self) -> int:
if self._num_frames is None:
self._num_frames = int(
len([x for x in Path(self._data[0]).glob("img_*")]) - 1
)
return self._num_frames
@property
def label(self) -> int:
return int(self._data[1])
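As a hedged illustration (the path and label below are made up), a split-file row parsed into a VideoRecord behaves as follows:

row = "path/to/my/clip.mp4 3".strip().split(" ")
record = VideoRecord(row)
assert record.path == "path/to/my/clip.mp4"
assert record.label == 3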
def get_transforms(train: bool, tfms_config: Config = None) -> Trans:
""" Get default transformations to apply depending on whether we're applying it to the training or the validation set. If no tfms configurations are passed in, use the defaults.
Args:
train: whether or not this is for training
tfms_config: Config object with transforms-related configs
Returns:
A composed transform pipeline to apply
"""
if tfms_config is None:
tfms_config = (
get_default_tfms_config(train=True)
if train
else get_default_tfms_config(train=False)
)
# 1. resize
tfms = [
transforms.ToTensorVideo(),
transforms.ResizeVideo(
tfms_config.im_scale, tfms_config.resize_keep_ratio
),
]
# 2. crop
if tfms_config.random_crop:
if tfms_config.random_crop_scales:
crop = transforms.RandomResizedCropVideo(
tfms_config.input_size, tfms_config.random_crop_scales
)
else:
crop = transforms.RandomCropVideo(tfms_config.input_size)
else:
crop = transforms.CenterCropVideo(tfms_config.input_size)
tfms.append(crop)
# 3. flip
tfms.append(transforms.RandomHorizontalFlipVideo(tfms_config.flip_ratio))
# 4. normalize
tfms.append(transforms.NormalizeVideo(tfms_config.mean, tfms_config.std))
return Compose(tfms)
def get_default_tfms_config(train: bool) -> Config:
"""
Args:
train: whether or not this is for training
Settings:
input_size (int or tuple): Model input image size.
im_scale (int or tuple): Resize target size.
resize_keep_ratio (bool): If True, keep the original ratio when resizing.
mean (tuple): Normalization mean.
std (tuple): Normalization std.
if train:
flip_ratio (float): Horizontal flip ratio.
random_crop (bool): If False, do center-crop.
random_crop_scales (tuple): Range of the original size to randomly crop.
"""
flip_ratio = 0.5 if train else 0.0
random_crop = True if train else False
random_crop_scales = (0.6, 1.0) if train else None
return Config(
dict(
input_size=112,
im_scale=128,
resize_keep_ratio=True,
mean=DEFAULT_MEAN,
std=DEFAULT_STD,
flip_ratio=flip_ratio,
random_crop=random_crop,
random_crop_scales=random_crop_scales,
)
)
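A short usage sketch (the overridden values are illustrative, not defaults mandated by this PR): start from the default training config, tweak a couple of settings, and build the transform pipeline.

custom_cfg = get_default_tfms_config(train=True)
custom_cfg.flip_ratio = 0.0    # e.g. disable horizontal flips
custom_cfg.im_scale = 156      # e.g. resize to a larger scale before cropping
train_tfms = get_transforms(train=True, tfms_config=custom_cfg)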
class VideoDataset:
""" A video recognition dataset. """
def __init__(
self,
root: str,
train_pct: float = 0.75,
num_samples: int = 1,
sample_length: int = 8,
sample_step: int = 1,
temporal_jitter: bool = True,
temporal_jitter_step: int = 2,
random_shift: bool = True,
batch_size: int = 8,
video_ext: str = "mp4",
warning: bool = False,
train_split_file: str = None,
test_split_file: str = None,
train_transforms: Trans = get_transforms(train=True),
test_transforms: Trans = get_transforms(train=False),
) -> None:
""" Initializes the dataset.
Args:
root: Videos directory.
train_pct: percentage of dataset to use for training
num_samples: Number of clips to sample from each video.
sample_length: Number of consecutive frames to sample from a video (i.e. clip length).
sample_step: Sampling step.
temporal_jitter: Randomly skip frames when sampling each frame.
temporal_jitter_step: temporal jitter step in frames
random_shift: Apply a random temporal shift when sampling a clip.
video_ext: Video file extension.
warning: Whether to show warnings.
train_split_file: Annotation file containing video filenames and labels.
test_split_file: Annotation file containing video filenames and labels.
train_transforms: transforms for training
test_transforms: transforms for testing
"""
# TODO check wrong arguments early to prevent failure
assert sample_step > 0
assert num_samples > 0
if temporal_jitter:
assert temporal_jitter_step > 0
if train_split_file:
assert Path(train_split_file).exists()
assert (
test_split_file is not None and Path(test_split_file).exists()
)
if test_split_file:
assert Path(test_split_file).exists()
assert (
train_split_file is not None
and Path(train_split_file).exists()
)
self.root = root
self.num_samples = num_samples
self.sample_length = sample_length
self.sample_step = sample_step
self.presample_length = sample_length * sample_step
self.temporal_jitter_step = temporal_jitter_step
self.train_transforms = train_transforms
self.test_transforms = test_transforms
self.random_shift = random_shift
self.temporal_jitter = temporal_jitter
self.batch_size = batch_size
self.video_ext = video_ext
self.warning = warning
# create training and validation datasets
self.train_ds, self.test_ds = (
self.split_with_file(
train_split_file=train_split_file,
test_split_file=test_split_file,
)
if train_split_file
else self.split_train_test(train_pct=train_pct)
)
# initialize dataloaders
self.init_data_loaders()
def split_train_test(
self, train_pct: float = 0.8
) -> Tuple[Dataset, Dataset]:
""" Split this dataset into a training and testing set
Args:
train_pct: the ratio of videos to use for training vs
testing
Return
A training and testing dataset in that order
"""
pass
def split_with_file(
self,
train_split_file: Union[Path, str],
test_split_file: Union[Path, str],
) -> Tuple[Dataset, Dataset]:
""" Split this dataset into a training and testing set using a split file.
Each line in the split file must use the form:
```
path/to/jumping/video.mp4 3
path/to/swimming/video.mp4 5
path/to/another/jumping/video.mp4 3
```
Args:
train_split_file: path to the training split file
test_split_file: path to the testing split file
Return:
A training and testing dataset in that order
"""
self.video_records = []
# add train records
self.video_records.extend(
[
VideoRecord(row.strip().split(" "))
for row in open(train_split_file)
]
)
train_len = len(self.video_records)
# add validation records
self.video_records.extend(
[
VideoRecord(row.strip().split(" "))
for row in open(test_split_file)
]
)
# create indices
indices = torch.arange(0, len(self.video_records))
train_range = indices[:train_len]
test_range = indices[train_len:]
# create train subset
train = copy.deepcopy(Subset(self, train_range))
train.dataset.transforms = self.train_transforms
train.dataset.sample_step = (
self.temporal_jitter_step
if self.temporal_jitter
else self.sample_step
)
train.dataset.presample_length = self.sample_length * self.sample_step
# create test subset
test = copy.deepcopy(Subset(self, test_range))
test.dataset.transforms = self.test_transforms
test.dataset.random_shift = False
test.dataset.temporal_jitter = False
return train, test
def init_data_loaders(self) -> None:
""" Create training and validation data loaders. """
devices = num_devices()
self.train_dl = DataLoader(
self.train_ds,
batch_size=self.batch_size * devices,
shuffle=True,
num_workers=0,  # Torch 1.2 has a bug when num_workers > 0 (0 means load in the main process)
pin_memory=True,
)
self.test_dl = DataLoader(
self.test_ds,
batch_size=self.batch_size * devices,
shuffle=False,
num_workers=0,
pin_memory=True,
)
def __len__(self) -> int:
return len(self.video_records)
def _sample_indices(self, record: VideoRecord) -> List[int]:
"""
Create a list of frame-wise offsets into a video record. Depending on
whether or not 'random shift' is used, perform a uniform sample or a
random sample.
Args:
record (VideoRecord): A video record.
Return:
list: Segment offsets (start indices)
"""
if record.num_frames > self.presample_length:
if self.random_shift:
# Random sample
offsets = np.sort(
randint(
record.num_frames - self.presample_length + 1,
size=self.num_samples,
)
)
else:
# Uniform sample
distance = (
record.num_frames - self.presample_length + 1
) / self.num_samples
offsets = np.array(
[
int(distance / 2.0 + distance * x)
for x in range(self.num_samples)
]
)
else:
if self.warning:
warnings.warn(
f"num_samples and/or sample_length > num_frames in {record.path}"
)
offsets = np.zeros((self.num_samples,), dtype=int)
return offsets
def _get_frames(
self, video_reader: decord.VideoReader, offset: int,
) -> List[np.ndarray]:
""" Get frames at sample length.
Args:
video_reader: the decord tool for parsing videos
offset: where to start the reader from
Returns
Frames at sample length in a List
"""
clip = list()
# decord.seek() seems to have a bug. use seek_accurate().
video_reader.seek_accurate(offset)
# first frame
clip.append(video_reader.next().asnumpy())
# remaining frames
try:
for i in range(self.sample_length - 1):
step = (
randint(self.sample_step + 1)
if self.temporal_jitter
else self.sample_step
)
if step == 0 and self.temporal_jitter:
clip.append(clip[-1].copy())
else:
if step > 1:
video_reader.skip_frames(step - 1)
cur_frame = video_reader.next().asnumpy()
clip.append(cur_frame)
except StopIteration:
# pass when video has ended
pass
# if clip needs more frames, simply duplicate the last frame in the clip.
while len(clip) < self.sample_length:
clip.append(clip[-1].copy())
return clip
def __getitem__(self, idx: int) -> Tuple[torch.tensor, int]:
"""
Return:
clips (torch.tensor), label (int)
"""
record = self.video_records[idx]
video_reader = decord.VideoReader(
"{}.{}".format(
os.path.join(self.root, record.path), self.video_ext
),
# TODO try to add `ctx=decord.ndarray.gpu(0) or .cuda(0)`
)
record._num_frames = len(video_reader)
offsets = self._sample_indices(record)
clips = np.array([self._get_frames(video_reader, o) for o in offsets])
if self.num_samples == 1:
# [T, H, W, C] -> [C, T, H, W]
return self.transforms(torch.from_numpy(clips[0])), record.label
else:
# [S, T, H, W, C] -> [S, C, T, H, W]
return (
torch.stack(
[self.transforms(torch.from_numpy(c)) for c in clips]
),
record.label,
)
def _show_batch(
self,
batch: List[torch.tensor],
sample_length: int,
mean: Tuple[int, int, int] = DEFAULT_MEAN,
std: Tuple[int, int, int] = DEFAULT_STD,
) -> None:
"""
Display a batch of images.
Args:
batch: List of sample (clip) tensors
sample_length: Number of frames to show for each sample
mean: Normalization mean
std: Normalization std-dev
"""
batch_size = len(batch)
plt.tight_layout()
fig, axs = plt.subplots(
batch_size,
sample_length,
figsize=(4 * sample_length, 3 * batch_size),
)
for i, ax in enumerate(axs):
if batch_size == 1:
clip = batch[0]
else:
clip = batch[i]
clip = Rearrange("c t h w -> t c h w")(clip)
if not isinstance(ax, np.ndarray):
ax = [ax]
for j, a in enumerate(ax):
a.axis("off")
a.imshow(
np.moveaxis(denormalize(clip[j], mean, std).numpy(), 0, -1)
)
pass
def show_batch(self, train_or_test: str = "train", rows: int = 1) -> None:
"""Plot first few samples in the datasets"""
if train_or_test == "train":
batch = [self.train_ds.dataset[i][0] for i in range(rows)]
elif train_or_test == "valid":
batch = [self.test_ds.dataset[i][0] for i in range(rows)]
else:
raise ValueError("Unknown data type {}".format(train_or_test))
self._show_batch(batch, self.sample_length)
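Putting it together, a hedged end-to-end sketch of constructing the refactored VideoDataset from split files and previewing a batch (directory and file names are hypothetical):

data = VideoDataset(
    root="videos",                        # folder holding the video files
    train_split_file="train_split.txt",   # rows of "relative/clip_name label" (extension added via video_ext)
    test_split_file="test_split.txt",
    sample_length=8,
    batch_size=8,
)
print(len(data.train_ds), len(data.test_ds))
data.show_batch(train_or_test="train", rows=2)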


@ -5,198 +5,137 @@ from collections import OrderedDict
import os
import time
import warnings
from typing import Union
from pathlib import Path
try:
from apex import amp
AMP_AVAILABLE = True
except ModuleNotFoundError:
AMP_AVAILABLE = False
from IPython.core.debugger import set_trace
import numpy as np
import torch
import torch.cuda as cuda
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
import torchvision
from . import Config
from .data import (
DEFAULT_MEAN,
DEFAULT_STD,
show_batch as _show_batch,
VideoDataset,
)
from ..common.misc import Config
from ..common.gpu import torch_device, num_devices
from .dataset import VideoDataset
from .metrics import accuracy, AverageMeter
from .references.metrics import accuracy, AverageMeter
# From https://github.com/moabitcoin/ig65m-pytorch
TORCH_R2PLUS1D = "moabitcoin/ig65m-pytorch"
# These parameters are set so that we can use torch hub to download pretrained
# models from the specified repo
TORCH_R2PLUS1D = "moabitcoin/ig65m-pytorch" # From https://github.com/moabitcoin/ig65m-pytorch
MODELS = {
# model: output classes
'r2plus1d_34_32_ig65m': 359,
'r2plus1d_34_32_kinetics': 400,
'r2plus1d_34_8_ig65m': 487,
'r2plus1d_34_8_kinetics': 400,
# Model name followed by the number of output classes.
"r2plus1d_34_32_ig65m": 359,
"r2plus1d_34_32_kinetics": 400,
"r2plus1d_34_8_ig65m": 487,
"r2plus1d_34_8_kinetics": 400,
}
class R2Plus1D(object):
def __init__(self, cfgs):
self.configs = Config(cfgs)
self.train_ds, self.valid_ds = self.load_datasets(self.configs)
self.model = self.init_model(
self.configs.sample_length,
self.configs.base_model,
self.configs.num_classes
class VideoLearner(object):
""" Video recognition learner object that handles training loop and evaluation. """
def __init__(
self,
dataset: VideoDataset,
num_classes: int,  # e.g. 51 for hmdb51
base_model: str = "ig65m",  # or "kinetics"
) -> None:
""" By default, the Video Learner will use an R2plus1D model. Pass in
a dataset of type VideoDataset and the Video Learner will initialize
the model.
Args:
dataset: the dataset to use for this model
num_classes: the number of actions/classifications
base_model: the R2plus1D model is based on either ig65m or
kinetics. By default it will use the weights from ig65m since they
tend to attain higher accuracy.
"""
self.dataset = dataset
self.model, self.model_name = self.init_model(
self.dataset.sample_length, base_model, num_classes,
)
self.model_name = "r2plus1d_34_{}_{}".format(self.configs.sample_length, self.configs.base_model)
@staticmethod
def init_model(sample_length, base_model, num_classes=None):
if base_model not in ('ig65m', 'kinetics'):
def init_model(
sample_length: int, base_model: str, num_classes: int = None
) -> torchvision.models.video.resnet.VideoResNet:
"""
Initializes the model by loading it using torch's `hub.load`
functionality. Uses the model from TORCH_R2PLUS1D.
Args:
sample_length: Number of consecutive frames to sample from a video (i.e. clip length).
base_model: the R2plus1D model is based on either ig65m or kinetics.
num_classes: the number of classes/actions
Returns:
The loaded model with pretrained weights, together with the model name.
"""
if base_model not in ("ig65m", "kinetics"):
raise ValueError(
"Not supported model {}. Should be 'ig65m' or 'kinetics'"
.format(base_model)
f"Not supported model {base_model}. Should be 'ig65m' or 'kinetics'"
)
# Decide whether to use the pretrained weights from the DNN trained on 8 or on 32 frames
if sample_length<=8:
if sample_length <= 8:
model_sample_length = 8
else:
model_sample_length = 32
model_name = "r2plus1d_34_{}_{}".format(model_sample_length, base_model)
print("Loading {} model".format(model_name))
model_name = f"r2plus1d_34_{model_sample_length}_{base_model}"
print(f"Loading {model_name} model")
model = torch.hub.load(
TORCH_R2PLUS1D, model_name, num_classes=MODELS[model_name], pretrained=True
TORCH_R2PLUS1D,
model_name,
num_classes=MODELS[model_name],
pretrained=True,
)
# Replace head
if num_classes is not None:
model.fc = nn.Linear(model.fc.in_features, num_classes)
return model
@staticmethod
def load_datasets(cfgs):
"""Load VideoDataset
return model, model_name
Args:
cfgs (dict or Config): Dataset configuration. For validation dataset,
data augmentation such as random shift and temporal jitter is not used.
Return:
VideoDataset, VideoDataset: Train and validation datasets.
If split file is not provided, returns None.
"""
cfgs = Config(cfgs)
train_split = cfgs.get('train_split', None)
train_ds = None if train_split is None else VideoDataset(
split_file=train_split,
video_dir=cfgs.video_dir,
num_segments=1,
sample_length=cfgs.sample_length,
sample_step=cfgs.get('temporal_jitter_step', cfgs.get('sample_step', 1)),
input_size=112,
im_scale=cfgs.get('im_scale', 128),
resize_keep_ratio=cfgs.get('resize_keep_ratio', True),
mean=cfgs.get('mean', DEFAULT_MEAN),
std=cfgs.get('std', DEFAULT_STD),
random_shift=cfgs.get('random_shift', True),
temporal_jitter=True if cfgs.get('temporal_jitter_step', 0) > 0 else False,
flip_ratio=cfgs.get('flip_ratio', 0.5),
random_crop=cfgs.get('random_crop', True),
random_crop_scales=cfgs.get('random_crop_scales', (0.6, 1.0)),
video_ext=cfgs.video_ext,
)
valid_split = cfgs.get('valid_split', None)
valid_ds = None if valid_split is None else VideoDataset(
split_file=valid_split,
video_dir=cfgs.video_dir,
num_segments=1,
sample_length=cfgs.sample_length,
sample_step=cfgs.get('sample_step', 1),
input_size=112,
im_scale=cfgs.get('im_scale', 128),
resize_keep_ratio=True,
mean=cfgs.get('mean', DEFAULT_MEAN),
std=cfgs.get('std', DEFAULT_STD),
random_shift=False,
temporal_jitter=False,
flip_ratio=0.0,
random_crop=False, # == Center crop
random_crop_scales=None,
video_ext=cfgs.video_ext,
)
return train_ds, valid_ds
def show_batch(self, which_data='train', num_samples=1):
"""Plot first few samples in the datasets"""
if which_data == 'train':
batch = [self.train_ds[i][0] for i in range(num_samples)]
elif which_data == 'valid':
batch = [self.valid_ds[i][0] for i in range(num_samples)]
else:
raise ValueError("Unknown data type {}".format(which_data))
_show_batch(
batch,
self.configs.sample_length,
mean=self.configs.get('mean', DEFAULT_MEAN),
std=self.configs.get('std', DEFAULT_STD),
)
def freeze(self):
def freeze(self) -> None:
"""Freeze model except the last layer"""
self._set_requires_grad(False)
for param in self.model.fc.parameters():
param.requires_grad = True
def unfreeze(self):
def unfreeze(self) -> None:
self._set_requires_grad(True)
def _set_requires_grad(self, requires_grad=True):
def _set_requires_grad(self, requires_grad=True) -> None:
for param in self.model.parameters():
param.requires_grad = requires_grad
def fit(self, train_cfgs):
def fit(self, train_cfgs) -> None:
""" The primary fit function """
train_cfgs = Config(train_cfgs)
model_dir = train_cfgs.get('model_dir', "checkpoints")
model_dir = train_cfgs.get("model_dir", "checkpoints")
os.makedirs(model_dir, exist_ok=True)
if cuda.is_available():
device = torch.device("cuda")
num_devices = cuda.device_count()
# Look for the optimal set of algorithms to use in cudnn. Use this only with fixed-size inputs.
torch.backends.cudnn.benchmark = True
else:
device = torch.device("cpu")
num_devices = 1
data_loaders = {}
if self.train_ds is not None:
data_loaders['train'] = DataLoader(
self.train_ds,
batch_size=train_cfgs.get('batch_size', 8) * num_devices,
shuffle=True,
num_workers=0, # Torch 1.2 has a bug when num-workers > 0 (0 means run a main-processor worker)
pin_memory=True,
)
if self.valid_ds is not None:
data_loaders['valid'] = DataLoader(
self.valid_ds,
batch_size=train_cfgs.get('batch_size', 8) * num_devices,
shuffle=False,
num_workers=0,
pin_memory=True,
)
data_loaders["train"] = self.dataset.train_dl
data_loaders["valid"] = self.dataset.test_dl
# Move model to gpu before constructing optimizers and amp.initialize
device = torch_device()
self.model.to(device)
count_devices = num_devices()
torch.backends.cudnn.benchmark = True
named_params_to_update = {}
total_params = 0
@ -210,19 +149,22 @@ class R2Plus1D(object):
print("\tfull network")
else:
for name in named_params_to_update:
print("\t{}".format(name))
print(f"\t{name}")
momentum=train_cfgs.get('momentum', 0.95)
# create optimizer
momentum = train_cfgs.get("momentum", 0.95)
optimizer = optim.SGD(
list(named_params_to_update.values()),
lr=train_cfgs.lr,
momentum=momentum,
weight_decay=train_cfgs.get('weight_decay', 0.0001),
weight_decay=train_cfgs.get("weight_decay", 0.0001),
)
# Use mixed-precision if available
# Currently, only O1 works with DataParallel: See issues https://github.com/NVIDIA/apex/issues/227
if train_cfgs.get('mixed_prec', False) and AMP_AVAILABLE:
if train_cfgs.get("mixed_prec", False):
# break if not AMP_AVAILABLE
assert AMP_AVAILABLE
# 'O0': Full FP32, 'O1': Conservative, 'O2': Standard, 'O3': Full FP16
self.model, optimizer = amp.initialize(
self.model,
@ -233,36 +175,34 @@ class R2Plus1D(object):
)
# Learning rate scheduler
if train_cfgs.get('use_one_cycle_policy', False):
if train_cfgs.get("use_one_cycle_policy", False):
# Use warmup with the one-cycle policy
scheduler = torch.optim.lr_scheduler.OneCycleLR(
optimizer,
max_lr=train_cfgs.lr,
total_steps=train_cfgs.epochs,
pct_start=train_cfgs.get('warmup_pct', 0.3),
base_momentum=0.9*momentum,
pct_start=train_cfgs.get("warmup_pct", 0.3),
base_momentum=0.9 * momentum,
max_momentum=momentum,
)
else:
# Simple step-decay
scheduler = torch.optim.lr_scheduler.StepLR(
optimizer,
step_size=train_cfgs.get('lr_step_size', float("inf")),
gamma=train_cfgs.get('lr_gamma', 0.1),
step_size=train_cfgs.get("lr_step_size", float("inf")),
gamma=train_cfgs.get("lr_gamma", 0.1),
)
# DataParallel after amp.initialize
if num_devices > 1:
model = nn.DataParallel(self.model)
else:
model = self.model
model = (
nn.DataParallel(self.model) if count_devices > 1 else self.model
)
criterion = nn.CrossEntropyLoss().to(device)
for e in range(1, train_cfgs.epochs + 1):
print("Epoch {} ==========".format(e))
if scheduler is not None:
print("lr={}".format(scheduler.get_lr()))
print(f"Epoch {e} ==========")
print(f"lr={scheduler.get_lr()}")
self.train_an_epoch(
model,
@ -276,14 +216,16 @@ class R2Plus1D(object):
scheduler.step()
if train_cfgs.get('save_models', False):
if train_cfgs.get("save_models", False):
self.save(
os.path.join(
model_dir,
"{model_name}_{epoch}.pt".format(
model_name=train_cfgs.get('model_name', self.model_name),
epoch=str(e).zfill(3)
)
model_name=train_cfgs.get(
"model_name", self.model_name
),
epoch=str(e).zfill(3),
),
)
)
@ -296,29 +238,35 @@ class R2Plus1D(object):
optimizer,
grad_steps=1,
mixed_prec=False,
):
) -> None:
"""Train / validate a model for one epoch.
:param model:
:param data_loaders: dict {'train': train_dl, 'valid': valid_dl}
:param device:
:param criterion:
:param optimizer:
:param grad_steps: If > 1, use gradient accumulation. Useful for larger batching
:param mixed_prec: If True, use FP16 + FP32 mixed precision via NVIDIA apex.amp
:return: dict {
'train/time': batch_time.avg,
'train/loss': losses.avg,
'train/top1': top1.avg,
'train/top5': top5.avg,
'valid/time': ...
}
Args:
model: the model to train
data_loaders: dict {'train': train_dl, 'valid': valid_dl}
device: torch.device to run on (GPU if available, otherwise CPU)
criterion: loss function, e.g. nn.CrossEntropyLoss
optimizer: torch optimizer, e.g. optim.SGD
grad_steps: If > 1, use gradient accumulation. Useful for simulating larger batch sizes
mixed_prec: If True, use FP16 + FP32 mixed precision via NVIDIA apex.amp
Return:
dict {
'train/time': batch_time.avg,
'train/loss': losses.avg,
'train/top1': top1.avg,
'train/top5': top5.avg,
'valid/time': ...
}
"""
assert "train" in data_loaders
if mixed_prec and not AMP_AVAILABLE:
warnings.warn(
"NVIDIA apex module is not installed. Cannot use mixed-precision."
"""
NVIDIA apex module is not installed. Cannot use
mixed-precision. Turning off mixed-precision.
"""
)
mixed_prec = False
result = OrderedDict()
for phase in ["train", "valid"]:
@ -356,8 +304,10 @@ class R2Plus1D(object):
# scale the loss so the accumulated gradient matches the scale of an update without accumulation
loss = loss / grad_steps
if mixed_prec and AMP_AVAILABLE:
with amp.scale_loss(loss, optimizer) as scaled_loss:
if mixed_prec:
with amp.scale_loss(
loss, optimizer
) as scaled_loss:
scaled_loss.backward()
else:
loss.backward()
@ -371,30 +321,26 @@ class R2Plus1D(object):
end = time.time()
print(
"{} took {:.2f} sec: loss = {:.4f}, top1_acc = {:.4f}, top5_acc = {:.4f}".format(
phase, batch_time.sum, losses.avg, top1.avg, top5.avg
)
f"{phase} took {batch_time.sum:.2f} sec: loss = {losses.avg:.4f}, top1_acc = {top1.avg:.4f}, top5_acc = {top5.avg:.4f}"
)
result["{}/time".format(phase)] = batch_time.sum
result["{}/loss".format(phase)] = losses.avg
result["{}/top1".format(phase)] = top1.avg
result["{}/top5".format(phase)] = top5.avg
result[f"{phase}/time"] = batch_time.sum
result[f"{phase}/loss"] = losses.avg
result[f"{phase}/top1"] = top1.avg
result[f"{phase}/top5"] = top5.avg
return result
def save(self, model_path):
torch.save(
self.model.state_dict(),
model_path
)
def save(self, model_path: Union[Path, str]) -> None:
""" Save the model to a path on disk. """
torch.save(self.model.state_dict(), model_path)
def load(self, model_name, model_dir="checkpoints"):
def load(self, model_name: str, model_dir: str = "checkpoints") -> None:
"""
TODO accept epoch. If None, load the latest model.
:param model_name: Model name in the format '<name>_<epoch>', with the epoch zero-padded to three digits (e.g. 'name_003'), matching the files written by save()
:param model_dir: By default, 'checkpoints'
:return:
"""
self.model.load_state_dict(torch.load(
os.path.join(model_dir, "{}.pt".format(model_name))
))
self.model.load_state_dict(
torch.load(os.path.join(model_dir, f"{model_name}.pt"))
)
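As a side note for readers of this diff: train_an_epoch above combines gradient accumulation with optional apex mixed precision. A minimal, self-contained sketch of the accumulation pattern (illustrative names and hyperparameters, not the repo's exact code) looks like this:

# Minimal sketch of gradient accumulation, assuming a standard classification setup.
import torch
import torch.nn as nn
import torch.optim as optim

def train_with_accumulation(model, data_loader, device, grad_steps=4, lr=0.001):
    criterion = nn.CrossEntropyLoss().to(device)
    optimizer = optim.SGD(model.parameters(), lr=lr, momentum=0.95)
    model.train()
    optimizer.zero_grad()
    for step, (inputs, targets) in enumerate(data_loader, start=1):
        inputs, targets = inputs.to(device), targets.to(device)
        loss = criterion(model(inputs), targets)
        # divide by grad_steps so the accumulated gradient matches a single large-batch update
        (loss / grad_steps).backward()
        if step % grad_steps == 0:
            optimizer.step()
            optimizer.zero_grad()

With mixed precision enabled, the backward call would instead go through amp.scale_loss(loss, optimizer), as in the hunk above.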


@ -20,7 +20,7 @@ def crop(clip, i, j, h, w):
clip (torch.tensor): Video clip to be cropped. Size is (C, T, H, W)
"""
assert len(clip.size()) == 4, "clip should be a 4D tensor"
return clip[..., i:i + h, j:j + w]
return clip[..., i : i + h, j : j + w]
def resize(clip, target_size, interpolation_mode):
@ -53,7 +53,9 @@ def center_crop(clip, crop_size):
assert _is_tensor_video_clip(clip), "clip should be a 4D torch.tensor"
h, w = clip.size(-2), clip.size(-1)
th, tw = crop_size
assert h >= th and w >= tw, "height and width must be no smaller than crop_size"
assert (
h >= th and w >= tw
), "height and width must be no smaller than crop_size"
i = int(round((h - th) / 2.0))
j = int(round((w - tw) / 2.0))
@ -71,7 +73,9 @@ def to_tensor(clip):
"""
assert _is_tensor_video_clip(clip), "clip should be a 4D torch.tensor"
if not clip.dtype == torch.uint8:
raise TypeError("clip tensor should have data type uint8. Got %s" % str(clip.dtype))
raise TypeError(
"clip tensor should have data type uint8. Got %s" % str(clip.dtype)
)
return clip.float().permute(3, 0, 1, 2) / 255.0
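To make the expected tensor layout concrete: to_tensor takes a uint8 clip shaped (T, H, W, C) and returns a float clip shaped (C, T, H, W) in [0, 1], which is what crop and center_crop then slice. A small illustration with dummy shapes (not from the repo):

import torch

clip_thwc = torch.randint(0, 256, (16, 128, 171, 3), dtype=torch.uint8)  # (T, H, W, C)
clip_cthw = clip_thwc.float().permute(3, 0, 1, 2) / 255.0                # (C, T, H, W), values in [0, 1]
cropped = clip_cthw[..., 8:8 + 112, 30:30 + 112]                          # same slicing as crop()
print(cropped.shape)  # torch.Size([3, 16, 112, 112])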


@ -5,6 +5,7 @@ import torch
class AverageMeter(object):
"""Computes and stores the average and current value"""
def __init__(self):
self.reset()


@ -37,13 +37,19 @@ class ResizeVideo(object):
size = (int(self.size), int(self.size))
else:
if self.keep_ratio:
scale = min(self.size[0] / clip.shape[-2], self.size[1] / clip.shape[-1], )
scale = min(
self.size[0] / clip.shape[-2],
self.size[1] / clip.shape[-1],
)
else:
size = self.size
return nn.functional.interpolate(
clip, size=size, scale_factor=scale,
mode=self.interpolation_mode, align_corners=False
clip,
size=size,
scale_factor=scale,
mode=self.interpolation_mode,
align_corners=False,
)
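The keep_ratio branch above resizes both spatial dims by the same factor. A hedged, standalone sketch of that behaviour with torch.nn.functional.interpolate (dummy shapes, and mode="bilinear" is an assumption; ResizeVideo's constructor arguments are not shown in this hunk):

import torch
import torch.nn as nn

clip = torch.rand(3, 16, 128, 171)  # (C, T, H, W)
target = (112, 112)
scale = min(target[0] / clip.shape[-2], target[1] / clip.shape[-1])
resized = nn.functional.interpolate(
    clip, scale_factor=scale, mode="bilinear", align_corners=False
)
print(resized.shape)  # both H and W scaled by the same factor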
@ -66,7 +72,7 @@ class RandomCropVideo(object):
return F.crop(clip, i, j, h, w)
def __repr__(self):
return self.__class__.__name__ + '(size={0})'.format(self.size)
return self.__class__.__name__ + "(size={0})".format(self.size)
@staticmethod
def get_params(clip, output_size):
@ -116,13 +122,17 @@ class RandomResizedCropVideo(object):
size is (C, T, H, W)
"""
i, j, h, w = self.get_params(clip, self.scale, self.ratio)
return F.resized_crop(clip, i, j, h, w, self.size, self.interpolation_mode)
return F.resized_crop(
clip, i, j, h, w, self.size, self.interpolation_mode
)
def __repr__(self):
return self.__class__.__name__ + \
'(size={0}, interpolation_mode={1}, scale={2}, ratio={3})'.format(
return (
self.__class__.__name__
+ "(size={0}, interpolation_mode={1}, scale={2}, ratio={3})".format(
self.size, self.interpolation_mode, self.scale, self.ratio
)
)
@staticmethod
def get_params(clip, scale, ratio):
@ -187,7 +197,7 @@ class CenterCropVideo(object):
return F.center_crop(clip, self.size)
def __repr__(self):
return self.__class__.__name__ + '(size={0})'.format(self.size)
return self.__class__.__name__ + "(size={0})".format(self.size)
class NormalizeVideo(object):
@ -212,8 +222,12 @@ class NormalizeVideo(object):
return F.normalize(clip, self.mean, self.std, self.inplace)
def __repr__(self):
return self.__class__.__name__ + '(mean={0}, std={1}, inplace={2})'.format(
self.mean, self.std, self.inplace)
return (
self.__class__.__name__
+ "(mean={0}, std={1}, inplace={2})".format(
self.mean, self.std, self.inplace
)
)
class ToTensorVideo(object):


@ -70,7 +70,7 @@ def read_classes_file(classes_filepath):
classes = {}
with open(classes_filepath) as class_file:
for line in class_file:
class_name, class_id = line.split(' ')
class_name, class_id = line.split(" ")
classes[class_name] = class_id.rstrip()
return classes
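The file read here is assumed to hold one space-separated 'class_name class_id' pair per line, so a hypothetical classes.txt round-trips like this (made-up class names; note the ids come back as strings):

# Illustrative only: made-up class names and ids.
with open("classes.txt", "w") as f:
    f.write("drinking 0\n")
    f.write("no_action 1\n")

print(read_classes_file("classes.txt"))  # {'drinking': '0', 'no_action': '1'}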
@ -87,7 +87,7 @@ def create_clip_file_name(row, clip_file_format="mp4"):
:return: str.
The output clip file name.
"""
#video_file = ast.literal_eval(row.file_list)[0]
# video_file = ast.literal_eval(row.file_list)[0]
video_file = os.path.splitext(row["file_list"])[0]
clip_id = row["# CSV_HEADER = metadata_id"]
clip_file = "{}_{}.{}".format(video_file, clip_id, clip_file_format)
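With hypothetical row values, the resulting clip file name follows the '<video>_<clip_id>.<format>' pattern built above:

# Illustrative only: a made-up row.
import os

row = {"file_list": "video1.mp4", "# CSV_HEADER = metadata_id": "3"}
video_file = os.path.splitext(row["file_list"])[0]
print("{}_{}.{}".format(video_file, row["# CSV_HEADER = metadata_id"], "mp4"))  # video1_3.mp4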
@ -477,14 +477,14 @@ def extract_contiguous_negative_clips(
# video_path = os.path.join(video_dir, negative_sample_file)
video_fname = os.path.splitext(os.path.basename(video_file_path))[0]
clip_fname = video_fname+no_action_class+str(i)
clip_fname = video_fname + no_action_class + str(i)
clip_subdir_fname = os.path.join(no_action_class, clip_fname)
negative_clip_file_list.append(clip_subdir_fname)
_extract_clip_ffmpeg(
start_time,
duration,
video_file_path,
os.path.join(negative_clip_dir, clip_fname+"."+clip_format),
os.path.join(negative_clip_dir, clip_fname + "." + clip_format),
ffmpeg_path,
)
@ -496,6 +496,7 @@ def extract_contiguous_negative_clips(
}
)
def extract_sampled_negative_clips(
video_info_df,
num_negative_samples,
@ -548,7 +549,9 @@ def extract_sampled_negative_clips(
clips_sampled = 0
while clips_sampled < num_negative_samples:
# pick random file in list of videos
negative_sample_file = video_files[random.randint(0, len(video_files)-1)]
negative_sample_file = video_files[
random.randint(0, len(video_files) - 1)
]
# get video duration
duration = video_len[negative_sample_file]
# pick random start time for clip
@ -559,15 +562,27 @@ def extract_sampled_negative_clips(
# check that the negative clip doesn't overlap any positive clip; if it does, pick another file
if negative_sample_file in positive_intervals.keys():
clip_positive_intervals = positive_intervals[negative_sample_file]
if check_interval_overlaps(clip_start, clip_end, clip_positive_intervals):
if check_interval_overlaps(
clip_start, clip_end, clip_positive_intervals
):
continue
video_path = os.path.join(video_dir, negative_sample_file)
video_fname = os.path.splitext(negative_sample_file)[0]
clip_fname = video_fname+no_action_class+str(clips_sampled)
clip_fname = video_fname + no_action_class + str(clips_sampled)
clip_subdir_fname = os.path.join(no_action_class, clip_fname)
_extract_clip_ffmpeg(
clip_start, negative_clip_length, video_path, os.path.join(clip_dir, clip_subdir_fname+"."+clip_format),
clip_start,
negative_clip_length,
video_path,
os.path.join(clip_dir, clip_subdir_fname + "." + clip_format),
)
with open(label_filepath, 'a') as f:
f.write("\""+clip_subdir_fname+"\""+" "+str(classes[no_action_class])+"\n")
with open(label_filepath, "a") as f:
f.write(
'"'
+ clip_subdir_fname
+ '"'
+ " "
+ str(classes[no_action_class])
+ "\n"
)
clips_sampled += 1
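The rejection step above reduces to an interval-intersection test. A minimal sketch of that idea (assumed behaviour only; the real check_interval_overlaps signature may differ):

# Assumed behaviour, not the repo's exact implementation.
def intervals_overlap(clip_start, clip_end, positive_intervals):
    """Return True if [clip_start, clip_end] intersects any (start, end) interval."""
    return any(
        clip_start < pos_end and clip_end > pos_start
        for pos_start, pos_end in positive_intervals
    )

print(intervals_overlap(5.0, 7.0, [(0.0, 4.0), (6.5, 10.0)]))  # True, overlaps (6.5, 10.0)
print(intervals_overlap(4.0, 6.0, [(0.0, 3.0), (7.0, 10.0)]))  # False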


@ -28,7 +28,9 @@ class Urls:
base, "fridgeObjectsWatermarkTiny.zip"
)
fridge_objects_negatives_path = urljoin(base, "fridgeObjectsNegative.zip")
fridge_objects_negatives_tiny_path = urljoin(base, "fridgeObjectsNegativeTiny.zip")
fridge_objects_negatives_tiny_path = urljoin(
base, "fridgeObjectsNegativeTiny.zip"
)
# multilabel datasets
multilabel_fridge_objects_path = urljoin(


@ -3,8 +3,10 @@
import os
import platform
import sys
import torch
import torch.cuda as cuda
import torchvision
from torch.cuda import current_device, get_device_name, is_available
@ -47,6 +49,15 @@ def torch_device():
)
def num_devices():
""" Gets the number of devices based on cpu/gpu """
return (
torch.cuda.device_count()
if torch.cuda.is_available()
else 1
)
def db_num_workers(non_windows_num_workers: int = 16):
"""Returns how many workers to use when loading images in a databunch. On windows machines using >0 works significantly slows down model
training and evaluation. Setting num_workers to zero on Windows machines will speed up training/inference significantly, but will still be
@ -58,3 +69,15 @@ def db_num_workers(non_windows_num_workers: int = 16):
return 0
else:
return non_windows_num_workers
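A usage sketch with a plain PyTorch DataLoader and a dummy dataset (on Windows this resolves to num_workers=0, elsewhere to 16 by default):

import torch
from torch.utils.data import DataLoader, TensorDataset

ds = TensorDataset(torch.rand(32, 3, 8, 8), torch.randint(0, 2, (32,)))
dl = DataLoader(ds, batch_size=8, shuffle=True, num_workers=db_num_workers())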
def system_info():
print(sys.version, "\n")
print(f"PyTorch {torch.__version__} \n")
print(f"Torch-vision {torchvision.__version__} \n")
print("Available devices:")
if cuda.is_available():
for i in range(cuda.device_count()):
print(f"{i}: {cuda.get_device_name(i)}")
else:
print("CPUs only, no GPUs found")


@ -83,3 +83,34 @@ def get_font(size: int = 12) -> ImageFont:
font = None
return font
class Config(object):
def __init__(self, config=None, **extras):
"""Dictionary wrapper to access keys as attributes.
Args:
config (dict or Config): Configurations
extras (kwargs): Extra configurations
Examples:
>>> cfg = Config({'lr': 0.01}, momentum=0.95)
or
>>> cfg = Config({'lr': 0.01, 'momentum': 0.95})
then, use as follows:
>>> print(cfg.lr, cfg.momentum)
"""
if config is not None:
if isinstance(config, dict):
for k in config:
setattr(self, k, config[k])
elif isinstance(config, self.__class__):
self.__dict__ = config.__dict__.copy()
else:
raise ValueError("Unknown config")
for k, v in extras.items():
setattr(self, k, v)
def get(self, key, default):
return getattr(self, key, default)
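This Config wrapper is what backs the train_cfgs.get(...) calls earlier in this diff; in short, it behaves like this:

cfg = Config({"lr": 0.001, "epochs": 8}, mixed_prec=True)
print(cfg.lr, cfg.epochs, cfg.mixed_prec)  # attribute access: 0.001 8 True
print(cfg.get("momentum", 0.95))           # missing key falls back to the default: 0.95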


@ -109,7 +109,13 @@ class CocoEvaluator(object):
labels = prediction["labels"].tolist()
rles = [
mask_util.encode(np.array(mask[0, :, :, np.newaxis], dtype=np.uint8, order="F"))[0]
mask_util.encode(
# Workaround for the mask-encoding issue:
# https://github.com/pytorch/vision/issues/1355#issuecomment-544951911
np.array(
mask[0, :, :, np.newaxis], dtype=np.uint8, order="F"
)
)[0]
for mask in masks
]
for rle in rles: