Mirror of https://github.com/microsoft/EdgeML.git
Merge RNNPool Codes (#201)
* added bidirectional * bidirectional in BaseRNN * updated all rnn for new bidirectional * debugging for fastgrnncuda * fix for fastgrnncuda * visual wakeword * visual wakewords evaluation * visual wakeword evaluation * updated readme for eval * face detection * update face detection * rnn edit * test update * model loading change * update readme * update eval tools in readme * update train * readme * readme * readme * requirements * requirements * readme * rnn * update s3fd_net * update s3fd_net * update s3fd_net * train * train * add additional args * add additional args * arg changes in wider_test * added arg for using new ckpt * readme * remove old modelsupport * readme * readme * requirements * readme * eval on all format * support for calculating MAP * update readme * Update README.md * readme * readme * readme * readme * fix for warnings * readme * readme scores * add dump weights and traces support * readme * remove eval warnings * eval remove import warnings * readme changes * readme changes * support for qvga monochrome * readme update * readme update * Update README.md * readme update * environment key update * config files * update both config files text * change architecture * readme update * quantized cpp rnnpool * Update README.md * Update README.md * Update README.md * smaller model for qvga * Update RPool_Face_QVGA_monochrome.py * update to ssd code * update to init * update to init * update to dataloader * tf code for face detection * tf code for face detection * add tf face detection code * eval file * fix weights and detect function * Delete factory.py * Update RPool_Face_QVGA_monochrome.py * Update RPool_Face_QVGA_monochrome.py * Update RPool_Face_C.py * Update RPool_Face_Quant.py * Update augmentations.py * Update detection.py * Update eval.py * vww updates * removed tf code * Update fastcell_example.py * Update widerface.py * Update rnnpool.py * Update fastTrainer.py * Update fastTrainer.py * Update fastTrainer.py * Update model_mobilenet_2rnnpool.py * Update model_mobilenet_rnnpool.py * Delete top_level.txt * Delete dependency_links.txt * remove egg-info * Update fastcell_example.py * file copyright edits * Update README.md * delete output blank file * remove input trace file Co-authored-by: Harsha Vardhan Simhadri <harsha-simhadri@users.noreply.github.com> Co-authored-by: Ubuntu <harshasi@GPUnode1.n14uw44gbsdu3cvfvcchcfucod.xx.internal.cloudapp.net> Co-authored-by: Harsha Vardhan Simhadri <harshasi@microsoft.com>
This commit is contained in:
Parent
2c89850fff
Commit
373a7a14ee
@@ -0,0 +1,146 @@

# Code for Face Detection experiments with RNNPool

## Requirements

1. Follow the instructions [here](https://github.com/microsoft/EdgeML/blob/master/pytorch/README.md) to install the EdgeML PyTorch library and the requirements for its operators.

2. Install the requirements for the face detection model using

``` pip install -r requirements.txt ```

We have tested the installation and the code on Ubuntu 18.04 with CUDA 10.2 and cuDNN 7.6.

## Dataset

1. Download the WIDER face dataset images and annotations from http://shuoyang1213.me/WIDERFACE/ and place them all in a folder named 'WIDER_FACE'. That is, download WIDER_train.zip, WIDER_test.zip, WIDER_val.zip, and wider_face_split.zip, place them in the WIDER_FACE folder, and unzip the files using:

```shell
cd WIDER_FACE
unzip WIDER_train.zip
unzip WIDER_test.zip
unzip WIDER_val.zip
unzip wider_face_split.zip
cd ..
```

2. In `data/config.py`, set `_C.HOME` to the parent directory of the above folder, and set `_C.FACE.WIDER_DIR` to the folder path. That is, if the WIDER_FACE folder is created in the /mnt folder, then `_C.HOME='/mnt'` and `_C.FACE.WIDER_DIR='/mnt/WIDER_FACE'`. Similarly, change `data/config_qvga.py` to set `_C.HOME` and `_C.FACE.WIDER_DIR`, as in the sketch below.
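
For example (illustrative paths; the flagged lines already exist in the config files):

```python
# in data/config.py (make the same edits in data/config_qvga.py)
_C.HOME = '/mnt'                       # parent directory of the WIDER_FACE folder
_C.FACE.WIDER_DIR = '/mnt/WIDER_FACE'  # the WIDER_FACE folder itself
```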

3. Run

``` python prepare_wider_data.py ```


# Usage

## Training

```shell
IS_QVGA_MONO=0 python train.py --batch_size 32 --model_arch RPool_Face_Quant --cuda True --multigpu True --save_folder weights/ --epochs 300 --save_frequency 5000
```

For QVGA:
```shell
IS_QVGA_MONO=1 python train.py --batch_size 64 --model_arch RPool_Face_QVGA_monochrome --cuda True --multigpu True --save_folder weights/ --epochs 300 --save_frequency 5000
```

This will save checkpoints after every '--save_frequency' iterations in a weight file ending in 'checkpoint.pth', and the weights for the best state in a file ending in 'best_state.pth'. Both are saved in '--save_folder'. To resume training from a checkpoint, add '--resume <checkpoint_name>.pth' to the above command. For example,

```shell
IS_QVGA_MONO=1 python train.py --batch_size 64 --model_arch RPool_Face_QVGA_monochrome --cuda True --multigpu True --save_folder weights/ --epochs 300 --save_frequency 5000 --resume <checkpoint_name>.pth
```

If IS_QVGA_MONO is 0, training input images will be 640x640 RGB.
If IS_QVGA_MONO is 1, training input images will be 320x320 and converted to monochrome.

Input images for training are cropped and reshaped to squares to maintain consistency with [S3FD](https://arxiv.org/abs/1708.05237). Testing, however, can be done on images of any size: we resize each test image so that its area equals that of VGA (640x480) or QVGA (320x240), which keeps the aspect ratio unchanged (see the sketch below).
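
For example, the evaluation code computes an area-matching scale factor along these lines -- a sketch of the logic in `eval.py`, given an image array `img` of shape (height, width, channels):

```python
import numpy as np
import cv2

# Scale factor so the resized area equals VGA (640x480);
# the QVGA path uses 320*240 instead.
max_im_shrink = np.sqrt(640 * 480 / (img.shape[0] * img.shape[1]))
image = cv2.resize(img, None, None, fx=max_im_shrink,
                   fy=max_im_shrink, interpolation=cv2.INTER_LINEAR)
```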

The architecture RPool_Face_QVGA_monochrome is for the QVGA monochrome format, while RPool_Face_C and RPool_Face_Quant are for the VGA RGB format.


## Test
There are two modes of testing the trained model -- the evaluation mode to generate bounding boxes for a set of sample images, and the test mode to compute statistics like mAP scores.

#### Evaluation Mode

Given a set of images in <your_image_folder>, `eval.py` generates bounding boxes around faces (where the confidence is higher than a certain threshold) and writes the images to <your_save_folder>. To evaluate the `rpool_face_best_state.pth` model (stored in ./weights), execute the following command:

```shell
IS_QVGA_MONO=0 python eval.py --model_arch RPool_Face_Quant --model ./weights/RPool_Face_Quant_best_state.pth --image_folder <your_image_folder> --save_dir <your_save_folder>
```

For QVGA:
```shell
IS_QVGA_MONO=1 python eval.py --model_arch RPool_Face_QVGA_monochrome --model ./weights/RPool_Face_QVGA_monochrome_best_state.pth --image_folder <your_image_folder> --save_dir <your_save_folder>
```

This will save images in <your_save_folder> with bounding boxes drawn around faces where the confidence is high. Here is an example image with a single bounding box.

![Camera: Himax0360](imrgb20ft.png)

If IS_QVGA_MONO=0, the evaluation code accepts an image of any size and resizes it to 640x480x3 while preserving the original aspect ratio.

If IS_QVGA_MONO=1, the evaluation code accepts an image of any size, resizes it, and converts it to monochrome to make it 320x240x1, again preserving the original aspect ratio.

#### WIDER Set Test
In this mode, we test the generated model against the provided WIDER_FACE validation and test datasets.

For this, first run the following to generate the model's predictions and store the output in the '--save_folder' folder.

```shell
IS_QVGA_MONO=0 python wider_test.py --model_arch RPool_Face_Quant --model ./weights/RPool_Face_Quant_best_state.pth --save_folder rpool_face_quant_val --subset val
```

For QVGA:
```shell
IS_QVGA_MONO=1 python wider_test.py --model_arch RPool_Face_QVGA_monochrome --model ./weights/RPool_Face_QVGA_monochrome_best_state.pth --save_folder rpool_face_qvgamono_val --subset val
```

The above command generates predictions for each image in the "validation" dataset. For each image, a separate prediction file is produced (an image_name.txt file in the appropriate folder). The first line of the prediction file contains the total number of boxes identified. Each subsequent line corresponds to one identified box and holds five numbers: the length of the box, the height of the box, the x-axis offset, the y-axis offset, and the confidence value for the presence of a face in the box. A reader for this format might look like the sketch below.
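
As an illustrative sketch (not a script shipped with this repo), one such file can be parsed as:

```python
# Parse one prediction file: first line is the box count, then one box per
# line: length, height, x-offset, y-offset, confidence.
with open('image_name.txt') as f:
    num_boxes = int(f.readline())
    boxes = [tuple(map(float, line.split())) for line in f]
```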

If IS_QVGA_MONO=1, testing is done by converting images to monochrome QVGA; if IS_QVGA_MONO=0, testing is done on VGA RGB images.

The architecture RPool_Face_QVGA_monochrome is for the QVGA monochrome format, while RPool_Face_C and RPool_Face_Quant are for the VGA RGB format.

###### For calculating MAP scores:
Using these boxes, we can compute the standard MAP score that is widely used in this literature (see [here](https://medium.com/@jonathan_hui/map-mean-average-precision-for-object-detection-45c121a31173) for more details) as follows:

1. Download eval_tools.zip from http://shuoyang1213.me/WIDERFACE/support/eval_script/eval_tools.zip and unzip it into a folder of the same name in this directory.

Example code:

```shell
wget http://shuoyang1213.me/WIDERFACE/support/eval_script/eval_tools.zip
unzip eval_tools.zip
```

2. Set up scripts to use the Matlab '.mat' data files in the eval_tools/ground_truth folder for the MAP calculation. The following installs python files that provide the same functionality as the '.m' matlab scripts in the eval_tools folder.

```
cd eval_tools
git clone https://github.com/wondervictor/WiderFace-Evaluation.git
cd WiderFace-Evaluation
python3 setup.py build_ext --inplace
```

3. Run ```python3 evaluation.py -p <your_save_folder> -g <ground truth dir>``` in the WiderFace-Evaluation folder,

where <your_save_folder> is the '--save_folder' used for `wider_test.py` above and <ground truth dir> is the subfolder `eval_tools/ground_truth`. That is, in the WiderFace-Evaluation directory, run:

```shell
python3 evaluation.py -p <your_save_folder> -g ../ground_truth
```

This script should output the MAP for the WIDER-easy, WIDER-medium, and WIDER-hard subsets of the dataset. Our best performance using the RPool_Face_Quant model is: 0.80 (WIDER-easy), 0.78 (WIDER-medium), 0.53 (WIDER-hard).


##### Dump RNNPool Input Output Traces and Weights

To save the model weights and/or the input-output pairs for each patch passing through RNNPool in numpy format, use the command below. Put the images for which you want to save traces in <your_image_folder>. Specify the output folder for saving the model weights in numpy format in <your_save_model_numpy_folder>, and the output folder for saving the RNNPool input-output traces in numpy format in <your_save_traces_numpy_folder>. Note that input traces will be saved in a folder named 'inputs' and output traces in a folder named 'outputs' inside <your_save_traces_numpy_folder>.

```shell
python3 dump_model.py --model ./weights/RPool_Face_QVGA_monochrome_best_state.pth --model_arch RPool_Face_Quant --image_folder <your_image_folder> --save_model_npy_dir <your_save_model_numpy_folder> --save_traces_npy_dir <your_save_traces_numpy_folder>
```

If you wish to save only model weights, do not specify --save_traces_npy_dir. If you wish to save only traces, do not specify --save_model_npy_dir. The saved arrays can then be read back as sketched below.
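
For example (file names follow the naming in `dump_model.py`; folder names are the placeholders from the command above):

```python
import numpy as np

W1 = np.load('<your_save_model_numpy_folder>/W1.npy')  # first RNN's W matrix
patch_in = np.load('<your_save_traces_numpy_folder>/inputs/trace_0_0.npy')
rnn_out = np.load('<your_save_traces_numpy_folder>/outputs/trace_0_0.npy')
```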

The code has been built upon https://github.com/yxlijun/S3FD.pytorch.
@@ -0,0 +1,31 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT license.

from .widerface import WIDERDetection

from data.choose_config import cfg
cfg = cfg.cfg


import torch


def detection_collate(batch):
    """Custom collate fn for dealing with batches of images that have a
    different number of associated object annotations (bounding boxes).

    Arguments:
        batch: (tuple) A tuple of tensor images and lists of annotations

    Return:
        A tuple containing:
        1) (tensor) batch of images stacked on their 0 dim
        2) (list of tensors) annotations for a given image, stacked on their 0 dim
    """
    targets = []
    imgs = []
    for sample in batch:
        imgs.append(sample[0])
        targets.append(torch.FloatTensor(sample[1]))
    return torch.stack(imgs, 0), targets
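

# Illustrative usage sketch (not part of the original file): detection_collate
# plugs straight into a PyTorch DataLoader. WIDERDetection and cfg are
# imported above; the batch size here is arbitrary.
def _example_loader():
    from torch.utils.data import DataLoader
    dataset = WIDERDetection(cfg.FACE.TRAIN_FILE, mode='train')
    return DataLoader(dataset, batch_size=8, shuffle=True,
                      collate_fn=detection_collate)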

@@ -0,0 +1,15 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT license.

import os
from importlib import import_module

# IS_QVGA_MONO must be set in the environment ('0' or '1'); see the README.
IS_QVGA_MONO = os.environ['IS_QVGA_MONO']


name = 'config'
if IS_QVGA_MONO == '1':
    name = name + '_qvga'


cfg = import_module('data.' + name)

@@ -0,0 +1,65 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT license.

import os
from easydict import EasyDict
import numpy as np


_C = EasyDict()
cfg = _C

# data augmentation config
_C.expand_prob = 0.5
_C.expand_max_ratio = 4
_C.hue_prob = 0.5
_C.hue_delta = 18
_C.contrast_prob = 0.5
_C.contrast_delta = 0.5
_C.saturation_prob = 0.5
_C.saturation_delta = 0.5
_C.brightness_prob = 0.5
_C.brightness_delta = 0.125
_C.data_anchor_sampling_prob = 0.5
_C.min_face_size = 6.0
_C.apply_distort = True
_C.apply_expand = False
_C.img_mean = np.array([104., 117., 123.])[:, np.newaxis, np.newaxis].astype(
    'float32')
_C.resize_width = 640
_C.resize_height = 640
_C.scale = 1 / 127.0
_C.anchor_sampling = True
_C.filter_min_face = True


_C.IS_MONOCHROME = False


# anchor config
_C.FEATURE_MAPS = [160, 80, 40, 20, 10, 5]
_C.INPUT_SIZE = 640
_C.STEPS = [4, 8, 16, 32, 64, 128]
_C.ANCHOR_SIZES = [16, 32, 64, 128, 256, 512]
_C.CLIP = False
_C.VARIANCE = [0.1, 0.2]

# detection config
_C.NMS_THRESH = 0.3
_C.NMS_TOP_K = 5000
_C.TOP_K = 750
_C.CONF_THRESH = 0.01

# loss config
_C.NEG_POS_RATIOS = 3
_C.NUM_CLASSES = 2
_C.USE_NMS = True

# dataset config
_C.HOME = '/mnt/'  ## change here ----------

# face config
_C.FACE = EasyDict()
_C.FACE.TRAIN_FILE = './data/face_train.txt'
_C.FACE.VAL_FILE = './data/face_val.txt'
_C.FACE.WIDER_DIR = '/mnt/WIDER_FACE'  ## change here ---------
_C.FACE.OVERLAP_THRESH = [0.1, 0.35, 0.5]

@@ -0,0 +1,64 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT license.

import os
from easydict import EasyDict
import numpy as np


_C = EasyDict()
cfg = _C

# data augmentation config
_C.expand_prob = 0.5
_C.expand_max_ratio = 2
_C.hue_prob = 0.5
_C.hue_delta = 18
_C.contrast_prob = 0.5
_C.contrast_delta = 0.5
_C.saturation_prob = 0.5
_C.saturation_delta = 0.5
_C.brightness_prob = 0.5
_C.brightness_delta = 0.125
_C.data_anchor_sampling_prob = 0.5
_C.min_face_size = 1.0
_C.apply_distort = True
_C.apply_expand = False
_C.img_mean = np.array([104., 117., 123.])[:, np.newaxis, np.newaxis].astype(
    'float32')
_C.resize_width = 320
_C.resize_height = 320
_C.scale = 1 / 127.0
_C.anchor_sampling = True
_C.filter_min_face = True


_C.IS_MONOCHROME = True

# anchor config
_C.FEATURE_MAPS = [40, 40, 20, 20]
_C.INPUT_SIZE = 320
_C.STEPS = [8, 8, 16, 16]
_C.ANCHOR_SIZES = [8, 16, 32, 48]
_C.CLIP = False
_C.VARIANCE = [0.1, 0.2]

# detection config
_C.NMS_THRESH = 0.3
_C.NMS_TOP_K = 5000
_C.TOP_K = 750
_C.CONF_THRESH = 0.05

# loss config
_C.NEG_POS_RATIOS = 3
_C.NUM_CLASSES = 2
_C.USE_NMS = True

# dataset config
_C.HOME = '/mnt/'

# face config
_C.FACE = EasyDict()
_C.FACE.TRAIN_FILE = './data/face_train.txt'
_C.FACE.VAL_FILE = './data/face_val.txt'
_C.FACE.WIDER_DIR = '/mnt/WIDER_FACE'
_C.FACE.OVERLAP_THRESH = [0.1, 0.35, 0.5]

@@ -0,0 +1,115 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT license.

import torch
from PIL import Image, ImageDraw
import torch.utils.data as data
import numpy as np
import random
import sys; sys.path.append('../')
from utils.augmentations import preprocess


class WIDERDetection(data.Dataset):
    """WIDER FACE detection dataset."""

    def __init__(self, list_file, mode='train', mono_mode=False):
        super(WIDERDetection, self).__init__()
        self.mode = mode
        self.mono_mode = mono_mode
        self.fnames = []
        self.boxes = []
        self.labels = []

        with open(list_file) as f:
            lines = f.readlines()

        # Each line: file name, face count, then five numbers per face
        # (x, y, w, h, class label).
        for line in lines:
            line = line.strip().split()
            num_faces = int(line[1])
            box = []
            label = []
            for i in range(num_faces):
                x = float(line[2 + 5 * i])
                y = float(line[3 + 5 * i])
                w = float(line[4 + 5 * i])
                h = float(line[5 + 5 * i])
                c = int(line[6 + 5 * i])
                if w <= 0 or h <= 0:
                    continue
                box.append([x, y, x + w, y + h])
                label.append(c)
            if len(box) > 0:
                self.fnames.append(line[0])
                self.boxes.append(box)
                self.labels.append(label)

        self.num_samples = len(self.boxes)

    def __len__(self):
        return self.num_samples

    def __getitem__(self, index):
        img, target, h, w = self.pull_item(index)
        return img, target

    def pull_item(self, index):
        while True:
            image_path = self.fnames[index]
            img = Image.open(image_path)
            if img.mode == 'L':
                img = img.convert('RGB')

            im_width, im_height = img.size
            boxes = self.annotransform(
                np.array(self.boxes[index]), im_width, im_height)
            label = np.array(self.labels[index])
            bbox_labels = np.hstack((label[:, np.newaxis], boxes)).tolist()
            img, sample_labels = preprocess(
                img, bbox_labels, self.mode, image_path)
            sample_labels = np.array(sample_labels)
            if len(sample_labels) > 0:
                target = np.hstack(
                    (sample_labels[:, 1:], sample_labels[:, 0][:, np.newaxis]))

                assert (target[:, 2] > target[:, 0]).any()
                assert (target[:, 3] > target[:, 1]).any()
                break
            else:
                # No valid boxes survived preprocessing; resample an image.
                index = random.randrange(0, self.num_samples)

        if self.mono_mode:
            # ITU-R 601 luma transform to convert RGB to monochrome.
            im = 0.299 * img[0] + 0.587 * img[1] + 0.114 * img[2]
            return torch.from_numpy(np.expand_dims(im, axis=0)), target, im_height, im_width

        return torch.from_numpy(img), target, im_height, im_width

    def annotransform(self, boxes, im_width, im_height):
        # Normalize box coordinates to [0, 1] relative to image size.
        boxes[:, 0] /= im_width
        boxes[:, 1] /= im_height
        boxes[:, 2] /= im_width
        boxes[:, 3] /= im_height
        return boxes


def detection_collate(batch):
    """Custom collate fn for dealing with batches of images that have a
    different number of associated object annotations (bounding boxes).

    Arguments:
        batch: (tuple) A tuple of tensor images and lists of annotations

    Return:
        A tuple containing:
        1) (tensor) batch of images stacked on their 0 dim
        2) (list of tensors) annotations for a given image, stacked on their 0 dim
    """
    targets = []
    imgs = []
    for sample in batch:
        imgs.append(sample[0])
        targets.append(torch.FloatTensor(sample[1]))
    return torch.stack(imgs, 0), targets

@@ -0,0 +1,191 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT license.

from __future__ import division
from __future__ import absolute_import
from __future__ import print_function

import os
import torch
import argparse
import torch.nn as nn
import torch.utils.data as data
import torch.backends.cudnn as cudnn
import torchvision.transforms as transforms

import cv2
import time
import numpy as np
from PIL import Image, ImageFilter

from data.config import cfg
from torch.autograd import Variable
from utils.augmentations import to_chw_bgr

from importlib import import_module

import warnings
warnings.filterwarnings("ignore")


parser = argparse.ArgumentParser(description='face detection dump')

parser.add_argument('--model', type=str,
                    default='weights/rpool_face_c.pth', help='trained model')
parser.add_argument('--model_arch',
                    default='RPool_Face_C', type=str,
                    choices=['RPool_Face_C', 'RPool_Face_B', 'RPool_Face_A', 'RPool_Face_Quant'],
                    help='choose architecture among rpool variants')
parser.add_argument('--image_folder', default=None, type=str, help='folder containing images')
parser.add_argument('--save_model_npy_dir', default=None, type=str, help='Directory for saving model in numpy array format')
parser.add_argument('--save_traces_npy_dir', default=None, type=str, help='Directory for saving RNNPool input and output traces in numpy array format')


args = parser.parse_args()


use_cuda = torch.cuda.is_available()

if use_cuda:
    torch.set_default_tensor_type('torch.cuda.FloatTensor')
else:
    torch.set_default_tensor_type('torch.FloatTensor')


def saveModelNpy(net):
    if os.path.isdir(args.save_model_npy_dir) is False:
        try:
            os.mkdir(args.save_model_npy_dir)
        except OSError:
            print("Creation of the directory %s failed" % args.save_model_npy_dir)
            return

    # Save the FastGRNN parameters of both RNNPool cells (row and column RNNs).
    np.save(args.save_model_npy_dir+'/W1.npy', net.rnn_model.cell_rnn.cell.W.cpu().detach().numpy())
    np.save(args.save_model_npy_dir+'/W2.npy', net.rnn_model.cell_bidirrnn.cell.W.cpu().detach().numpy())
    np.save(args.save_model_npy_dir+'/U1.npy', net.rnn_model.cell_rnn.cell.U.cpu().detach().numpy())
    np.save(args.save_model_npy_dir+'/U2.npy', net.rnn_model.cell_bidirrnn.cell.U.cpu().detach().numpy())
    np.save(args.save_model_npy_dir+'/Bg1.npy', net.rnn_model.cell_rnn.cell.bias_gate.cpu().detach().numpy())
    np.save(args.save_model_npy_dir+'/Bg2.npy', net.rnn_model.cell_bidirrnn.cell.bias_gate.cpu().detach().numpy())
    np.save(args.save_model_npy_dir+'/Bh1.npy', net.rnn_model.cell_rnn.cell.bias_update.cpu().detach().numpy())
    np.save(args.save_model_npy_dir+'/Bh2.npy', net.rnn_model.cell_bidirrnn.cell.bias_update.cpu().detach().numpy())
    np.save(args.save_model_npy_dir+'/nu1.npy', net.rnn_model.cell_rnn.cell.nu.cpu().detach().numpy())
    np.save(args.save_model_npy_dir+'/nu2.npy', net.rnn_model.cell_bidirrnn.cell.nu.cpu().detach().numpy())
    np.save(args.save_model_npy_dir+'/zeta1.npy', net.rnn_model.cell_rnn.cell.zeta.cpu().detach().numpy())
    np.save(args.save_model_npy_dir+'/zeta2.npy', net.rnn_model.cell_bidirrnn.cell.zeta.cpu().detach().numpy())


activation = {}
def get_activation(name):
    # Forward hook that records a module's output under the given name.
    def hook(model, input, output):
        activation[name] = output.detach()
    return hook


def saveTracesNpy(net, img_list):
    if os.path.isdir(args.save_traces_npy_dir) is False:
        try:
            os.mkdir(args.save_traces_npy_dir)
        except OSError:
            print("Creation of the directory %s failed" % args.save_traces_npy_dir)
            return

    if os.path.isdir(os.path.join(args.save_traces_npy_dir, 'inputs')) is False:
        try:
            os.mkdir(os.path.join(args.save_traces_npy_dir, 'inputs'))
        except OSError:
            print("Creation of the directory %s failed" % os.path.join(args.save_traces_npy_dir, 'inputs'))
            return

    if os.path.isdir(os.path.join(args.save_traces_npy_dir, 'outputs')) is False:
        try:
            os.mkdir(os.path.join(args.save_traces_npy_dir, 'outputs'))
        except OSError:
            print("Creation of the directory %s failed" % os.path.join(args.save_traces_npy_dir, 'outputs'))
            return

    inputDims = net.rnn_model.inputDims
    nRows = net.rnn_model.nRows
    nCols = net.rnn_model.nCols
    count = 0
    for img_path in img_list:
        # img_path entries are already full paths (joined in __main__ below).
        img = Image.open(img_path)

        img = img.convert('RGB')

        img = np.array(img)
        # Resize so the image area matches VGA (640x480), keeping aspect ratio.
        max_im_shrink = np.sqrt(
            640 * 480 / (img.shape[0] * img.shape[1]))
        image = cv2.resize(img, None, None, fx=max_im_shrink,
                           fy=max_im_shrink, interpolation=cv2.INTER_LINEAR)

        x = to_chw_bgr(image)
        x = x.astype('float32')
        x -= cfg.img_mean
        x = x[[2, 1, 0], :, :]

        x = Variable(torch.from_numpy(x).unsqueeze(0))
        if use_cuda:
            x = x.cuda()
        t1 = time.time()
        y = net(x)

        # Patches fed into RNNPool, captured by the 'prepatch' hook.
        patches = activation['prepatch']
        patches = torch.cat(torch.unbind(patches, dim=2), dim=0)
        patches = torch.reshape(patches, (-1, inputDims, nRows, nCols))

        # Corresponding RNNPool outputs, captured by the 'rnn_model' hook.
        rnnX = activation['rnn_model']

        patches_all = torch.stack(torch.split(patches, split_size_or_sections=1, dim=0), dim=-1)
        rnnX_all = torch.stack(torch.split(rnnX, split_size_or_sections=1, dim=0), dim=-1)

        for k in range(patches_all.shape[-1]):
            patches_tosave = patches_all[0, :, :, :, k].cpu().numpy().transpose(1, 2, 0)
            rnnX_tosave = rnnX_all[0, :, k].cpu().numpy()
            np.save(args.save_traces_npy_dir+'/inputs/trace_'+str(count)+'_'+str(k)+'.npy', patches_tosave)
            np.save(args.save_traces_npy_dir+'/outputs/trace_'+str(count)+'_'+str(k)+'.npy', rnnX_tosave)

        count += 1


if __name__ == '__main__':

    module = import_module('models.' + args.model_arch)
    net = module.build_s3fd('test', cfg.NUM_CLASSES)

    # net = torch.nn.DataParallel(net)

    checkpoint_dict = torch.load(args.model)

    model_dict = net.state_dict()

    model_dict.update(checkpoint_dict)
    net.load_state_dict(model_dict)

    net.eval()

    if use_cuda:
        net.cuda()
        cudnn.benchmark = True

    if args.save_model_npy_dir is not None:
        saveModelNpy(net)

    if args.save_traces_npy_dir is not None:
        net.unfold.register_forward_hook(get_activation('prepatch'))
        net.rnn_model.register_forward_hook(get_activation('rnn_model'))
        img_path = args.image_folder
        img_list = [os.path.join(img_path, x)
                    for x in os.listdir(img_path)]
        saveTracesNpy(net, img_list)

@@ -0,0 +1,133 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT license.

import torch
import torch.nn as nn
import torch.utils.data as data
import torch.backends.cudnn as cudnn
import torchvision.transforms as transforms

import os
import time
import argparse
import numpy as np
from PIL import Image
import cv2

from data.choose_config import cfg
cfg = cfg.cfg

from utils.augmentations import to_chw_bgr

from importlib import import_module


parser = argparse.ArgumentParser(description='face detection demo')
parser.add_argument('--save_dir', type=str, default='results/',
                    help='Directory for detect result')
parser.add_argument('--model', type=str,
                    default='weights/rpool_face_c.pth', help='trained model')
parser.add_argument('--thresh', default=0.17, type=float,
                    help='Final confidence threshold')
parser.add_argument('--model_arch',
                    default='RPool_Face_C', type=str,
                    choices=['RPool_Face_C', 'RPool_Face_Quant', 'RPool_Face_QVGA_monochrome'],
                    help='choose architecture among rpool variants')
parser.add_argument('--image_folder', default=None, type=str, help='folder containing images')


args = parser.parse_args()

if not os.path.exists(args.save_dir):
    os.makedirs(args.save_dir)

use_cuda = torch.cuda.is_available()

if use_cuda:
    torch.set_default_tensor_type('torch.cuda.FloatTensor')
else:
    torch.set_default_tensor_type('torch.FloatTensor')


def detect(net, img_path, thresh):
    img = Image.open(img_path)
    img = img.convert('RGB')
    img = np.array(img)
    height, width, _ = img.shape

    # Resize so the image area matches QVGA (320x240) or VGA (640x480),
    # preserving the aspect ratio.
    if os.environ['IS_QVGA_MONO'] == '1':
        max_im_shrink = np.sqrt(
            320 * 240 / (img.shape[0] * img.shape[1]))
    else:
        max_im_shrink = np.sqrt(
            640 * 480 / (img.shape[0] * img.shape[1]))

    image = cv2.resize(img, None, None, fx=max_im_shrink,
                       fy=max_im_shrink, interpolation=cv2.INTER_LINEAR)
    # img = cv2.resize(img, (640, 640))
    x = to_chw_bgr(image)
    x = x.astype('float32')
    x -= cfg.img_mean
    x = x[[2, 1, 0], :, :]

    if cfg.IS_MONOCHROME:
        # ITU-R 601 luma transform to convert RGB to monochrome.
        x = 0.299 * x[0] + 0.587 * x[1] + 0.114 * x[2]
        x = torch.from_numpy(x).unsqueeze(0).unsqueeze(0)
    else:
        x = torch.from_numpy(x).unsqueeze(0)
    if use_cuda:
        x = x.cuda()
    t1 = time.time()
    y = net(x)
    detections = y.data
    scale = torch.Tensor([img.shape[1], img.shape[0],
                          img.shape[1], img.shape[0]])

    img = cv2.imread(img_path, cv2.IMREAD_COLOR)

    for i in range(detections.size(1)):
        j = 0
        while detections[0, i, j, 0] >= thresh:
            score = detections[0, i, j, 0]
            pt = (detections[0, i, j, 1:] * scale).cpu().numpy()
            # cv2 drawing functions expect integer pixel coordinates.
            left_up, right_bottom = (int(pt[0]), int(pt[1])), (int(pt[2]), int(pt[3]))
            j += 1
            cv2.rectangle(img, left_up, right_bottom, (0, 0, 255), 2)
            conf = "{:.3f}".format(score)
            point = (int(left_up[0]), int(left_up[1] - 5))
            cv2.putText(img, conf, point, cv2.FONT_HERSHEY_COMPLEX,
                        0.6, (0, 255, 0), 1)

    t2 = time.time()
    print('detect:{} timer:{}'.format(img_path, t2 - t1))

    cv2.imwrite(os.path.join(args.save_dir, os.path.basename(img_path)), img)


if __name__ == '__main__':

    module = import_module('models.' + args.model_arch)
    net = module.build_s3fd('test', cfg.NUM_CLASSES)

    net = torch.nn.DataParallel(net)

    checkpoint_dict = torch.load(args.model)

    model_dict = net.state_dict()

    model_dict.update(checkpoint_dict)
    net.load_state_dict(model_dict)

    net.eval()

    if use_cuda:
        net.cuda()
        cudnn.benchmark = True

    img_path = args.image_folder
    img_list = [os.path.join(img_path, x)
                for x in os.listdir(img_path)]
    for path in img_list:
        detect(net, path, args.thresh)
Binary file not shown.
After: Size: 390 KiB

@@ -0,0 +1,5 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT license.

from .functions import *
from .modules import *

@@ -0,0 +1,306 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT license.

import torch


def point_form(boxes):
    """ Convert prior_boxes to (xmin, ymin, xmax, ymax)
    representation for comparison to point form ground truth data.
    Args:
        boxes: (tensor) center-size default boxes from priorbox layers.
    Return:
        boxes: (tensor) converted (xmin, ymin, xmax, ymax) form of boxes.
    """
    return torch.cat((boxes[:, :2] - boxes[:, 2:] / 2,     # xmin, ymin
                      boxes[:, :2] + boxes[:, 2:] / 2), 1)  # xmax, ymax


def center_size(boxes):
    """ Convert prior_boxes to (cx, cy, w, h)
    representation for comparison to center-size form ground truth data.
    Args:
        boxes: (tensor) point_form boxes
    Return:
        boxes: (tensor) converted (cx, cy, w, h) form of boxes.
    """
    return torch.cat([(boxes[:, 2:] + boxes[:, :2]) / 2,  # cx, cy
                      boxes[:, 2:] - boxes[:, :2]], 1)    # w, h
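

def _example_box_forms():
    # Illustrative sanity check (not part of the original file): the two box
    # parameterizations above are inverses of each other.
    b = torch.tensor([[10., 20., 50., 80.]])  # (xmin, ymin, xmax, ymax)
    assert torch.allclose(point_form(center_size(b)), b)
    return center_size(b)  # tensor([[30., 50., 40., 60.]]) = (cx, cy, w, h)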


def intersect(box_a, box_b):
    """ We resize both tensors to [A,B,2] without new malloc:
    [A,2] -> [A,1,2] -> [A,B,2]
    [B,2] -> [1,B,2] -> [A,B,2]
    Then we compute the area of intersect between box_a and box_b.
    Args:
        box_a: (tensor) bounding boxes, Shape: [A,4].
        box_b: (tensor) bounding boxes, Shape: [B,4].
    Return:
        (tensor) intersection area, Shape: [A,B].
    """
    A = box_a.size(0)
    B = box_b.size(0)
    max_xy = torch.min(box_a[:, 2:].unsqueeze(1).expand(A, B, 2),
                       box_b[:, 2:].unsqueeze(0).expand(A, B, 2))
    min_xy = torch.max(box_a[:, :2].unsqueeze(1).expand(A, B, 2),
                       box_b[:, :2].unsqueeze(0).expand(A, B, 2))
    inter = torch.clamp((max_xy - min_xy), min=0)
    return inter[:, :, 0] * inter[:, :, 1]


def jaccard(box_a, box_b):
    """Compute the jaccard overlap of two sets of boxes. The jaccard overlap
    is simply the intersection over union of two boxes. Here we operate on
    ground truth boxes and default boxes.
    E.g.:
        A ∩ B / A ∪ B = A ∩ B / (area(A) + area(B) - A ∩ B)
    Args:
        box_a: (tensor) Ground truth bounding boxes, Shape: [num_objects,4]
        box_b: (tensor) Prior boxes from priorbox layers, Shape: [num_priors,4]
    Return:
        jaccard overlap: (tensor) Shape: [box_a.size(0), box_b.size(0)]
    """
    inter = intersect(box_a, box_b)
    area_a = ((box_a[:, 2] - box_a[:, 0]) *
              (box_a[:, 3] - box_a[:, 1])).unsqueeze(1).expand_as(inter)  # [A,B]
    area_b = ((box_b[:, 2] - box_b[:, 0]) *
              (box_b[:, 3] - box_b[:, 1])).unsqueeze(0).expand_as(inter)  # [A,B]
    union = area_a + area_b - inter
    return inter / union  # [A,B]
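

def _example_jaccard():
    # Illustrative worked example (not part of the original file): two 2x2
    # boxes offset by (1, 1) overlap in a 1x1 square, so
    # IoU = 1 / (4 + 4 - 1) ≈ 0.1429.
    a = torch.tensor([[0., 0., 2., 2.]])
    b = torch.tensor([[1., 1., 3., 3.]])
    return jaccard(a, b)  # tensor([[0.1429]])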


def match(threshold, truths, priors, variances, labels, loc_t, conf_t, idx):
    """Match each prior box with the ground truth box of the highest jaccard
    overlap, encode the bounding boxes, then return the matched indices
    corresponding to both confidence and location preds.
    Args:
        threshold: (float) The overlap threshold used when matching boxes.
        truths: (tensor) Ground truth boxes, Shape: [num_obj, num_priors].
        priors: (tensor) Prior boxes from priorbox layers, Shape: [n_priors,4].
        variances: (tensor) Variances corresponding to each prior coord,
            Shape: [num_priors, 4].
        labels: (tensor) All the class labels for the image, Shape: [num_obj].
        loc_t: (tensor) Tensor to be filled w/ encoded location targets.
        conf_t: (tensor) Tensor to be filled w/ matched indices for conf preds.
        idx: (int) current batch index
    Return:
        The matched indices corresponding to 1) location and 2) confidence preds.
    """
    # jaccard index
    overlaps = jaccard(
        truths,
        point_form(priors)
    )
    # (Bipartite Matching)
    # [1,num_objects] best prior for each ground truth
    best_prior_overlap, best_prior_idx = overlaps.max(1, keepdim=True)
    # [1,num_priors] best ground truth for each prior
    best_truth_overlap, best_truth_idx = overlaps.max(
        0, keepdim=True)  # 0-2000
    best_truth_idx.squeeze_(0)
    best_truth_overlap.squeeze_(0)
    best_prior_idx.squeeze_(1)
    best_prior_overlap.squeeze_(1)
    best_truth_overlap.index_fill_(0, best_prior_idx, 2)  # ensure best prior
    # TODO refactor: index best_prior_idx with long tensor
    # ensure every gt matches with its prior of max overlap
    for j in range(best_prior_idx.size(0)):
        best_truth_idx[best_prior_idx[j]] = j
    _th1, _th2, _th3 = threshold  # _th1 = 0.1, _th2 = 0.35, _th3 = 0.5

    N = (torch.sum(best_prior_overlap >= _th2) +
         torch.sum(best_prior_overlap >= _th3)) // 2
    matches = truths[best_truth_idx]     # Shape: [num_priors,4]
    conf = labels[best_truth_idx]        # Shape: [num_priors]
    conf[best_truth_overlap < _th2] = 0  # label as background

    # Second matching stage: promote priors whose overlap lies in (_th1, _th2).
    best_truth_overlap_clone = best_truth_overlap.clone()
    add_idx = best_truth_overlap_clone.gt(
        _th1).eq(best_truth_overlap_clone.lt(_th2))
    best_truth_overlap_clone[~add_idx] = 0
    stage2_overlap, stage2_idx = best_truth_overlap_clone.sort(descending=True)

    stage2_overlap = stage2_overlap.gt(_th1)

    if N > 0:
        N = torch.sum(stage2_overlap[:N]) if torch.sum(
            stage2_overlap[:N]) < N else N
        conf[stage2_idx[:N]] += 1

    loc = encode(matches, priors, variances)
    loc_t[idx] = loc    # [num_priors,4] encoded offsets to learn
    conf_t[idx] = conf  # [num_priors] top class label for each prior


def match_ssd(threshold, truths, priors, variances, labels, loc_t, conf_t, idx):
    """Match each prior box with the ground truth box of the highest jaccard
    overlap, encode the bounding boxes, then return the matched indices
    corresponding to both confidence and location preds.
    Args:
        threshold: (float) The overlap threshold used when matching boxes.
        truths: (tensor) Ground truth boxes, Shape: [num_obj, num_priors].
        priors: (tensor) Prior boxes from priorbox layers, Shape: [n_priors,4].
        variances: (tensor) Variances corresponding to each prior coord,
            Shape: [num_priors, 4].
        labels: (tensor) All the class labels for the image, Shape: [num_obj].
        loc_t: (tensor) Tensor to be filled w/ encoded location targets.
        conf_t: (tensor) Tensor to be filled w/ matched indices for conf preds.
        idx: (int) current batch index
    Return:
        The matched indices corresponding to 1) location and 2) confidence preds.
    """
    # jaccard index
    overlaps = jaccard(
        truths,
        point_form(priors)
    )
    # (Bipartite Matching)
    # [1,num_objects] best prior for each ground truth
    best_prior_overlap, best_prior_idx = overlaps.max(1, keepdim=True)
    # [1,num_priors] best ground truth for each prior
    best_truth_overlap, best_truth_idx = overlaps.max(
        0, keepdim=True)  # 0-2000
    best_truth_idx.squeeze_(0)
    best_truth_overlap.squeeze_(0)
    best_prior_idx.squeeze_(1)
    best_prior_overlap.squeeze_(1)
    best_truth_overlap.index_fill_(0, best_prior_idx, 2)  # ensure best prior
    # TODO refactor: index best_prior_idx with long tensor
    # ensure every gt matches with its prior of max overlap
    for j in range(best_prior_idx.size(0)):
        best_truth_idx[best_prior_idx[j]] = j
    matches = truths[best_truth_idx]          # Shape: [num_priors,4]
    conf = labels[best_truth_idx]             # Shape: [num_priors]
    conf[best_truth_overlap < threshold] = 0  # label as background
    loc = encode(matches, priors, variances)
    loc_t[idx] = loc    # [num_priors,4] encoded offsets to learn
    conf_t[idx] = conf  # [num_priors] top class label for each prior


def encode(matched, priors, variances):
    """Encode the variances from the priorbox layers into the ground truth
    boxes we have matched (based on jaccard overlap) with the prior boxes.
    Args:
        matched: (tensor) Coords of ground truth for each prior in point-form
            Shape: [num_priors, 4].
        priors: (tensor) Prior boxes in center-offset form
            Shape: [num_priors,4].
        variances: (list[float]) Variances of priorboxes
    Return:
        encoded boxes (tensor), Shape: [num_priors, 4]
    """

    # dist b/t match center and prior's center
    g_cxcy = (matched[:, :2] + matched[:, 2:]) / 2 - priors[:, :2]
    # encode variance
    g_cxcy /= (variances[0] * priors[:, 2:])
    # match wh / prior wh
    g_wh = (matched[:, 2:] - matched[:, :2]) / priors[:, 2:]
    g_wh = torch.log(g_wh) / variances[1]
    # return target for smooth_l1_loss
    return torch.cat([g_cxcy, g_wh], 1)  # [num_priors,4]


# Adapted from https://github.com/Hakuyume/chainer-ssd
def decode(loc, priors, variances):
    """Decode locations from predictions using priors to undo
    the encoding we did for offset regression at train time.
    Args:
        loc (tensor): location predictions for loc layers,
            Shape: [num_priors,4]
        priors (tensor): Prior boxes in center-offset form.
            Shape: [num_priors,4].
        variances: (list[float]) Variances of priorboxes
    Return:
        decoded bounding box predictions
    """

    boxes = torch.cat((
        priors[:, :2] + loc[:, :2] * variances[0] * priors[:, 2:],
        priors[:, 2:] * torch.exp(loc[:, 2:] * variances[1])), 1)
    boxes[:, :2] -= boxes[:, 2:] / 2
    boxes[:, 2:] += boxes[:, :2]
    return boxes
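

def _example_encode_decode():
    # Illustrative round trip (not part of the original file): encode and
    # decode invert each other for point-form boxes and center-offset priors.
    variances = [0.1, 0.2]                         # cfg.VARIANCE
    priors = torch.tensor([[0.5, 0.5, 0.2, 0.2]])  # (cx, cy, w, h)
    gt = torch.tensor([[0.4, 0.4, 0.6, 0.65]])     # (xmin, ymin, xmax, ymax)
    loc = encode(gt, priors, variances)
    assert torch.allclose(decode(loc, priors, variances), gt, atol=1e-6)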


def log_sum_exp(x):
    """Utility function for computing log_sum_exp.
    This will be used to determine the unaveraged confidence loss across
    all examples in a batch.
    Args:
        x (Variable(tensor)): conf_preds from conf layers
    """
    x_max = x.data.max()
    return torch.log(torch.sum(torch.exp(x - x_max), 1, keepdim=True)) + x_max


# Original author: Francisco Massa:
# https://github.com/fmassa/object-detection.torch
# Ported to PyTorch by Max deGroot (02/01/2017)
def nms(boxes, scores, overlap=0.5, top_k=200):
    """Apply non-maximum suppression at test time to avoid detecting too many
    overlapping bounding boxes for a given object.
    Args:
        boxes: (tensor) The location preds for the img, Shape: [num_priors,4].
        scores: (tensor) The class pred scores for the img, Shape: [num_priors].
        overlap: (float) The overlap thresh for suppressing unnecessary boxes.
        top_k: (int) The maximum number of box preds to consider.
    Return:
        The indices of the kept boxes with respect to num_priors.
    """

    keep = scores.new(scores.size(0)).zero_().long()
    if boxes.numel() == 0:
        return keep
    x1 = boxes[:, 0]
    y1 = boxes[:, 1]
    x2 = boxes[:, 2]
    y2 = boxes[:, 3]
    area = torch.mul(x2 - x1, y2 - y1)
    v, idx = scores.sort(0)  # sort in ascending order
    # I = I[v >= 0.01]
    idx = idx[-top_k:]  # indices of the top-k largest vals
    xx1 = boxes.new()
    yy1 = boxes.new()
    xx2 = boxes.new()
    yy2 = boxes.new()
    w = boxes.new()
    h = boxes.new()

    # keep = torch.Tensor()
    count = 0
    while idx.numel() > 0:
        i = idx[-1]  # index of current largest val
        # keep.append(i)
        keep[count] = i
        count += 1
        if idx.size(0) == 1:
            break
        idx = idx[:-1]  # remove kept element from view
        # load bboxes of next highest vals
        torch.index_select(x1, 0, idx, out=xx1)
        torch.index_select(y1, 0, idx, out=yy1)
        torch.index_select(x2, 0, idx, out=xx2)
        torch.index_select(y2, 0, idx, out=yy2)
        # store element-wise max with next highest score
        xx1 = torch.clamp(xx1, min=x1[i])
        yy1 = torch.clamp(yy1, min=y1[i])
        xx2 = torch.clamp(xx2, max=x2[i])
        yy2 = torch.clamp(yy2, max=y2[i])
        w.resize_as_(xx2)
        h.resize_as_(yy2)
        w = xx2 - xx1
        h = yy2 - yy1
        # check sizes of xx1 and xx2.. after each iteration
        w = torch.clamp(w, min=0.0)
        h = torch.clamp(h, min=0.0)
        inter = w * h
        # IoU = i / (area(a) + area(b) - i)
        rem_areas = torch.index_select(area, 0, idx)  # load remaining areas
        union = (rem_areas - inter) + area[i]
        IoU = inter / union  # store result in iou
        # keep only elements with an IoU <= overlap
        idx = idx[IoU.le(overlap)]
    return keep, count
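

def _example_nms(boxes, scores):
    # Illustrative usage sketch (not part of the original file): suppress
    # overlapping detections, then keep the surviving boxes and scores.
    ids, count = nms(boxes, scores, overlap=0.5, top_k=200)
    return boxes[ids[:count]], scores[ids[:count]]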

@@ -0,0 +1,8 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT license.

from .prior_box import PriorBox
from .detection import detect_function

__all__ = ['detect_function', 'PriorBox']

@@ -0,0 +1,59 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT license.

from __future__ import division
from __future__ import absolute_import
from __future__ import print_function

import torch

from ..bbox_utils import decode, nms


def detect_function(cfg, loc_data, conf_data, prior_data):
    """
    Args:
        loc_data: (tensor) Loc preds from loc layers
            Shape: [batch,num_priors*4]
        conf_data: (tensor) Conf preds from conf layers
            Shape: [batch*num_priors,num_classes]
        prior_data: (tensor) Prior boxes and variances from priorbox layers
            Shape: [1,num_priors,4]
    """
    with torch.no_grad():
        num = loc_data.size(0)
        num_priors = prior_data.size(0)

        conf_preds = conf_data.view(
            num, num_priors, cfg.NUM_CLASSES).transpose(2, 1)
        batch_priors = prior_data.view(-1, num_priors,
                                       4).expand(num, num_priors, 4)
        batch_priors = batch_priors.contiguous().view(-1, 4)

        decoded_boxes = decode(loc_data.view(-1, 4),
                               batch_priors, cfg.VARIANCE)
        decoded_boxes = decoded_boxes.view(num, num_priors, 4)

        output = torch.zeros(num, cfg.NUM_CLASSES, cfg.TOP_K, 5)

        for i in range(num):
            boxes = decoded_boxes[i].clone()
            conf_scores = conf_preds[i].clone()

            for cl in range(1, cfg.NUM_CLASSES):
                c_mask = conf_scores[cl].gt(cfg.CONF_THRESH)
                scores = conf_scores[cl][c_mask]

                if scores.dim() == 0:
                    continue
                l_mask = c_mask.unsqueeze(1).expand_as(boxes)
                boxes_ = boxes[l_mask].view(-1, 4)
                ids, count = nms(
                    boxes_, scores, cfg.NMS_THRESH, cfg.NMS_TOP_K)
                count = count if count < cfg.TOP_K else cfg.TOP_K

                output[i, cl, :count] = torch.cat((scores[ids[:count]].unsqueeze(1),
                                                   boxes_[ids[:count]]), 1)

        return output

@@ -0,0 +1,51 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT license.

import torch
from itertools import product as product
import math


class PriorBox(object):
    """Compute priorbox coordinates in center-offset form for each source
    feature map.
    """

    def __init__(self, input_size, feature_maps, cfg):
        super(PriorBox, self).__init__()
        self.imh = input_size[0]
        self.imw = input_size[1]

        # number of priors for feature map location (either 4 or 6)
        self.variance = cfg.VARIANCE or [0.1]
        # self.feature_maps = cfg.FEATURE_MAPS
        self.min_sizes = cfg.ANCHOR_SIZES
        self.steps = cfg.STEPS
        self.clip = cfg.CLIP
        for v in self.variance:
            if v <= 0:
                raise ValueError('Variances must be greater than 0')
        self.feature_maps = feature_maps

    def forward(self):
        mean = []
        for k in range(len(self.feature_maps)):
            feath = self.feature_maps[k][0]
            featw = self.feature_maps[k][1]
            for i, j in product(range(feath), range(featw)):
                f_kw = self.imw / self.steps[k]
                f_kh = self.imh / self.steps[k]

                cx = (j + 0.5) / f_kw
                cy = (i + 0.5) / f_kh

                s_kw = self.min_sizes[k] / self.imw
                s_kh = self.min_sizes[k] / self.imh

                mean += [cx, cy, s_kw, s_kh]

        output = torch.Tensor(mean).view(-1, 4)
        if self.clip:
            output.clamp_(max=1, min=0)
        return output
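

def _example_priors():
    # Illustrative usage sketch (not part of the original file). forward()
    # reads each feature_maps entry as an (h, w) pair, so callers pass
    # per-level sizes; cfg here is assumed to be the VGA config in
    # data/config.py.
    from data.config import cfg
    feature_maps = [[160, 160], [80, 80], [40, 40], [20, 20], [10, 10], [5, 5]]
    priors = PriorBox((640, 640), feature_maps, cfg).forward()
    return priors  # shape [34125, 4]: one (cx, cy, s_kw, s_kh) row per prior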

@@ -0,0 +1,8 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT license.

from .l2norm import L2Norm
from .multibox_loss import MultiBoxLoss

__all__ = ['L2Norm', 'MultiBoxLoss']
@@ -0,0 +1,29 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT license.

import torch
import torch.nn as nn
import torch.nn.init as init


class L2Norm(nn.Module):
    def __init__(self, n_channels, scale):
        super(L2Norm, self).__init__()
        self.n_channels = n_channels
        self.gamma = scale or None
        self.eps = 1e-10
        # One learnable scale per channel, initialized to gamma.
        self.weight = nn.Parameter(torch.Tensor(self.n_channels))
        self.reset_parameters()

    def reset_parameters(self):
        init.constant_(self.weight, self.gamma)

    def forward(self, x):
        # L2-normalize across channels, then rescale channel-wise.
        norm = x.pow(2).sum(dim=1, keepdim=True).sqrt() + self.eps
        x = torch.div(x, norm)
        out = self.weight.unsqueeze(0).unsqueeze(2).unsqueeze(3).expand_as(x) * x
        return out
|
@ -0,0 +1,118 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT license.

import math
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable

from ..bbox_utils import match, log_sum_exp, match_ssd


class MultiBoxLoss(nn.Module):
    """SSD Weighted Loss Function
    Compute Targets:
        1) Produce Confidence Target Indices by matching ground truth boxes
           with (default) 'priorboxes' that have jaccard index > threshold parameter
           (default threshold: 0.5).
        2) Produce localization target by 'encoding' variance into offsets of ground
           truth boxes and their matched 'priorboxes'.
        3) Hard negative mining to filter the excessive number of negative examples
           that comes with using a large number of default bounding boxes.
           (default negative:positive ratio 3:1)
    Objective Loss:
        L(x,c,l,g) = (Lconf(x, c) + αLloc(x,l,g)) / N
        Where, Lconf is the CrossEntropy Loss and Lloc is the SmoothL1 Loss
        weighted by α which is set to 1 by cross val.
        Args:
            c: class confidences,
            l: predicted boxes,
            g: ground truth boxes
            N: number of matched default boxes
        See: https://arxiv.org/pdf/1512.02325.pdf for more details.
    """

    def __init__(self, cfg, dataset, use_gpu=True):
        super(MultiBoxLoss, self).__init__()
        self.use_gpu = use_gpu
        self.num_classes = cfg.NUM_CLASSES
        self.negpos_ratio = cfg.NEG_POS_RATIOS
        self.variance = cfg.VARIANCE
        self.dataset = dataset

        self.threshold = cfg.FACE.OVERLAP_THRESH
        self.match = match

    def forward(self, predictions, targets):
        """Multibox Loss
        Args:
            predictions (tuple): A tuple containing loc preds, conf preds,
                and prior boxes from SSD net.
                conf shape: torch.size(batch_size,num_priors,num_classes)
                loc shape: torch.size(batch_size,num_priors,4)
                priors shape: torch.size(num_priors,4)

            targets (tensor): Ground truth boxes and labels for a batch,
                shape: [batch_size,num_objs,5] (last idx is the label).
        """
        loc_data, conf_data, priors = predictions
        num = loc_data.size(0)
        priors = priors[:loc_data.size(1), :]
        num_priors = (priors.size(0))
        num_classes = self.num_classes

        # match priors (default boxes) and ground truth boxes
        loc_t = torch.Tensor(num, num_priors, 4)
        conf_t = torch.LongTensor(num, num_priors)
        for idx in range(num):
            truths = targets[idx][:, :-1].data
            labels = targets[idx][:, -1].data
            defaults = priors.data
            self.match(self.threshold, truths, defaults, self.variance, labels,
                       loc_t, conf_t, idx)
        if self.use_gpu:
            loc_t = loc_t.cuda()
            conf_t = conf_t.cuda()
        # wrap targets
        loc_t = Variable(loc_t, requires_grad=False)
        conf_t = Variable(conf_t, requires_grad=False)

        pos = conf_t > 0
        num_pos = pos.sum(dim=1, keepdim=True)
        # Localization Loss (Smooth L1)
        # Shape: [batch,num_priors,4]
        pos_idx = pos.unsqueeze(pos.dim()).expand_as(loc_data)
        loc_p = loc_data[pos_idx].view(-1, 4)
        loc_t = loc_t[pos_idx].view(-1, 4)
        loss_l = F.smooth_l1_loss(loc_p, loc_t, size_average=False)

        # Compute max conf across batch for hard negative mining
        batch_conf = conf_data.view(-1, self.num_classes)
        loss_c = log_sum_exp(batch_conf) - \
            batch_conf.gather(1, conf_t.view(-1, 1))

        # Hard Negative Mining
        loss_c[pos.view(-1, 1)] = 0  # filter out pos boxes for now
        loss_c = loss_c.view(num, -1)
        _, loss_idx = loss_c.sort(1, descending=True)
        _, idx_rank = loss_idx.sort(1)
        num_pos = pos.long().sum(1, keepdim=True)
        num_neg = torch.clamp(self.negpos_ratio *
                              num_pos, max=pos.size(1) - 1)
        neg = idx_rank < num_neg.expand_as(idx_rank)

        # Confidence Loss Including Positive and Negative Examples
        pos_idx = pos.unsqueeze(2).expand_as(conf_data)
        neg_idx = neg.unsqueeze(2).expand_as(conf_data)
        conf_p = conf_data[(pos_idx + neg_idx).gt(0)
                           ].view(-1, self.num_classes)
        targets_weighted = conf_t[(pos + neg).gt(0)]
        loss_c = F.cross_entropy(conf_p, targets_weighted, size_average=False)

        # Sum of losses: L(x,c,l,g) = (Lconf(x, c) + αLloc(x,l,g)) / N
        N = num_pos.data.sum() if num_pos.data.sum() > 0 else num
        loss_l /= N
        loss_c /= N
        return loss_l, loss_c
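The chained `sort().sort()` in the hard-negative-mining block is a compact rank computation: the first sort orders priors by loss, the second recovers each prior's rank in that ordering, so the `num_neg` hardest negatives can be selected with a single comparison. A small demonstration with made-up loss values:

```python
import torch

loss_c = torch.tensor([[0.1, 0.9, 0.4, 0.7]])
_, loss_idx = loss_c.sort(1, descending=True)  # prior indices ordered by loss
_, idx_rank = loss_idx.sort(1)                 # rank of each prior in that order
print(idx_rank)   # tensor([[3, 0, 2, 1]]): prior 1 is the hardest (rank 0)

num_neg = torch.tensor([[2]])
neg = idx_rank < num_neg.expand_as(idx_rank)
print(neg)        # tensor([[False, True, False, True]]): the two hardest kept
```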
@ -0,0 +1,382 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT license.

from __future__ import division
from __future__ import absolute_import
from __future__ import print_function

import os
import torch
import torch.nn as nn
import torch.nn.init as init
import torch.nn.functional as F
from torch.autograd import Variable  # needed by the __main__ demo below

from layers import *
from data.config import cfg
import numpy as np

from edgeml_pytorch.graph.rnnpool import *


class S3FD(nn.Module):
    """Single Shot Multibox Architecture
    The network is composed of a base network followed by the
    added multibox conv layers. Each multibox layer branches into
        1) conv2d for class conf scores
        2) conv2d for localization predictions
        3) associated priorbox layer to produce default bounding
           boxes specific to the layer's feature map size.
    See: https://arxiv.org/pdf/1512.02325.pdf for more details.

    Args:
        phase: (string) Can be "test" or "train"
        size: input image size
        base: base network layers for input
        extras: extra layers that feed to multibox loc and conf layers
        head: "multibox head" consists of loc and conf conv layers
    """

    def __init__(self, phase, base, head, num_classes):
        super(S3FD, self).__init__()
        self.phase = phase
        self.num_classes = num_classes
        '''
        self.priorbox = PriorBox(size,cfg)
        self.priors = Variable(self.priorbox.forward(), volatile=True)
        '''
        # SSD network

        self.unfold = nn.Unfold(kernel_size=(8, 8), stride=(4, 4))

        self.rnn_model = RNNPool(8, 8, 16, 16, 3)

        self.mob = nn.ModuleList(base)
        # Layer learns to scale the l2 normalized features from conv4_3
        self.L2Norm3_3 = L2Norm(24, 10)
        self.L2Norm4_3 = L2Norm(32, 8)
        self.L2Norm5_3 = L2Norm(64, 5)

        self.loc = nn.ModuleList(head[0])
        self.conf = nn.ModuleList(head[1])

        if self.phase == 'test':
            self.softmax = nn.Softmax(dim=-1)
            # self.detect = Detect(cfg)

    def forward(self, x):
        """Applies network layers and ops on input image(s) x.

        Args:
            x: input image or batch of images. Shape: [batch,3,300,300].

        Return:
            Depending on phase:
            test:
                Variable(tensor) of output class label predictions,
                confidence score, and corresponding location predictions for
                each object detected. Shape: [batch,topk,7]

            train:
                list of concat outputs from:
                    1: confidence layers, Shape: [batch*num_priors,num_classes]
                    2: localization layers, Shape: [batch,num_priors*4]
                    3: priorbox layers, Shape: [2,num_priors*4]
        """
        size = x.size()[2:]
        batch_size = x.shape[0]
        sources = list()
        loc = list()
        conf = list()

        patches = self.unfold(x)
        patches = torch.cat(torch.unbind(patches, dim=2), dim=0)
        patches = torch.reshape(patches, (-1, 3, 8, 8))

        output_x = int((x.shape[2] - 8) / 4 + 1)
        output_y = int((x.shape[3] - 8) / 4 + 1)

        rnnX = self.rnn_model(patches, int(batch_size) * output_x * output_y)

        x = torch.stack(torch.split(rnnX, split_size_or_sections=int(batch_size), dim=0), dim=2)

        x = F.fold(x, kernel_size=(1, 1), output_size=(output_x, output_y))

        x = F.pad(x, (0, 1, 0, 1), mode='replicate')

        for k in range(2):
            x = self.mob[k](x)

        s = self.L2Norm3_3(x)
        sources.append(s)

        for k in range(2, 5):
            x = self.mob[k](x)

        s = self.L2Norm4_3(x)
        sources.append(s)

        for k in range(5, 9):
            x = self.mob[k](x)

        s = self.L2Norm5_3(x)
        sources.append(s)

        for k in range(9, 12):
            x = self.mob[k](x)
        sources.append(x)

        for k in range(12, 14):
            x = self.mob[k](x)
        sources.append(x)

        for k in range(14, 15):
            x = self.mob[k](x)
        sources.append(x)

        # apply multibox head to source layers

        loc_x = self.loc[0](sources[0])
        conf_x = self.conf[0](sources[0])

        max_conf, _ = torch.max(conf_x[:, 0:3, :, :], dim=1, keepdim=True)
        conf_x = torch.cat((max_conf, conf_x[:, 3:, :, :]), dim=1)

        loc.append(loc_x.permute(0, 2, 3, 1).contiguous())
        conf.append(conf_x.permute(0, 2, 3, 1).contiguous())

        for i in range(1, len(sources)):
            x = sources[i]
            conf.append(self.conf[i](x).permute(0, 2, 3, 1).contiguous())
            loc.append(self.loc[i](x).permute(0, 2, 3, 1).contiguous())

        features_maps = []
        for i in range(len(loc)):
            feat = []
            feat += [loc[i].size(1), loc[i].size(2)]
            features_maps += [feat]

        self.priorbox = PriorBox(size, features_maps, cfg)
        self.priors = self.priorbox.forward()

        loc = torch.cat([o.view(o.size(0), -1) for o in loc], 1)
        conf = torch.cat([o.view(o.size(0), -1) for o in conf], 1)

        if self.phase == 'test':
            output = detect_function(
                loc.view(loc.size(0), -1, 4),                 # loc preds
                self.softmax(conf.view(conf.size(0), -1,
                                       self.num_classes)),    # conf preds
                self.priors.type(type(x.data))                # default boxes
            )

        else:
            output = (
                loc.view(loc.size(0), -1, 4),
                conf.view(conf.size(0), -1, self.num_classes),
                self.priors
            )
        return output

    def load_weights(self, base_file):
        other, ext = os.path.splitext(base_file)
        if ext in ('.pkl', '.pth'):  # original `ext == '.pkl' or '.pth'` was always true
            print('Loading weights into state dict...')
            mdata = torch.load(base_file,
                               map_location=lambda storage, loc: storage)
            weights = mdata['weight']
            epoch = mdata['epoch']
            self.load_state_dict(weights)
            print('Finished!')
            return epoch
        print('Sorry only .pth and .pkl files supported.')
        return None  # original returned an undefined `epoch` on this path

    def xavier(self, param):
        init.xavier_uniform_(param)  # in-place variant; `xavier_uniform` is deprecated

    def weights_init(self, m):
        if isinstance(m, nn.Conv2d):
            self.xavier(m.weight.data)
            m.bias.data.zero_()


def _make_divisible(v, divisor, min_value=None):
    """
    This function is taken from the original tf repo.
    It ensures that all layers have a channel number that is divisible by 8
    It can be seen here:
    https://github.com/tensorflow/models/blob/master/research/slim/nets/mobilenet/mobilenet.py
    :param v:
    :param divisor:
    :param min_value:
    :return:
    """
    if min_value is None:
        min_value = divisor
    new_v = max(min_value, int(v + divisor / 2) // divisor * divisor)
    # Make sure that round down does not go down by more than 10%.
    if new_v < 0.9 * v:
        new_v += divisor
    return new_v


class ConvBNReLU(nn.Sequential):
    def __init__(self, in_planes, out_planes, kernel_size=3, stride=1, groups=1):
        padding = (kernel_size - 1) // 2
        super(ConvBNReLU, self).__init__(
            nn.Conv2d(in_planes, out_planes, kernel_size, stride, padding, groups=groups, bias=False),
            nn.BatchNorm2d(out_planes),
            nn.ReLU6(inplace=True)
        )


class InvertedResidual(nn.Module):
    def __init__(self, inp, oup, stride, expand_ratio):
        super(InvertedResidual, self).__init__()
        self.stride = stride
        assert stride in [1, 2]

        hidden_dim = int(round(inp * expand_ratio))
        self.use_res_connect = self.stride == 1 and inp == oup

        layers = []
        if expand_ratio != 1:
            # pw
            layers.append(ConvBNReLU(inp, hidden_dim, kernel_size=1))
        layers.extend([
            # dw
            ConvBNReLU(hidden_dim, hidden_dim, stride=stride, groups=hidden_dim),
            # pw-linear
            nn.Conv2d(hidden_dim, oup, 1, 1, 0, bias=False),
            nn.BatchNorm2d(oup),
        ])
        self.conv = nn.Sequential(*layers)

    def forward(self, x):
        if self.use_res_connect:
            return x + self.conv(x)
        else:
            return self.conv(x)


class MobileNetV2(nn.Module):
    def __init__(self, num_classes=1000, width_mult=1.0, inverted_residual_setting=None, round_nearest=8):
        """
        MobileNet V2 main class
        Args:
            num_classes (int): Number of classes
            width_mult (float): Width multiplier - adjusts number of channels in each layer by this amount
            inverted_residual_setting: Network structure
            round_nearest (int): Round the number of channels in each layer to be a multiple of this number
            Set to 1 to turn off rounding
        """
        super(MobileNetV2, self).__init__()
        block = InvertedResidual
        input_channel = 64
        last_channel = 1280  # from the reference torchvision implementation; unused by this detector

        if inverted_residual_setting is None:
            inverted_residual_setting = [
                # t, c, n, s
                # [1, 16, 1, 1],
                [1, 24, 1, 1],
                [6, 24, 1, 1],
                [6, 32, 3, 2],
                [6, 64, 4, 2],
                [6, 96, 3, 2],
                [6, 160, 2, 2],
                [6, 320, 1, 2],
            ]

        if len(inverted_residual_setting) == 0 or len(inverted_residual_setting[0]) != 4:
            raise ValueError("inverted_residual_setting should be non-empty "
                             "or a 4-element list, got {}".format(inverted_residual_setting))

        # building first layer
        input_channel = _make_divisible(input_channel * width_mult, round_nearest)
        self.last_channel = _make_divisible(last_channel * max(1.0, width_mult), round_nearest)
        self.layers = []
        # building inverted residual blocks
        for t, c, n, s in inverted_residual_setting:
            output_channel = _make_divisible(c * width_mult, round_nearest)
            for i in range(n):
                stride = s if i == 0 else 1
                self.layers.append(block(input_channel, output_channel, stride, expand_ratio=t))
                input_channel = output_channel

        # weight initialization
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight, mode='fan_out')
                if m.bias is not None:
                    nn.init.zeros_(m.bias)
            elif isinstance(m, nn.BatchNorm2d):
                nn.init.ones_(m.weight)
                nn.init.zeros_(m.bias)
            elif isinstance(m, nn.Linear):
                nn.init.normal_(m.weight, 0, 0.01)
                nn.init.zeros_(m.bias)


def multibox(mobilenet, num_classes):
    loc_layers = []
    conf_layers = []

    loc_layers += [nn.Conv2d(24, 4, kernel_size=5, padding=2)]
    conf_layers += [nn.Conv2d(24, 3 + (num_classes - 1), kernel_size=5, padding=2)]

    loc_layers += [nn.Conv2d(32, 4, kernel_size=5, padding=2)]
    conf_layers += [nn.Conv2d(32, num_classes, kernel_size=5, padding=2)]

    loc_layers += [nn.Conv2d(64, 4, kernel_size=5, padding=2)]
    conf_layers += [nn.Conv2d(64, num_classes, kernel_size=5, padding=2)]

    loc_layers += [nn.Conv2d(96, 4, kernel_size=1, padding=0)]
    conf_layers += [nn.Conv2d(96, num_classes, kernel_size=1, padding=0)]

    loc_layers += [nn.Conv2d(160, 4, kernel_size=1, padding=0)]
    conf_layers += [nn.Conv2d(160, num_classes, kernel_size=1, padding=0)]

    loc_layers += [nn.Conv2d(320, 4, kernel_size=3, padding=1)]
    conf_layers += [nn.Conv2d(320, num_classes, kernel_size=3, padding=1)]

    return mobilenet, (loc_layers, conf_layers)


def build_s3fd(phase, num_classes=2):
    base_, head_ = multibox(
        MobileNetV2().layers, num_classes)

    return S3FD(phase, base_, head_, num_classes)


if __name__ == '__main__':
    net = build_s3fd('train', num_classes=2)
    inputs = Variable(torch.randn(4, 3, 640, 640))
    output = net(inputs)
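The unfold → RNNPool → fold front end above replaces the usual early convolution stack by summarizing overlapping 8x8 patches. The shape walk-through below runs without the EdgeML package by swapping in a hypothetical `fake_rnnpool`; the real `RNNPool` produces a learned feature vector per patch:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

B, C, H, W = 2, 3, 640, 640
x = torch.randn(B, C, H, W)

unfold = nn.Unfold(kernel_size=(8, 8), stride=(4, 4))
patches = unfold(x)                                   # (B, C*8*8, L), L patch positions
patches = torch.cat(torch.unbind(patches, dim=2), dim=0)
patches = patches.reshape(-1, C, 8, 8)                # (B*L, C, 8, 8): one tensor per patch

out_x = (H - 8) // 4 + 1                              # 159 patch positions per side
out_y = (W - 8) // 4 + 1

def fake_rnnpool(p, n):                               # stand-in: flatten each patch, keep 64 features
    return p.reshape(n, -1)[:, :64]

rnnX = fake_rnnpool(patches, B * out_x * out_y)       # (B*L, 64)
y = torch.stack(torch.split(rnnX, B, dim=0), dim=2)   # (B, 64, L)
y = F.fold(y, kernel_size=(1, 1), output_size=(out_x, out_y))  # (B, 64, 159, 159)
y = F.pad(y, (0, 1, 0, 1), mode='replicate')          # (B, 64, 160, 160): a clean /4 grid
```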
@ -0,0 +1,348 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT license.

from __future__ import division
from __future__ import absolute_import
from __future__ import print_function

import os
import torch
import torch.nn as nn
import torch.nn.init as init
import torch.nn.functional as F
from torch.autograd import Variable  # needed by the __main__ demo below

from layers import *
from data.config_qvga import cfg
import numpy as np

from edgeml_pytorch.graph.rnnpool import *


class S3FD(nn.Module):
    """Single Shot Multibox Architecture
    The network is composed of a base network followed by the
    added multibox conv layers. Each multibox layer branches into
        1) conv2d for class conf scores
        2) conv2d for localization predictions
        3) associated priorbox layer to produce default bounding
           boxes specific to the layer's feature map size.
    See: https://arxiv.org/pdf/1512.02325.pdf for more details.
    Args:
        phase: (string) Can be "test" or "train"
        size: input image size
        base: base network layers for input
        extras: extra layers that feed to multibox loc and conf layers
        head: "multibox head" consists of loc and conf conv layers
    """

    def __init__(self, phase, base, head, num_classes):
        super(S3FD, self).__init__()
        self.phase = phase
        self.num_classes = num_classes
        '''
        self.priorbox = PriorBox(size,cfg)
        self.priors = Variable(self.priorbox.forward(), volatile=True)
        '''
        # SSD network
        self.conv = ConvBNReLU(1, 4, stride=2)

        self.unfold = nn.Unfold(kernel_size=(8, 8), stride=(4, 4))

        self.rnn_model = RNNPool(8, 8, 16, 16, 4)  # num_init_features

        self.mob = nn.ModuleList(base)
        # Layer learns to scale the l2 normalized features from conv4_3
        self.L2Norm3_3 = L2Norm(32, 10)
        self.L2Norm4_3 = L2Norm(32, 8)
        self.L2Norm5_3 = L2Norm(96, 5)

        self.loc = nn.ModuleList(head[0])
        self.conf = nn.ModuleList(head[1])

        if self.phase == 'test':
            self.softmax = nn.Softmax(dim=-1)

    def forward(self, x):
        """Applies network layers and ops on input image(s) x.
        Args:
            x: input image or batch of images. Shape: [batch,3,300,300].
        Return:
            Depending on phase:
            test:
                Variable(tensor) of output class label predictions,
                confidence score, and corresponding location predictions for
                each object detected. Shape: [batch,topk,7]
            train:
                list of concat outputs from:
                    1: confidence layers, Shape: [batch*num_priors,num_classes]
                    2: localization layers, Shape: [batch,num_priors*4]
                    3: priorbox layers, Shape: [2,num_priors*4]
        """
        size = x.size()[2:]
        batch_size = x.shape[0]
        sources = list()
        loc = list()
        conf = list()

        x = self.conv(x)

        patches = self.unfold(x)
        patches = torch.cat(torch.unbind(patches, dim=2), dim=0)
        patches = torch.reshape(patches, (-1, 4, 8, 8))

        output_x = int((x.shape[2] - 8) / 4 + 1)
        output_y = int((x.shape[3] - 8) / 4 + 1)

        rnnX = self.rnn_model(patches, int(batch_size) * output_x * output_y)

        x = torch.stack(torch.split(rnnX, split_size_or_sections=int(batch_size), dim=0), dim=2)

        x = F.fold(x, kernel_size=(1, 1), output_size=(output_x, output_y))

        x = F.pad(x, (0, 1, 0, 1), mode='replicate')

        for k in range(4):
            x = self.mob[k](x)

        s = self.L2Norm3_3(x)
        sources.append(s)

        for k in range(4, 8):
            x = self.mob[k](x)

        s = self.L2Norm4_3(x)
        sources.append(s)

        for k in range(8, 11):
            x = self.mob[k](x)

        s = self.L2Norm5_3(x)
        sources.append(s)

        for k in range(11, 14):
            x = self.mob[k](x)
        sources.append(x)

        # apply multibox head to source layers

        loc_x = self.loc[0](sources[0])
        conf_x = self.conf[0](sources[0])

        max_conf, _ = torch.max(conf_x[:, 0:3, :, :], dim=1, keepdim=True)
        conf_x = torch.cat((max_conf, conf_x[:, 3:, :, :]), dim=1)

        loc.append(loc_x.permute(0, 2, 3, 1).contiguous())
        conf.append(conf_x.permute(0, 2, 3, 1).contiguous())

        for i in range(1, len(sources)):
            x = sources[i]
            conf.append(self.conf[i](x).permute(0, 2, 3, 1).contiguous())
            loc.append(self.loc[i](x).permute(0, 2, 3, 1).contiguous())

        features_maps = []
        for i in range(len(loc)):
            feat = []
            feat += [loc[i].size(1), loc[i].size(2)]
            features_maps += [feat]

        self.priorbox = PriorBox(size, features_maps, cfg)

        self.priors = self.priorbox.forward()

        loc = torch.cat([o.view(o.size(0), -1) for o in loc], 1)
        conf = torch.cat([o.view(o.size(0), -1) for o in conf], 1)

        if self.phase == 'test':
            output = detect_function(cfg,
                                     loc.view(loc.size(0), -1, 4),              # loc preds
                                     self.softmax(conf.view(conf.size(0), -1,
                                                            self.num_classes)), # conf preds
                                     self.priors.type(type(x.data))             # default boxes
                                     )

        else:
            output = (
                loc.view(loc.size(0), -1, 4),
                conf.view(conf.size(0), -1, self.num_classes),
                self.priors
            )
        return output

    def load_weights(self, base_file):
        other, ext = os.path.splitext(base_file)
        if ext in ('.pkl', '.pth'):  # original `ext == '.pkl' or '.pth'` was always true
            print('Loading weights into state dict...')
            mdata = torch.load(base_file,
                               map_location=lambda storage, loc: storage)
            weights = mdata['weight']
            epoch = mdata['epoch']
            self.load_state_dict(weights)
            print('Finished!')
            return epoch
        print('Sorry only .pth and .pkl files supported.')
        return None  # original returned an undefined `epoch` on this path

    def xavier(self, param):
        init.xavier_uniform_(param)  # in-place variant; `xavier_uniform` is deprecated

    def weights_init(self, m):
        if isinstance(m, nn.Conv2d):
            self.xavier(m.weight.data)
            m.bias.data.zero_()


def _make_divisible(v, divisor, min_value=None):
    """
    This function is taken from the original tf repo.
    It ensures that all layers have a channel number that is divisible by 8
    It can be seen here:
    https://github.com/tensorflow/models/blob/master/research/slim/nets/mobilenet/mobilenet.py
    :param v:
    :param divisor:
    :param min_value:
    :return:
    """
    if min_value is None:
        min_value = divisor
    new_v = max(min_value, int(v + divisor / 2) // divisor * divisor)
    # Make sure that round down does not go down by more than 10%.
    if new_v < 0.9 * v:
        new_v += divisor
    return new_v


class ConvBNReLU(nn.Sequential):
    def __init__(self, in_planes, out_planes, kernel_size=3, stride=1, groups=1):
        padding = (kernel_size - 1) // 2
        super(ConvBNReLU, self).__init__(
            nn.Conv2d(in_planes, out_planes, kernel_size, stride, padding, groups=groups, bias=False),
            nn.BatchNorm2d(out_planes),
            nn.ReLU6(inplace=True)
        )


class InvertedResidual(nn.Module):
    def __init__(self, inp, oup, stride, expand_ratio):
        super(InvertedResidual, self).__init__()
        self.stride = stride
        assert stride in [1, 2]

        hidden_dim = int(round(inp * expand_ratio))
        self.use_res_connect = self.stride == 1 and inp == oup

        layers = []
        if expand_ratio != 1:
            # pw
            layers.append(ConvBNReLU(inp, hidden_dim, kernel_size=1))
        layers.extend([
            # dw
            ConvBNReLU(hidden_dim, hidden_dim, stride=stride, groups=hidden_dim),
            # pw-linear
            nn.Conv2d(hidden_dim, oup, 1, 1, 0, bias=False),
            nn.BatchNorm2d(oup),
        ])
        self.conv = nn.Sequential(*layers)

    def forward(self, x):
        if self.use_res_connect:
            return x + self.conv(x)
        else:
            return self.conv(x)


class MobileNetV2(nn.Module):
    def __init__(self, num_classes=1000, width_mult=1.0, inverted_residual_setting=None, round_nearest=8):
        """
        MobileNet V2 main class
        Args:
            num_classes (int): Number of classes
            width_mult (float): Width multiplier - adjusts number of channels in each layer by this amount
            inverted_residual_setting: Network structure
            round_nearest (int): Round the number of channels in each layer to be a multiple of this number
            Set to 1 to turn off rounding
        """
        super(MobileNetV2, self).__init__()
        block = InvertedResidual
        input_channel = 64
        last_channel = 1280  # from the reference torchvision implementation; unused by this detector

        if inverted_residual_setting is None:
            inverted_residual_setting = [
                # t, c, n, s
                [2, 32, 4, 1],
                [2, 32, 4, 1],
                [2, 96, 3, 2],
                [2, 128, 3, 1],
            ]

        # only check the first element, assuming user knows t,c,n,s are required
        if len(inverted_residual_setting) == 0 or len(inverted_residual_setting[0]) != 4:
            raise ValueError("inverted_residual_setting should be non-empty "
                             "or a 4-element list, got {}".format(inverted_residual_setting))

        # building first layer
        input_channel = _make_divisible(input_channel * width_mult, round_nearest)
        self.last_channel = _make_divisible(last_channel * max(1.0, width_mult), round_nearest)
        self.layers = []
        # building inverted residual blocks
        for t, c, n, s in inverted_residual_setting:
            output_channel = _make_divisible(c * width_mult, round_nearest)
            for i in range(n):
                stride = s if i == 0 else 1
                self.layers.append(block(input_channel, output_channel, stride, expand_ratio=t))
                input_channel = output_channel

        # weight initialization
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight, mode='fan_out')
                if m.bias is not None:
                    nn.init.zeros_(m.bias)
            elif isinstance(m, nn.BatchNorm2d):
                nn.init.ones_(m.weight)
                nn.init.zeros_(m.bias)
            elif isinstance(m, nn.Linear):
                nn.init.normal_(m.weight, 0, 0.01)
                nn.init.zeros_(m.bias)


def multibox(mobilenet, num_classes):
    loc_layers = []
    conf_layers = []

    loc_layers += [nn.Conv2d(32, 4, kernel_size=3, padding=1)]
    conf_layers += [nn.Conv2d(32, 3 + (num_classes - 1), kernel_size=3, padding=1)]

    loc_layers += [nn.Conv2d(32, 4, kernel_size=3, padding=1)]
    conf_layers += [nn.Conv2d(32, num_classes, kernel_size=3, padding=1)]

    loc_layers += [nn.Conv2d(96, 4, kernel_size=3, padding=1)]
    conf_layers += [nn.Conv2d(96, num_classes, kernel_size=3, padding=1)]

    loc_layers += [nn.Conv2d(128, 4, kernel_size=3, padding=1)]
    conf_layers += [nn.Conv2d(128, num_classes, kernel_size=3, padding=1)]

    return mobilenet, (loc_layers, conf_layers)


def build_s3fd(phase, num_classes=2):
    base_, head_ = multibox(
        MobileNetV2().layers, num_classes)

    return S3FD(phase, base_, head_, num_classes)


if __name__ == '__main__':
    net = build_s3fd('train', num_classes=2)
    inputs = Variable(torch.randn(4, 1, 320, 320))
    output = net(inputs)
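Worked values for `_make_divisible`, which rounds a requested channel count to a multiple of `divisor` while never dropping more than 10% below the request:

```python
# _make_divisible(24, 8) -> 24  (already a multiple of 8)
# _make_divisible(30, 8) -> 32  (nearest multiple of 8)
# _make_divisible(28, 8) -> 32  (int(28 + 4) // 8 * 8 = 32)
assert max(8, int(28 + 8 / 2) // 8 * 8) == 32
```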
@ -0,0 +1,375 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT license.

from __future__ import division
from __future__ import absolute_import
from __future__ import print_function

import os
import torch
import torch.nn as nn
import torch.nn.init as init
import torch.nn.functional as F
from torch.autograd import Variable  # needed by the __main__ demo below

from layers import *
from data.config import cfg
import numpy as np

from edgeml_pytorch.graph.rnnpool import *


class S3FD(nn.Module):
    """Single Shot Multibox Architecture
    The network is composed of a base network followed by the
    added multibox conv layers. Each multibox layer branches into
        1) conv2d for class conf scores
        2) conv2d for localization predictions
        3) associated priorbox layer to produce default bounding
           boxes specific to the layer's feature map size.
    See: https://arxiv.org/pdf/1512.02325.pdf for more details.
    Args:
        phase: (string) Can be "test" or "train"
        size: input image size
        base: base network layers for input
        extras: extra layers that feed to multibox loc and conf layers
        head: "multibox head" consists of loc and conf conv layers
    """

    def __init__(self, phase, base, head, num_classes):
        super(S3FD, self).__init__()
        self.phase = phase
        self.num_classes = num_classes
        '''
        self.priorbox = PriorBox(size,cfg)
        self.priors = Variable(self.priorbox.forward(), volatile=True)
        '''
        # SSD network

        self.conv_top = nn.Sequential(ConvBNReLU(3, 4, kernel_size=3, stride=2), ConvBNReLU(4, 4, kernel_size=3))

        self.unfold = nn.Unfold(kernel_size=(8, 8), stride=(4, 4))

        self.rnn_model = RNNPool(8, 8, 8, 8, 4)

        self.mob = nn.ModuleList(base)
        # Layer learns to scale the l2 normalized features from conv4_3
        self.L2Norm3_3 = L2Norm(4, 10)
        self.L2Norm4_3 = L2Norm(16, 8)
        self.L2Norm5_3 = L2Norm(24, 5)

        self.loc = nn.ModuleList(head[0])
        self.conf = nn.ModuleList(head[1])

        if self.phase == 'test':
            self.softmax = nn.Softmax(dim=-1)

    def forward(self, x):
        """Applies network layers and ops on input image(s) x.
        Args:
            x: input image or batch of images. Shape: [batch,3,300,300].
        Return:
            Depending on phase:
            test:
                Variable(tensor) of output class label predictions,
                confidence score, and corresponding location predictions for
                each object detected. Shape: [batch,topk,7]
            train:
                list of concat outputs from:
                    1: confidence layers, Shape: [batch*num_priors,num_classes]
                    2: localization layers, Shape: [batch,num_priors*4]
                    3: priorbox layers, Shape: [2,num_priors*4]
        """
        size = x.size()[2:]
        batch_size = x.shape[0]
        sources = list()
        loc = list()
        conf = list()

        x = self.conv_top(x)

        s = self.L2Norm3_3(x)
        sources.append(s)

        patches = self.unfold(x)
        patches = torch.cat(torch.unbind(patches, dim=2), dim=0)
        patches = torch.reshape(patches, (-1, 4, 8, 8))

        output_x = int((x.shape[2] - 8) / 4 + 1)
        output_y = int((x.shape[3] - 8) / 4 + 1)

        rnnX = self.rnn_model(patches, int(batch_size) * output_x * output_y)

        x = torch.stack(torch.split(rnnX, split_size_or_sections=int(batch_size), dim=0), dim=2)

        x = F.fold(x, kernel_size=(1, 1), output_size=(output_x, output_y))

        x = F.pad(x, (0, 1, 0, 1), mode='replicate')

        for k in range(4):
            x = self.mob[k](x)

        s = self.L2Norm4_3(x)
        sources.append(s)

        for k in range(4, 8):
            x = self.mob[k](x)

        s = self.L2Norm5_3(x)
        sources.append(s)

        for k in range(8, 10):
            x = self.mob[k](x)
        sources.append(x)

        for k in range(10, 11):
            x = self.mob[k](x)
        sources.append(x)

        for k in range(11, 12):
            x = self.mob[k](x)
        sources.append(x)

        # apply multibox head to source layers

        loc_x = self.loc[0](sources[0])
        conf_x = self.conf[0](sources[0])

        # the first head is a two-stage stack; see the note in multibox() below
        loc_x = self.loc[1](loc_x)
        conf_x = self.conf[1](conf_x)

        max_conf, _ = torch.max(conf_x[:, 0:3, :, :], dim=1, keepdim=True)
        conf_x = torch.cat((max_conf, conf_x[:, 3:, :, :]), dim=1)

        loc.append(loc_x.permute(0, 2, 3, 1).contiguous())
        conf.append(conf_x.permute(0, 2, 3, 1).contiguous())

        for i in range(1, len(sources)):
            x = sources[i]
            conf.append(self.conf[i + 1](x).permute(0, 2, 3, 1).contiguous())
            loc.append(self.loc[i + 1](x).permute(0, 2, 3, 1).contiguous())

        features_maps = []
        for i in range(len(loc)):
            feat = []
            feat += [loc[i].size(1), loc[i].size(2)]
            features_maps += [feat]

        self.priorbox = PriorBox(size, features_maps, cfg)
        self.priors = self.priorbox.forward()

        loc = torch.cat([o.view(o.size(0), -1) for o in loc], 1)
        conf = torch.cat([o.view(o.size(0), -1) for o in conf], 1)

        if self.phase == 'test':
            output = detect_function(cfg,
                                     loc.view(loc.size(0), -1, 4),              # loc preds
                                     self.softmax(conf.view(conf.size(0), -1,
                                                            self.num_classes)), # conf preds
                                     self.priors.type(type(x.data))             # default boxes
                                     )

        else:
            output = (
                loc.view(loc.size(0), -1, 4),
                conf.view(conf.size(0), -1, self.num_classes),
                self.priors
            )
        return output

    def load_weights(self, base_file):
        other, ext = os.path.splitext(base_file)
        if ext in ('.pkl', '.pth'):  # original `ext == '.pkl' or '.pth'` was always true
            print('Loading weights into state dict...')
            mdata = torch.load(base_file,
                               map_location=lambda storage, loc: storage)
            weights = mdata['weight']
            epoch = mdata['epoch']
            self.load_state_dict(weights)
            print('Finished!')
            return epoch
        print('Sorry only .pth and .pkl files supported.')
        return None  # original returned an undefined `epoch` on this path

    def xavier(self, param):
        init.xavier_uniform_(param)  # in-place variant; `xavier_uniform` is deprecated

    def weights_init(self, m):
        if isinstance(m, nn.Conv2d):
            self.xavier(m.weight.data)
            m.bias.data.zero_()


def _make_divisible(v, divisor, min_value=None):
    """
    This function is taken from the original tf repo.
    It ensures that all layers have a channel number that is divisible by 8
    It can be seen here:
    https://github.com/tensorflow/models/blob/master/research/slim/nets/mobilenet/mobilenet.py
    :param v:
    :param divisor:
    :param min_value:
    :return:
    """
    if min_value is None:
        min_value = divisor
    new_v = max(min_value, int(v + divisor / 2) // divisor * divisor)
    # Make sure that round down does not go down by more than 10%.
    if new_v < 0.9 * v:
        new_v += divisor
    return new_v


class ConvBNReLU(nn.Sequential):
    def __init__(self, in_planes, out_planes, kernel_size=3, stride=1, groups=1):
        padding = (kernel_size - 1) // 2
        super(ConvBNReLU, self).__init__(
            nn.Conv2d(in_planes, out_planes, kernel_size, stride, padding, groups=groups, bias=False),
            nn.BatchNorm2d(out_planes),
            nn.ReLU6(inplace=True)
        )


class InvertedResidual(nn.Module):
    def __init__(self, inp, oup, stride, expand_ratio):
        super(InvertedResidual, self).__init__()
        self.stride = stride
        assert stride in [1, 2]

        hidden_dim = int(round(inp * expand_ratio))
        self.use_res_connect = self.stride == 1 and inp == oup

        layers = []
        if expand_ratio != 1:
            # pw
            layers.append(ConvBNReLU(inp, hidden_dim, kernel_size=1))
        layers.extend([
            # dw
            ConvBNReLU(hidden_dim, hidden_dim, stride=stride, groups=hidden_dim),
            # pw-linear
            nn.Conv2d(hidden_dim, oup, 1, 1, 0, bias=False),
            nn.BatchNorm2d(oup),
        ])
        self.conv = nn.Sequential(*layers)

    def forward(self, x):
        if self.use_res_connect:
            return x + self.conv(x)
        else:
            return self.conv(x)


class MobileNetV2(nn.Module):
    def __init__(self, num_classes=1000, width_mult=1.0, inverted_residual_setting=None, round_nearest=8):
        """
        MobileNet V2 main class
        Args:
            num_classes (int): Number of classes
            width_mult (float): Width multiplier - adjusts number of channels in each layer by this amount
            inverted_residual_setting: Network structure
            round_nearest (int): Round the number of channels in each layer to be a multiple of this number
            Set to 1 to turn off rounding
        """
        super(MobileNetV2, self).__init__()
        block = InvertedResidual
        input_channel = 32
        last_channel = 1280  # from the reference torchvision implementation; unused by this detector

        if inverted_residual_setting is None:
            inverted_residual_setting = [
                # t, c, n, s
                [2, 16, 4, 1],
                [2, 24, 4, 2],
                [2, 32, 2, 2],
                [2, 64, 1, 2],
                [2, 96, 1, 2],
            ]

        if len(inverted_residual_setting) == 0 or len(inverted_residual_setting[0]) != 4:
            raise ValueError("inverted_residual_setting should be non-empty "
                             "or a 4-element list, got {}".format(inverted_residual_setting))

        # building first layer
        input_channel = _make_divisible(input_channel * width_mult, round_nearest)
        self.last_channel = _make_divisible(last_channel * max(1.0, width_mult), round_nearest)
        self.layers = []
        # building inverted residual blocks
        for t, c, n, s in inverted_residual_setting:
            output_channel = _make_divisible(c * width_mult, round_nearest)
            for i in range(n):
                stride = s if i == 0 else 1
                self.layers.append(block(input_channel, output_channel, stride, expand_ratio=t))
                input_channel = output_channel

        # weight initialization
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight, mode='fan_out')
                if m.bias is not None:
                    nn.init.zeros_(m.bias)
            elif isinstance(m, nn.BatchNorm2d):
                nn.init.ones_(m.weight)
                nn.init.zeros_(m.bias)
            elif isinstance(m, nn.Linear):
                nn.init.normal_(m.weight, 0, 0.01)
                nn.init.zeros_(m.bias)


def multibox(mobilenet, num_classes):
    loc_layers = []
    conf_layers = []

    # Note: `list += nn.Sequential(...)` extends the list with the
    # Sequential's two children, so forward() applies them in turn as
    # self.loc[0] then self.loc[1] (likewise for conf).
    loc_layers += nn.Sequential(ConvBNReLU(4, 8, kernel_size=3, stride=2),
                                nn.Conv2d(8, 4, kernel_size=3, padding=1))
    conf_layers += nn.Sequential(ConvBNReLU(4, 8, kernel_size=3, stride=2),
                                 nn.Conv2d(8, 3 + (num_classes - 1), kernel_size=3, padding=1))

    loc_layers += [nn.Conv2d(16, 4, kernel_size=3, padding=1)]
    conf_layers += [nn.Conv2d(16, num_classes, kernel_size=3, padding=1)]

    loc_layers += [nn.Conv2d(24, 4, kernel_size=3, padding=1)]
    conf_layers += [nn.Conv2d(24, num_classes, kernel_size=3, padding=1)]

    loc_layers += [nn.Conv2d(32, 4, kernel_size=3, padding=1)]
    conf_layers += [nn.Conv2d(32, num_classes, kernel_size=3, padding=1)]

    loc_layers += [nn.Conv2d(64, 4, kernel_size=3, padding=1)]
    conf_layers += [nn.Conv2d(64, num_classes, kernel_size=3, padding=1)]

    loc_layers += [nn.Conv2d(96, 4, kernel_size=3, padding=1)]
    conf_layers += [nn.Conv2d(96, num_classes, kernel_size=3, padding=1)]

    return mobilenet, (loc_layers, conf_layers)


def build_s3fd(phase, num_classes=2):
    base_, head_ = multibox(
        MobileNetV2().layers, num_classes)

    return S3FD(phase, base_, head_, num_classes)


if __name__ == '__main__':
    net = build_s3fd('train', num_classes=2)
    inputs = Variable(torch.randn(4, 3, 640, 640))
    output = net(inputs)
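A note on the head flattening shared by all three model files: `permute(0, 2, 3, 1)` moves the per-anchor channels last before the `view`, so predictions are interleaved location by location and the later `view(N, -1, 4)` recovers one regression row per prior:

```python
import torch

N, H, W = 1, 2, 2
loc_x = torch.arange(N * 4 * H * W, dtype=torch.float32).view(N, 4, H, W)
flat = loc_x.permute(0, 2, 3, 1).contiguous().view(N, -1)
per_prior = flat.view(N, -1, 4)   # (1, 4, 4): one (dx, dy, dw, dh) row per cell
print(per_prior[0, 0])            # the 4 regression values for the first cell
```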
@ -0,0 +1,6 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT license.

from __future__ import absolute_import
from .RPool_Face_C import *
from .RPool_Face_Quant import *
@ -0,0 +1,86 @@
## This code is built on https://github.com/yxlijun/S3FD.pytorch
#-*- coding:utf-8 -*-

from __future__ import division
from __future__ import absolute_import
from __future__ import print_function


import os
from data.config import cfg
import cv2

WIDER_ROOT = os.path.join(cfg.HOME, 'WIDER_FACE')
train_list_file = os.path.join(WIDER_ROOT, 'wider_face_split',
                               'wider_face_train_bbx_gt.txt')
val_list_file = os.path.join(WIDER_ROOT, 'wider_face_split',
                             'wider_face_val_bbx_gt.txt')

WIDER_TRAIN = os.path.join(WIDER_ROOT, 'WIDER_train', 'images')
WIDER_VAL = os.path.join(WIDER_ROOT, 'WIDER_val', 'images')


def parse_wider_file(root, file):
    with open(file, 'r') as fr:
        lines = fr.readlines()
    face_count = []
    img_paths = []
    face_loc = []
    img_faces = []
    count = 0
    flag = False
    for k, line in enumerate(lines):
        line = line.strip().strip('\n')
        if count > 0:
            line = line.split(' ')
            count -= 1
            loc = [int(line[0]), int(line[1]), int(line[2]), int(line[3])]
            face_loc += [loc]
        if flag:
            face_count += [int(line)]
            flag = False
            count = int(line)
        if 'jpg' in line:
            img_paths += [os.path.join(root, line)]
            flag = True

    total_face = 0
    for k in face_count:
        face_ = []
        for x in range(total_face, total_face + k):
            face_.append(face_loc[x])
        img_faces += [face_]
        total_face += k
    return img_paths, img_faces


def wider_data_file():
    img_paths, bbox = parse_wider_file(WIDER_TRAIN, train_list_file)
    fw = open(cfg.FACE.TRAIN_FILE, 'w')
    for index in range(len(img_paths)):
        path = img_paths[index]
        boxes = bbox[index]
        fw.write(path)
        fw.write(' {}'.format(len(boxes)))
        for box in boxes:
            data = ' {} {} {} {} {}'.format(box[0], box[1], box[2], box[3], 1)
            fw.write(data)
        fw.write('\n')
    fw.close()

    img_paths, bbox = parse_wider_file(WIDER_VAL, val_list_file)
    fw = open(cfg.FACE.VAL_FILE, 'w')
    for index in range(len(img_paths)):
        path = img_paths[index]
        boxes = bbox[index]
        fw.write(path)
        fw.write(' {}'.format(len(boxes)))
        for box in boxes:
            data = ' {} {} {} {} {}'.format(box[0], box[1], box[2], box[3], 1)
            fw.write(data)
        fw.write('\n')
    fw.close()


if __name__ == '__main__':
    wider_data_file()
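For orientation, the WIDER ground-truth list consumed by `parse_wider_file` looks roughly like the sample below (a path line, a face-count line, then one annotation line per face, of which only the first four integers are kept), and `wider_data_file` flattens it to one line per image. The sample record is illustrative:

```python
# Input (wider_face_train_bbx_gt.txt), roughly:
#
#   0--Parade/0_Parade_marchingband_1_849.jpg
#   1
#   449 330 122 149 0 0 0 0 0 0
#
# Output written by wider_data_file(), one line per image:
#
#   <image path> <num boxes> x y w h 1 [x y w h 1 ...]
```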
@ -0,0 +1,10 @@
Cython==0.29.15
easydict==1.9
importlib-metadata==1.5.0
matplotlib==3.2.1
opencv-python-headless==4.2.0.32
PyYAML==3.12
scikit-image==0.15.0
tensorboard==1.14.0
tensorboardX==1.9
tqdm==4.36.1
@ -0,0 +1,244 @@
|
||||||
|
# Copyright (c) Microsoft Corporation. All rights reserved.
|
||||||
|
# Licensed under the MIT license.
|
||||||
|
|
||||||
|
from data import *
|
||||||
|
from layers.modules import MultiBoxLoss
|
||||||
|
import os
|
||||||
|
import time
|
||||||
|
import torch
|
||||||
|
import torch.nn as nn
|
||||||
|
import torch.optim as optim
|
||||||
|
import torch.nn.init as init
|
||||||
|
import torch.utils.data as data
|
||||||
|
import numpy as np
|
||||||
|
import argparse
|
||||||
|
import torch.backends.cudnn as cudnn
|
||||||
|
|
||||||
|
from data.choose_config import cfg
|
||||||
|
cfg = cfg.cfg
|
||||||
|
from importlib import import_module
|
||||||
|
|
||||||
|
|
||||||
|
def str2bool(v):
|
||||||
|
return v.lower() in ("yes", "true", "t", "1")
|
||||||
|
|
||||||
|
parser = argparse.ArgumentParser(
|
||||||
|
description='S3FD face Detector Training With Pytorch')
|
||||||
|
train_set = parser.add_mutually_exclusive_group()
|
||||||
|
parser.add_argument('--dataset',
|
||||||
|
default='face',
|
||||||
|
choices=['hand', 'face', 'head'],
|
||||||
|
help='Train target')
|
||||||
|
parser.add_argument('--basenet',
|
||||||
|
default='vgg16_reducedfc.pth',
|
||||||
|
help='Pretrained base model')
|
||||||
|
parser.add_argument('--batch_size',
|
||||||
|
default=16, type=int,
|
||||||
|
help='Batch size for training')
|
||||||
|
parser.add_argument('--resume',
|
||||||
|
default=None, type=str,
|
||||||
|
help='Checkpoint state_dict file to resume training from')
|
||||||
|
parser.add_argument('--model_arch',
|
||||||
|
default='RPool_Face_C', type=str,
|
||||||
|
choices=['RPool_Face_C', 'RPool_Face_Quant', 'RPool_Face_QVGA_monochrome'],
|
||||||
|
help='choose architecture among rpool variants')
|
||||||
|
parser.add_argument('--num_workers',
|
||||||
|
default=128, type=int,
|
||||||
|
help='Number of workers used in dataloading')
|
||||||
|
parser.add_argument('--cuda',
|
||||||
|
default=True, type=str2bool,
|
||||||
|
help='Use CUDA to train model')
|
||||||
|
parser.add_argument('--lr', '--learning-rate',
|
||||||
|
default=1e-2, type=float,
|
||||||
|
help='initial learning rate')
|
||||||
|
parser.add_argument('--momentum',
|
||||||
|
default=0.9, type=float,
|
||||||
|
help='Momentum value for optim')
|
||||||
|
parser.add_argument('--weight_decay',
|
||||||
|
default=5e-4, type=float,
|
||||||
|
help='Weight decay for SGD')
|
||||||
|
parser.add_argument('--gamma',
|
||||||
|
default=0.1, type=float,
|
||||||
|
help='Gamma update for SGD')
|
||||||
|
parser.add_argument('--multigpu',
|
||||||
|
default=False, type=str2bool,
|
||||||
|
help='Use mutil Gpu training')
|
||||||
|
parser.add_argument('--save_folder',
|
||||||
|
default='weights/',
|
||||||
|
help='Directory for saving checkpoint models')
|
||||||
|
parser.add_argument('--epochs',
|
||||||
|
default=300, type=int,
|
||||||
|
help='total epochs')
|
||||||
|
parser.add_argument('--save_frequency',
|
||||||
|
default=5000, type=int,
|
||||||
|
help='iterations interval after which checkpoint is saved')
|
||||||
|
args = parser.parse_args()
|
||||||
|
|
||||||
|
|
||||||
|
if torch.cuda.is_available():
|
||||||
|
if args.cuda:
|
||||||
|
torch.set_default_tensor_type('torch.cuda.FloatTensor')
|
||||||
|
if not args.cuda:
|
||||||
|
print("WARNING: It looks like you have a CUDA device, but aren't " +
|
||||||
|
"using CUDA.\nRun with --cuda for optimal training speed.")
|
||||||
|
torch.set_default_tensor_type('torch.FloatTensor')
|
||||||
|
else:
|
||||||
|
torch.set_default_tensor_type('torch.FloatTensor')
|
||||||
|
|
||||||
|
if not os.path.exists(args.save_folder):
|
||||||
|
os.makedirs(args.save_folder)
|
||||||
|
|
||||||
|
|
||||||
|
train_dataset = WIDERDetection(cfg.FACE.TRAIN_FILE, mode='train', mono_mode=cfg.IS_MONOCHROME)
|
||||||
|
val_dataset = WIDERDetection(cfg.FACE.VAL_FILE, mode='val', mono_mode=cfg.IS_MONOCHROME)
|
||||||
|
|
||||||
|
train_loader = data.DataLoader(train_dataset, args.batch_size,
|
||||||
|
num_workers=args.num_workers,
|
||||||
|
shuffle=True,
|
||||||
|
collate_fn=detection_collate,
|
||||||
|
pin_memory=True)
|
||||||
|
|
||||||
|
val_batchsize = args.batch_size // 2
|
||||||
|
val_loader = data.DataLoader(val_dataset, val_batchsize,
|
||||||
|
num_workers=args.num_workers,
|
||||||
|
shuffle=False,
|
||||||
|
collate_fn=detection_collate,
|
||||||
|
pin_memory=True)
|
||||||
|
|
||||||
|
min_loss = np.inf
|
||||||
|
start_epoch = 0
|
||||||
|
|
||||||
|
module = import_module('models.' + args.model_arch)
|
||||||
|
net = module.build_s3fd('train', cfg.NUM_CLASSES)
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
if args.cuda:
|
||||||
|
if args.multigpu:
|
||||||
|
net = torch.nn.DataParallel(net)
|
||||||
|
net = net.cuda()
|
||||||
|
cudnn.benckmark = True
|
||||||
|
|
||||||
|
if args.resume:
|
||||||
|
print('Resuming training, loading {}...'.format(args.resume))
|
||||||
|
net.load_state_dict(torch.load(args.resume))
|
||||||
|
|
||||||
|
optimizer = optim.SGD(net.parameters(), lr=args.lr, momentum=args.momentum,
|
||||||
|
weight_decay=args.weight_decay)
|
criterion = MultiBoxLoss(cfg, args.dataset, args.cuda)
print('Loading wider dataset...')
print('Using the specified args:')
print(args)


def train():
    step_index = 0
    iteration = 0

    for epoch in range(start_epoch, args.epochs):
        net.train()
        losses = 0
        train_loader_len = len(train_loader)
        for batch_idx, (images, targets) in enumerate(train_loader):
            adjust_learning_rate(optimizer, epoch, batch_idx, train_loader_len)

            if args.cuda:
                images = images.cuda()
                targets = [ann.cuda() for ann in targets]
            else:
                images = images
                targets = [ann for ann in targets]

            t0 = time.time()
            out = net(images)
            # backprop
            optimizer.zero_grad()
            loss_l, loss_c = criterion(out, targets)
            loss = loss_l + loss_c
            loss.backward()
            optimizer.step()
            t1 = time.time()
            losses += loss.item()

            if iteration % 10 == 0:
                tloss = losses / (batch_idx + 1)
                print('Timer: %.4f' % (t1 - t0))
                print('epoch:' + repr(epoch) + ' || iter:' +
                      repr(iteration) + ' || Loss:%.4f' % (tloss))
                print('->> conf loss:{:.4f} || loc loss:{:.4f}'.format(
                    loss_c.item(), loss_l.item()))
                print('->>lr:{:.6f}'.format(optimizer.param_groups[0]['lr']))

            if iteration != 0 and iteration % args.save_frequency == 0:
                print('Saving state, iter:', iteration)
                file = 'rpool_' + args.dataset + '_' + repr(iteration) + '_checkpoint.pth'
                torch.save(net.state_dict(),
                           os.path.join(args.save_folder, file))
            iteration += 1

        val(epoch)
        if iteration == cfg.MAX_STEPS:
            break


def val(epoch):
    net.eval()
    loc_loss = 0
    conf_loss = 0
    step = 0
    t1 = time.time()
    with torch.no_grad():
        for batch_idx, (images, targets) in enumerate(val_loader):
            if args.cuda:
                images = images.cuda()
                targets = [ann.cuda() for ann in targets]
            else:
                images = images
                targets = [ann for ann in targets]

            out = net(images)
            loss_l, loss_c = criterion(out, targets)
            loss = loss_l + loss_c
            loc_loss += loss_l.item()
            conf_loss += loss_c.item()
            step += 1

    tloss = (loc_loss + conf_loss) / step
    t2 = time.time()
    print('Timer: %.4f' % (t2 - t1))
    print('test epoch:' + repr(epoch) + ' || Loss:%.4f' % (tloss))

    global min_loss
    if tloss < min_loss:
        print('Saving best state, epoch', epoch)
        file = '{}_best_state.pth'.format(args.model_arch)
        torch.save(net.state_dict(), os.path.join(
            args.save_folder, file))
        min_loss = tloss


from math import cos, pi


def adjust_learning_rate(optimizer, epoch, iteration, num_iter):
    lr = optimizer.param_groups[0]['lr']

    warmup_epoch = 5
    warmup_iter = warmup_epoch * num_iter
    current_iter = iteration + epoch * num_iter
    max_iter = args.epochs * num_iter

    lr = args.lr * (1 + cos(pi * (current_iter - warmup_iter) / (max_iter - warmup_iter))) / 2

    if epoch < warmup_epoch:
        lr = args.lr * current_iter / warmup_iter

    for param_group in optimizer.param_groups:
        param_group['lr'] = lr
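
# NOTE (illustration only; nothing below is used by the training loop):
# adjust_learning_rate is a linear warmup over the first five epochs followed
# by cosine annealing. A standalone sketch with hypothetical values for
# base_lr, epochs and num_iter (the script itself takes these from `args`):
#
#   base_lr, epochs, num_iter, warmup_epoch = 0.01, 300, 100, 5
#   warmup_iter, max_iter = warmup_epoch * num_iter, epochs * num_iter
#
#   def lr_at(epoch, iteration):
#       current_iter = iteration + epoch * num_iter
#       if epoch < warmup_epoch:
#           return base_lr * current_iter / warmup_iter          # warmup
#       return base_lr * (1 + cos(pi * (current_iter - warmup_iter)
#                                 / (max_iter - warmup_iter))) / 2  # cosine
#
#   lr_at(0, 0)    -> 0.0     (start of warmup)
#   lr_at(5, 0)    -> 0.01    (warmup done; cosine starts at base_lr)
#   lr_at(299, 99) -> ~0.0    (annealed to zero by the final iteration)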


if __name__ == '__main__':
    train()
@ -0,0 +1,4 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT license.

from .augmentations import *
@ -0,0 +1,862 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT license.

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function


import torch
from torchvision import transforms
import cv2
import numpy as np
import types
from PIL import Image, ImageEnhance, ImageDraw
import math
import six

import sys; sys.path.append('../')
from data.choose_config import cfg
cfg = cfg.cfg
import random


class sampler():

    def __init__(self,
                 max_sample,
                 max_trial,
                 min_scale,
                 max_scale,
                 min_aspect_ratio,
                 max_aspect_ratio,
                 min_jaccard_overlap,
                 max_jaccard_overlap,
                 min_object_coverage,
                 max_object_coverage,
                 use_square=False):
        self.max_sample = max_sample
        self.max_trial = max_trial
        self.min_scale = min_scale
        self.max_scale = max_scale
        self.min_aspect_ratio = min_aspect_ratio
        self.max_aspect_ratio = max_aspect_ratio
        self.min_jaccard_overlap = min_jaccard_overlap
        self.max_jaccard_overlap = max_jaccard_overlap
        self.min_object_coverage = min_object_coverage
        self.max_object_coverage = max_object_coverage
        self.use_square = use_square


def intersect(box_a, box_b):
    max_xy = np.minimum(box_a[:, 2:], box_b[2:])
    min_xy = np.maximum(box_a[:, :2], box_b[:2])
    inter = np.clip((max_xy - min_xy), a_min=0, a_max=np.inf)
    return inter[:, 0] * inter[:, 1]


def jaccard_numpy(box_a, box_b):
    """Compute the jaccard overlap of two sets of boxes. The jaccard overlap
    is simply the intersection over union of two boxes.
    E.g.:
        A ∩ B / A ∪ B = A ∩ B / (area(A) + area(B) - A ∩ B)
    Args:
        box_a: Multiple bounding boxes, Shape: [num_boxes, 4]
        box_b: Single bounding box, Shape: [4]
    Return:
        jaccard overlap: Shape: [box_a.shape[0]]
    """
    inter = intersect(box_a, box_b)
    area_a = ((box_a[:, 2] - box_a[:, 0]) *
              (box_a[:, 3] - box_a[:, 1]))  # [A,B]
    area_b = ((box_b[2] - box_b[0]) *
              (box_b[3] - box_b[1]))  # [A,B]
    union = area_a + area_b - inter
    return inter / union  # [A,B]
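
# Example (illustrative):
#   box_a = np.array([[0., 0., 10., 10.]])   # one box, shape [num_boxes, 4]
#   box_b = np.array([0., 0., 5., 10.])      # single box, shape [4]
#   jaccard_numpy(box_a, box_b)              # -> array([0.5]): intersection 50
#                                            #    over union 100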


class bbox():

    def __init__(self, xmin, ymin, xmax, ymax):
        self.xmin = xmin
        self.ymin = ymin
        self.xmax = xmax
        self.ymax = ymax


def random_brightness(img):
    prob = np.random.uniform(0, 1)
    if prob < cfg.brightness_prob:
        delta = np.random.uniform(-cfg.brightness_delta,
                                  cfg.brightness_delta) + 1
        img = ImageEnhance.Brightness(img).enhance(delta)
    return img


def random_contrast(img):
    prob = np.random.uniform(0, 1)
    if prob < cfg.contrast_prob:
        delta = np.random.uniform(-cfg.contrast_delta,
                                  cfg.contrast_delta) + 1
        img = ImageEnhance.Contrast(img).enhance(delta)
    return img


def random_saturation(img):
    prob = np.random.uniform(0, 1)
    if prob < cfg.saturation_prob:
        delta = np.random.uniform(-cfg.saturation_delta,
                                  cfg.saturation_delta) + 1
        img = ImageEnhance.Color(img).enhance(delta)
    return img


def random_hue(img):
    prob = np.random.uniform(0, 1)
    if prob < cfg.hue_prob:
        delta = np.random.uniform(-cfg.hue_delta, cfg.hue_delta)
        img_hsv = np.array(img.convert('HSV'))
        img_hsv[:, :, 0] = img_hsv[:, :, 0] + delta
        img = Image.fromarray(img_hsv, mode='HSV').convert('RGB')
    return img


def distort_image(img):
    prob = np.random.uniform(0, 1)
    # Apply the distortions in a different order.
    if prob > 0.5:
        img = random_brightness(img)
        img = random_contrast(img)
        img = random_saturation(img)
        img = random_hue(img)
    else:
        img = random_brightness(img)
        img = random_saturation(img)
        img = random_hue(img)
        img = random_contrast(img)
    return img


def meet_emit_constraint(src_bbox, sample_bbox):
    center_x = (src_bbox.xmax + src_bbox.xmin) / 2
    center_y = (src_bbox.ymax + src_bbox.ymin) / 2
    if center_x >= sample_bbox.xmin and \
            center_x <= sample_bbox.xmax and \
            center_y >= sample_bbox.ymin and \
            center_y <= sample_bbox.ymax:
        return True
    return False


def project_bbox(object_bbox, sample_bbox):
    if object_bbox.xmin >= sample_bbox.xmax or \
            object_bbox.xmax <= sample_bbox.xmin or \
            object_bbox.ymin >= sample_bbox.ymax or \
            object_bbox.ymax <= sample_bbox.ymin:
        return False
    else:
        proj_bbox = bbox(0, 0, 0, 0)
        sample_width = sample_bbox.xmax - sample_bbox.xmin
        sample_height = sample_bbox.ymax - sample_bbox.ymin
        proj_bbox.xmin = (object_bbox.xmin - sample_bbox.xmin) / sample_width
        proj_bbox.ymin = (object_bbox.ymin - sample_bbox.ymin) / sample_height
        proj_bbox.xmax = (object_bbox.xmax - sample_bbox.xmin) / sample_width
        proj_bbox.ymax = (object_bbox.ymax - sample_bbox.ymin) / sample_height
        proj_bbox = clip_bbox(proj_bbox)
        if bbox_area(proj_bbox) > 0:
            return proj_bbox
        else:
            return False


def transform_labels(bbox_labels, sample_bbox):
    sample_labels = []
    for i in range(len(bbox_labels)):
        sample_label = []
        object_bbox = bbox(bbox_labels[i][1], bbox_labels[i][2],
                           bbox_labels[i][3], bbox_labels[i][4])
        if not meet_emit_constraint(object_bbox, sample_bbox):
            continue
        proj_bbox = project_bbox(object_bbox, sample_bbox)
        if proj_bbox:
            sample_label.append(bbox_labels[i][0])
            sample_label.append(float(proj_bbox.xmin))
            sample_label.append(float(proj_bbox.ymin))
            sample_label.append(float(proj_bbox.xmax))
            sample_label.append(float(proj_bbox.ymax))
            sample_label = sample_label + bbox_labels[i][5:]
            sample_labels.append(sample_label)
    return sample_labels


def expand_image(img, bbox_labels, img_width, img_height):
    prob = np.random.uniform(0, 1)
    if prob < cfg.expand_prob:
        if cfg.expand_max_ratio - 1 >= 0.01:
            expand_ratio = np.random.uniform(1, cfg.expand_max_ratio)
            height = int(img_height * expand_ratio)
            width = int(img_width * expand_ratio)
            h_off = math.floor(np.random.uniform(0, height - img_height))
            w_off = math.floor(np.random.uniform(0, width - img_width))
            expand_bbox = bbox(-w_off / img_width, -h_off / img_height,
                               (width - w_off) / img_width,
                               (height - h_off) / img_height)
            expand_img = np.ones((height, width, 3))
            expand_img = np.uint8(expand_img * np.squeeze(cfg.img_mean))
            expand_img = Image.fromarray(expand_img)
            expand_img.paste(img, (int(w_off), int(h_off)))
            bbox_labels = transform_labels(bbox_labels, expand_bbox)
            return expand_img, bbox_labels, width, height
    return img, bbox_labels, img_width, img_height


def clip_bbox(src_bbox):
    src_bbox.xmin = max(min(src_bbox.xmin, 1.0), 0.0)
    src_bbox.ymin = max(min(src_bbox.ymin, 1.0), 0.0)
    src_bbox.xmax = max(min(src_bbox.xmax, 1.0), 0.0)
    src_bbox.ymax = max(min(src_bbox.ymax, 1.0), 0.0)
    return src_bbox


def bbox_area(src_bbox):
    if src_bbox.xmax < src_bbox.xmin or src_bbox.ymax < src_bbox.ymin:
        return 0.
    else:
        width = src_bbox.xmax - src_bbox.xmin
        height = src_bbox.ymax - src_bbox.ymin
        return width * height


def intersect_bbox(bbox1, bbox2):
    if bbox2.xmin > bbox1.xmax or bbox2.xmax < bbox1.xmin or \
            bbox2.ymin > bbox1.ymax or bbox2.ymax < bbox1.ymin:
        intersection_box = bbox(0.0, 0.0, 0.0, 0.0)
    else:
        intersection_box = bbox(
            max(bbox1.xmin, bbox2.xmin),
            max(bbox1.ymin, bbox2.ymin),
            min(bbox1.xmax, bbox2.xmax), min(bbox1.ymax, bbox2.ymax))
    return intersection_box


def bbox_coverage(bbox1, bbox2):
    inter_box = intersect_bbox(bbox1, bbox2)
    intersect_size = bbox_area(inter_box)

    if intersect_size > 0:
        bbox1_size = bbox_area(bbox1)
        return intersect_size / bbox1_size
    else:
        return 0.


def generate_batch_random_samples(batch_sampler, bbox_labels, image_width,
                                  image_height, scale_array, resize_width,
                                  resize_height):
    sampled_bbox = []
    for sampler in batch_sampler:
        found = 0
        for i in range(sampler.max_trial):
            if found >= sampler.max_sample:
                break
            sample_bbox = data_anchor_sampling(
                sampler, bbox_labels, image_width, image_height, scale_array,
                resize_width, resize_height)
            if sample_bbox == 0:
                break
            if satisfy_sample_constraint(sampler, sample_bbox, bbox_labels):
                sampled_bbox.append(sample_bbox)
                found = found + 1
    return sampled_bbox


def data_anchor_sampling(sampler, bbox_labels, image_width, image_height,
                         scale_array, resize_width, resize_height):
    num_gt = len(bbox_labels)
    # np.random.randint range: [low, high)
    rand_idx = np.random.randint(0, num_gt) if num_gt != 0 else 0

    if num_gt != 0:
        norm_xmin = bbox_labels[rand_idx][1]
        norm_ymin = bbox_labels[rand_idx][2]
        norm_xmax = bbox_labels[rand_idx][3]
        norm_ymax = bbox_labels[rand_idx][4]

        xmin = norm_xmin * image_width
        ymin = norm_ymin * image_height
        wid = image_width * (norm_xmax - norm_xmin)
        hei = image_height * (norm_ymax - norm_ymin)
        range_size = 0

        area = wid * hei
        for scale_ind in range(0, len(scale_array) - 1):
            if area > scale_array[scale_ind] ** 2 and area < \
                    scale_array[scale_ind + 1] ** 2:
                range_size = scale_ind + 1
                break

        if area > scale_array[len(scale_array) - 2]**2:
            range_size = len(scale_array) - 2
        scale_choose = 0.0
        if range_size == 0:
            rand_idx_size = 0
        else:
            # np.random.randint range: [low, high)
            rng_rand_size = np.random.randint(0, range_size + 1)
            rand_idx_size = rng_rand_size % (range_size + 1)

        if rand_idx_size == range_size:
            min_resize_val = scale_array[rand_idx_size] / 2.0
            max_resize_val = min(2.0 * scale_array[rand_idx_size],
                                 2 * math.sqrt(wid * hei))
            scale_choose = random.uniform(min_resize_val, max_resize_val)
        else:
            min_resize_val = scale_array[rand_idx_size] / 2.0
            max_resize_val = 2.0 * scale_array[rand_idx_size]
            scale_choose = random.uniform(min_resize_val, max_resize_val)

        sample_bbox_size = wid * resize_width / scale_choose

        w_off_orig = 0.0
        h_off_orig = 0.0
        if sample_bbox_size < max(image_height, image_width):
            if wid <= sample_bbox_size:
                w_off_orig = np.random.uniform(xmin + wid - sample_bbox_size,
                                               xmin)
            else:
                w_off_orig = np.random.uniform(xmin,
                                               xmin + wid - sample_bbox_size)

            if hei <= sample_bbox_size:
                h_off_orig = np.random.uniform(ymin + hei - sample_bbox_size,
                                               ymin)
            else:
                h_off_orig = np.random.uniform(ymin,
                                               ymin + hei - sample_bbox_size)
        else:
            w_off_orig = np.random.uniform(image_width - sample_bbox_size, 0.0)
            h_off_orig = np.random.uniform(
                image_height - sample_bbox_size, 0.0)

        w_off_orig = math.floor(w_off_orig)
        h_off_orig = math.floor(h_off_orig)

        # Figure out top left coordinates.
        w_off = 0.0
        h_off = 0.0
        w_off = float(w_off_orig / image_width)
        h_off = float(h_off_orig / image_height)

        sampled_bbox = bbox(w_off, h_off,
                            w_off + float(sample_bbox_size / image_width),
                            h_off + float(sample_bbox_size / image_height))

        return sampled_bbox
    else:
        return 0


def jaccard_overlap(sample_bbox, object_bbox):
    if sample_bbox.xmin >= object_bbox.xmax or \
            sample_bbox.xmax <= object_bbox.xmin or \
            sample_bbox.ymin >= object_bbox.ymax or \
            sample_bbox.ymax <= object_bbox.ymin:
        return 0
    intersect_xmin = max(sample_bbox.xmin, object_bbox.xmin)
    intersect_ymin = max(sample_bbox.ymin, object_bbox.ymin)
    intersect_xmax = min(sample_bbox.xmax, object_bbox.xmax)
    intersect_ymax = min(sample_bbox.ymax, object_bbox.ymax)
    intersect_size = (intersect_xmax - intersect_xmin) * (
        intersect_ymax - intersect_ymin)
    sample_bbox_size = bbox_area(sample_bbox)
    object_bbox_size = bbox_area(object_bbox)
    overlap = intersect_size / (
        sample_bbox_size + object_bbox_size - intersect_size)
    return overlap


def satisfy_sample_constraint(sampler, sample_bbox, bbox_labels):
    if sampler.min_jaccard_overlap == 0 and sampler.max_jaccard_overlap == 0:
        has_jaccard_overlap = False
    else:
        has_jaccard_overlap = True
    if sampler.min_object_coverage == 0 and sampler.max_object_coverage == 0:
        has_object_coverage = False
    else:
        has_object_coverage = True

    if not has_jaccard_overlap and not has_object_coverage:
        return True
    found = False
    for i in range(len(bbox_labels)):
        object_bbox = bbox(bbox_labels[i][1], bbox_labels[i][2],
                           bbox_labels[i][3], bbox_labels[i][4])
        if has_jaccard_overlap:
            overlap = jaccard_overlap(sample_bbox, object_bbox)
            if sampler.min_jaccard_overlap != 0 and \
                    overlap < sampler.min_jaccard_overlap:
                continue
            if sampler.max_jaccard_overlap != 0 and \
                    overlap > sampler.max_jaccard_overlap:
                continue
            found = True
        if has_object_coverage:
            object_coverage = bbox_coverage(object_bbox, sample_bbox)
            if sampler.min_object_coverage != 0 and \
                    object_coverage < sampler.min_object_coverage:
                continue
            if sampler.max_object_coverage != 0 and \
                    object_coverage > sampler.max_object_coverage:
                continue
            found = True
        if found:
            return True
    return found


def crop_image_sampling(img, bbox_labels, sample_bbox, image_width,
                        image_height, resize_width, resize_height,
                        min_face_size):
    # no clipping here
    xmin = int(sample_bbox.xmin * image_width)
    xmax = int(sample_bbox.xmax * image_width)
    ymin = int(sample_bbox.ymin * image_height)
    ymax = int(sample_bbox.ymax * image_height)
    w_off = xmin
    h_off = ymin
    width = xmax - xmin
    height = ymax - ymin

    cross_xmin = max(0.0, float(w_off))
    cross_ymin = max(0.0, float(h_off))
    cross_xmax = min(float(w_off + width - 1.0), float(image_width))
    cross_ymax = min(float(h_off + height - 1.0), float(image_height))
    cross_width = cross_xmax - cross_xmin
    cross_height = cross_ymax - cross_ymin

    roi_xmin = 0 if w_off >= 0 else abs(w_off)
    roi_ymin = 0 if h_off >= 0 else abs(h_off)
    roi_width = cross_width
    roi_height = cross_height

    roi_y1 = int(roi_ymin)
    roi_y2 = int(roi_ymin + roi_height)
    roi_x1 = int(roi_xmin)
    roi_x2 = int(roi_xmin + roi_width)

    cross_y1 = int(cross_ymin)
    cross_y2 = int(cross_ymin + cross_height)
    cross_x1 = int(cross_xmin)
    cross_x2 = int(cross_xmin + cross_width)

    sample_img = np.zeros((height, width, 3))
    # print(sample_img.shape)
    sample_img[roi_y1:roi_y2, roi_x1:roi_x2] = \
        img[cross_y1:cross_y2, cross_x1:cross_x2]
    sample_img = cv2.resize(
        sample_img, (resize_width, resize_height), interpolation=cv2.INTER_AREA)

    resize_val = resize_width
    sample_labels = transform_labels_sampling(bbox_labels, sample_bbox,
                                              resize_val, min_face_size)
    return sample_img, sample_labels


def transform_labels_sampling(bbox_labels, sample_bbox, resize_val,
                              min_face_size):
    sample_labels = []
    for i in range(len(bbox_labels)):
        sample_label = []
        object_bbox = bbox(bbox_labels[i][1], bbox_labels[i][2],
                           bbox_labels[i][3], bbox_labels[i][4])
        if not meet_emit_constraint(object_bbox, sample_bbox):
            continue
        proj_bbox = project_bbox(object_bbox, sample_bbox)
        if proj_bbox:
            real_width = float((proj_bbox.xmax - proj_bbox.xmin) * resize_val)
            real_height = float((proj_bbox.ymax - proj_bbox.ymin) * resize_val)
            if real_width * real_height < float(min_face_size * min_face_size):
                continue
            else:
                sample_label.append(bbox_labels[i][0])
                sample_label.append(float(proj_bbox.xmin))
                sample_label.append(float(proj_bbox.ymin))
                sample_label.append(float(proj_bbox.xmax))
                sample_label.append(float(proj_bbox.ymax))
                sample_label = sample_label + bbox_labels[i][5:]
                sample_labels.append(sample_label)

    return sample_labels


def generate_sample(sampler, image_width, image_height):
    scale = np.random.uniform(sampler.min_scale, sampler.max_scale)
    aspect_ratio = np.random.uniform(sampler.min_aspect_ratio,
                                     sampler.max_aspect_ratio)
    aspect_ratio = max(aspect_ratio, (scale**2.0))
    aspect_ratio = min(aspect_ratio, 1 / (scale**2.0))

    bbox_width = scale * (aspect_ratio**0.5)
    bbox_height = scale / (aspect_ratio**0.5)

    # guarantee a squared image patch after cropping
    if sampler.use_square:
        if image_height < image_width:
            bbox_width = bbox_height * image_height / image_width
        else:
            bbox_height = bbox_width * image_width / image_height

    xmin_bound = 1 - bbox_width
    ymin_bound = 1 - bbox_height
    xmin = np.random.uniform(0, xmin_bound)
    ymin = np.random.uniform(0, ymin_bound)
    xmax = xmin + bbox_width
    ymax = ymin + bbox_height
    sampled_bbox = bbox(xmin, ymin, xmax, ymax)
    return sampled_bbox


def generate_batch_samples(batch_sampler, bbox_labels, image_width,
                           image_height):
    sampled_bbox = []
    for sampler in batch_sampler:
        found = 0
        for i in range(sampler.max_trial):
            if found >= sampler.max_sample:
                break
            sample_bbox = generate_sample(sampler, image_width, image_height)
            if satisfy_sample_constraint(sampler, sample_bbox, bbox_labels):
                sampled_bbox.append(sample_bbox)
                found = found + 1
    return sampled_bbox


def crop_image(img, bbox_labels, sample_bbox, image_width, image_height,
               resize_width, resize_height, min_face_size):
    sample_bbox = clip_bbox(sample_bbox)
    xmin = int(sample_bbox.xmin * image_width)
    xmax = int(sample_bbox.xmax * image_width)
    ymin = int(sample_bbox.ymin * image_height)
    ymax = int(sample_bbox.ymax * image_height)

    sample_img = img[ymin:ymax, xmin:xmax]
    resize_val = resize_width
    sample_labels = transform_labels_sampling(bbox_labels, sample_bbox,
                                              resize_val, min_face_size)
    return sample_img, sample_labels


def to_chw_bgr(image):
    """
    Transpose image from HWC to CHW and from RGB to BGR.
    Args:
        image (np.array): an image with HWC and RGB layout.
    """
    # HWC to CHW
    if len(image.shape) == 3:
        image = np.swapaxes(image, 1, 2)
        image = np.swapaxes(image, 1, 0)
    # RGB to BGR
    image = image[[2, 1, 0], :, :]
    return image
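
# Example (illustrative):
#   img = np.zeros((2, 2, 3), dtype=np.uint8)   # HWC, RGB
#   to_chw_bgr(img).shape                        # -> (3, 2, 2): CHW layout,
#                                                #    channels reordered to BGR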


def anchor_crop_image_sampling(img,
                               bbox_labels,
                               scale_array,
                               img_width,
                               img_height):
    mean = np.array([104, 117, 123], dtype=np.float32)
    maxSize = 12000  # max size
    infDistance = 9999999
    bbox_labels = np.array(bbox_labels)
    scale = np.array([img_width, img_height, img_width, img_height])

    boxes = bbox_labels[:, 1:5] * scale
    labels = bbox_labels[:, 0]

    boxArea = (boxes[:, 2] - boxes[:, 0] + 1) * (boxes[:, 3] - boxes[:, 1] + 1)
    # argsort = np.argsort(boxArea)
    # rand_idx = random.randint(min(len(argsort),6))
    # print('rand idx',rand_idx)
    rand_idx = np.random.randint(len(boxArea))
    rand_Side = boxArea[rand_idx] ** 0.5
    # rand_Side = min(boxes[rand_idx,2] - boxes[rand_idx,0] + 1,
    #                 boxes[rand_idx,3] - boxes[rand_idx,1] + 1)

    distance = infDistance
    anchor_idx = 5
    for i, anchor in enumerate(scale_array):
        if abs(anchor - rand_Side) < distance:
            distance = abs(anchor - rand_Side)
            anchor_idx = i

    target_anchor = random.choice(scale_array[0:min(anchor_idx + 1, 5) + 1])
    ratio = float(target_anchor) / rand_Side
    ratio = ratio * (2**random.uniform(-1, 1))

    if int(img_height * ratio * img_width * ratio) > maxSize * maxSize:
        ratio = (maxSize * maxSize / (img_height * img_width))**0.5

    interp_methods = [cv2.INTER_LINEAR, cv2.INTER_CUBIC,
                      cv2.INTER_AREA, cv2.INTER_NEAREST, cv2.INTER_LANCZOS4]
    interp_method = random.choice(interp_methods)
    image = cv2.resize(img, None, None, fx=ratio,
                       fy=ratio, interpolation=interp_method)

    boxes[:, 0] *= ratio
    boxes[:, 1] *= ratio
    boxes[:, 2] *= ratio
    boxes[:, 3] *= ratio

    height, width, _ = image.shape

    sample_boxes = []

    xmin = boxes[rand_idx, 0]
    ymin = boxes[rand_idx, 1]
    bw = (boxes[rand_idx, 2] - boxes[rand_idx, 0] + 1)
    bh = (boxes[rand_idx, 3] - boxes[rand_idx, 1] + 1)

    w = h = cfg.INPUT_SIZE

    for _ in range(50):
        if w < max(height, width):
            if bw <= w:
                w_off = random.uniform(xmin + bw - w, xmin)
            else:
                w_off = random.uniform(xmin, xmin + bw - w)

            if bh <= h:
                h_off = random.uniform(ymin + bh - h, ymin)
            else:
                h_off = random.uniform(ymin, ymin + bh - h)
        else:
            w_off = random.uniform(width - w, 0)
            h_off = random.uniform(height - h, 0)

        w_off = math.floor(w_off)
        h_off = math.floor(h_off)

        # convert to integer rect x1,y1,x2,y2
        rect = np.array(
            [int(w_off), int(h_off), int(w_off + w), int(h_off + h)])

        # keep overlap with gt box IF center in sampled patch
        centers = (boxes[:, :2] + boxes[:, 2:]) / 2.0
        # mask in all gt boxes that above and to the left of centers
        m1 = (rect[0] <= boxes[:, 0]) * (rect[1] <= boxes[:, 1])
        # mask in all gt boxes that under and to the right of centers
        m2 = (rect[2] >= boxes[:, 2]) * (rect[3] >= boxes[:, 3])
        # mask in that both m1 and m2 are true
        mask = m1 * m2

        overlap = jaccard_numpy(boxes, rect)
        # have any valid boxes? try again if not
        if not mask.any() and not overlap.max() > 0.7:
            continue
        else:
            sample_boxes.append(rect)

    sampled_labels = []

    if len(sample_boxes) > 0:
        choice_idx = np.random.randint(len(sample_boxes))
        choice_box = sample_boxes[choice_idx]
        # print('crop the box :',choice_box)
        centers = (boxes[:, :2] + boxes[:, 2:]) / 2.0
        m1 = (choice_box[0] < centers[:, 0]) * \
            (choice_box[1] < centers[:, 1])
        m2 = (choice_box[2] > centers[:, 0]) * \
            (choice_box[3] > centers[:, 1])
        mask = m1 * m2
        current_boxes = boxes[mask, :].copy()
        current_labels = labels[mask]
        current_boxes[:, :2] -= choice_box[:2]
        current_boxes[:, 2:] -= choice_box[:2]

        if choice_box[0] < 0 or choice_box[1] < 0:
            new_img_width = width if choice_box[
                0] >= 0 else width - choice_box[0]
            new_img_height = height if choice_box[
                1] >= 0 else height - choice_box[1]
            image_pad = np.zeros(
                (new_img_height, new_img_width, 3), dtype=float)
            image_pad[:, :, :] = mean
            start_left = 0 if choice_box[0] >= 0 else -choice_box[0]
            start_top = 0 if choice_box[1] >= 0 else -choice_box[1]
            image_pad[start_top:, start_left:, :] = image

            choice_box_w = choice_box[2] - choice_box[0]
            choice_box_h = choice_box[3] - choice_box[1]

            start_left = choice_box[0] if choice_box[0] >= 0 else 0
            start_top = choice_box[1] if choice_box[1] >= 0 else 0
            end_right = start_left + choice_box_w
            end_bottom = start_top + choice_box_h
            current_image = image_pad[
                start_top:end_bottom, start_left:end_right, :].copy()
            image_height, image_width, _ = current_image.shape
            if cfg.filter_min_face:
                bbox_w = current_boxes[:, 2] - current_boxes[:, 0]
                bbox_h = current_boxes[:, 3] - current_boxes[:, 1]
                bbox_area = bbox_w * bbox_h
                mask = bbox_area > (cfg.min_face_size * cfg.min_face_size)
                current_boxes = current_boxes[mask]
                current_labels = current_labels[mask]
                for i in range(len(current_boxes)):
                    sample_label = []
                    sample_label.append(current_labels[i])
                    sample_label.append(current_boxes[i][0] / image_width)
                    sample_label.append(current_boxes[i][1] / image_height)
                    sample_label.append(current_boxes[i][2] / image_width)
                    sample_label.append(current_boxes[i][3] / image_height)
                    sampled_labels += [sample_label]
                sampled_labels = np.array(sampled_labels)
            else:
                current_boxes /= np.array([image_width,
                                           image_height, image_width, image_height])
                sampled_labels = np.hstack(
                    (current_labels[:, np.newaxis], current_boxes))

            return current_image, sampled_labels

        current_image = image[choice_box[1]:choice_box[
            3], choice_box[0]:choice_box[2], :].copy()
        image_height, image_width, _ = current_image.shape

        if cfg.filter_min_face:
            bbox_w = current_boxes[:, 2] - current_boxes[:, 0]
            bbox_h = current_boxes[:, 3] - current_boxes[:, 1]
            bbox_area = bbox_w * bbox_h
            mask = bbox_area > (cfg.min_face_size * cfg.min_face_size)
            current_boxes = current_boxes[mask]
            current_labels = current_labels[mask]
            for i in range(len(current_boxes)):
                sample_label = []
                sample_label.append(current_labels[i])
                sample_label.append(current_boxes[i][0] / image_width)
                sample_label.append(current_boxes[i][1] / image_height)
                sample_label.append(current_boxes[i][2] / image_width)
                sample_label.append(current_boxes[i][3] / image_height)
                sampled_labels += [sample_label]
            sampled_labels = np.array(sampled_labels)
        else:
            current_boxes /= np.array([image_width,
                                       image_height, image_width, image_height])
            sampled_labels = np.hstack(
                (current_labels[:, np.newaxis], current_boxes))

        return current_image, sampled_labels
    else:
        image_height, image_width, _ = image.shape
        if cfg.filter_min_face:
            bbox_w = boxes[:, 2] - boxes[:, 0]
            bbox_h = boxes[:, 3] - boxes[:, 1]
            bbox_area = bbox_w * bbox_h
            mask = bbox_area > (cfg.min_face_size * cfg.min_face_size)
            boxes = boxes[mask]
            labels = labels[mask]
            for i in range(len(boxes)):
                sample_label = []
                sample_label.append(labels[i])
                sample_label.append(boxes[i][0] / image_width)
                sample_label.append(boxes[i][1] / image_height)
                sample_label.append(boxes[i][2] / image_width)
                sample_label.append(boxes[i][3] / image_height)
                sampled_labels += [sample_label]
            sampled_labels = np.array(sampled_labels)
        else:
            boxes /= np.array([image_width, image_height,
                               image_width, image_height])
            sampled_labels = np.hstack(
                (labels[:, np.newaxis], boxes))

        return image, sampled_labels


def preprocess(img, bbox_labels, mode, image_path):
    img_width, img_height = img.size
    sampled_labels = bbox_labels
    if mode == 'train':
        if cfg.apply_distort:
            img = distort_image(img)
        if cfg.apply_expand:
            img, bbox_labels, img_width, img_height = expand_image(
                img, bbox_labels, img_width, img_height)

        batch_sampler = []
        prob = np.random.uniform(0., 1.)
        if prob > cfg.data_anchor_sampling_prob and cfg.anchor_sampling:
            scale_array = np.array(cfg.ANCHOR_SIZES)  # [16, 32, 64, 128, 256, 512]
            '''
            batch_sampler.append(
                sampler(1, 50, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.6, 0.0, True))
            sampled_bbox = generate_batch_random_samples(
                batch_sampler, bbox_labels, img_width, img_height, scale_array,
                cfg.resize_width, cfg.resize_height)
            '''
            img = np.array(img)
            img, sampled_labels = anchor_crop_image_sampling(
                img, bbox_labels, scale_array, img_width, img_height)
            '''
            if len(sampled_bbox) > 0:
                idx = int(np.random.uniform(0, len(sampled_bbox)))
                img, sampled_labels = crop_image_sampling(
                    img, bbox_labels, sampled_bbox[idx], img_width, img_height,
                    cfg.resize_width, cfg.resize_height, cfg.min_face_size)
            '''
            img = img.astype('uint8')
            img = Image.fromarray(img)
        else:
            batch_sampler.append(sampler(1, 50, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 1.0,
                                         0.0, True))
            batch_sampler.append(sampler(1, 50, 0.3, 1.0, 1.0, 1.0, 0.0, 0.0, 1.0,
                                         0.0, True))
            batch_sampler.append(sampler(1, 50, 0.3, 1.0, 1.0, 1.0, 0.0, 0.0, 1.0,
                                         0.0, True))
            batch_sampler.append(sampler(1, 50, 0.3, 1.0, 1.0, 1.0, 0.0, 0.0, 1.0,
                                         0.0, True))
            batch_sampler.append(sampler(1, 50, 0.3, 1.0, 1.0, 1.0, 0.0, 0.0, 1.0,
                                         0.0, True))
            sampled_bbox = generate_batch_samples(
                batch_sampler, bbox_labels, img_width, img_height)

            img = np.array(img)
            if len(sampled_bbox) > 0:
                idx = int(np.random.uniform(0, len(sampled_bbox)))
                img, sampled_labels = crop_image(
                    img, bbox_labels, sampled_bbox[idx], img_width, img_height,
                    cfg.resize_width, cfg.resize_height, cfg.min_face_size)

            img = Image.fromarray(img)

    interp_mode = [
        Image.BILINEAR, Image.HAMMING, Image.NEAREST, Image.BICUBIC,
        Image.LANCZOS
    ]
    interp_indx = np.random.randint(0, 5)

    img = img.resize((cfg.resize_width, cfg.resize_height),
                     resample=interp_mode[interp_indx])

    img = np.array(img)

    if mode == 'train':
        mirror = int(np.random.uniform(0, 2))
        if mirror == 1:
            img = img[:, ::-1, :]
            for i in six.moves.xrange(len(sampled_labels)):
                tmp = sampled_labels[i][1]
                sampled_labels[i][1] = 1 - sampled_labels[i][3]
                sampled_labels[i][3] = 1 - tmp

    # img = Image.fromarray(img)
    img = to_chw_bgr(img)
    img = img.astype('float32')
    img -= cfg.img_mean
    img = img[[2, 1, 0], :, :]  # to RGB
    # img = img * cfg.scale

    return img, sampled_labels
@ -0,0 +1,272 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT license.

import os
import argparse
import torch
import torch.nn as nn
import torch.utils.data as data
import torch.backends.cudnn as cudnn
import torchvision.transforms as transforms
import os.path as osp

import cv2
import time
import numpy as np
from PIL import Image
import scipy.io as sio

from data.choose_config import cfg
cfg = cfg.cfg

from torch.autograd import Variable
from utils.augmentations import to_chw_bgr

from importlib import import_module

import warnings
warnings.filterwarnings("ignore")

parser = argparse.ArgumentParser(description='s3fd evaluation wider')
parser.add_argument('--model', type=str,
                    default='./weights/rpool_face_c.pth', help='trained model')
parser.add_argument('--thresh', default=0.05, type=float,
                    help='Final confidence threshold')
parser.add_argument('--model_arch',
                    default='RPool_Face_C', type=str,
                    choices=['RPool_Face_C', 'RPool_Face_Quant', 'RPool_Face_QVGA_monochrome'],
                    help='choose architecture among rpool variants')
parser.add_argument('--save_folder', type=str,
                    default='rpool_face_predictions', help='folder for saving predictions')
parser.add_argument('--subset', type=str,
                    default='val',
                    choices=['val', 'test'],
                    help='choose which set to run testing on')

args = parser.parse_args()


use_cuda = torch.cuda.is_available()

if use_cuda:
    torch.set_default_tensor_type('torch.cuda.FloatTensor')
else:
    torch.set_default_tensor_type('torch.FloatTensor')


def detect_face(net, img, shrink):
    if shrink != 1:
        img = cv2.resize(img, None, None, fx=shrink, fy=shrink,
                         interpolation=cv2.INTER_LINEAR)

    x = to_chw_bgr(img)
    x = x.astype('float32')
    x -= cfg.img_mean
    x = x[[2, 1, 0], :, :]

    if cfg.IS_MONOCHROME:
        x = 0.299 * x[0] + 0.587 * x[1] + 0.114 * x[2]
        x = torch.from_numpy(x).unsqueeze(0).unsqueeze(0)
    else:
        x = torch.from_numpy(x).unsqueeze(0)

    if use_cuda:
        x = x.cuda()

    y = net(x)
    detections = y.data
    detections = detections.cpu().numpy()

    det_conf = detections[0, 1, :, 0]
    det_xmin = img.shape[1] * detections[0, 1, :, 1] / shrink
    det_ymin = img.shape[0] * detections[0, 1, :, 2] / shrink
    det_xmax = img.shape[1] * detections[0, 1, :, 3] / shrink
    det_ymax = img.shape[0] * detections[0, 1, :, 4] / shrink
    det = np.column_stack((det_xmin, det_ymin, det_xmax, det_ymax, det_conf))

    keep_index = np.where(det[:, 4] >= args.thresh)[0]
    det = det[keep_index, :]

    return det


def multi_scale_test(net, image, max_im_shrink):
    # shrink detecting and shrink only detect big face
    st = 0.5 if max_im_shrink >= 0.75 else 0.5 * max_im_shrink
    det_s = detect_face(net, image, st)
    index = np.where(np.maximum(
        det_s[:, 2] - det_s[:, 0] + 1, det_s[:, 3] - det_s[:, 1] + 1) > 30)[0]
    det_s = det_s[index, :]

    # enlarge one times
    bt = min(2, max_im_shrink) if max_im_shrink > 1 else (
        st + max_im_shrink) / 2
    det_b = detect_face(net, image, bt)

    # enlarge small image x times for small face
    if max_im_shrink > 2:
        bt *= 2
        while bt < max_im_shrink:
            det_b = np.row_stack((det_b, detect_face(net, image, bt)))
            bt *= 2
        det_b = np.row_stack((det_b, detect_face(net, image, max_im_shrink)))

    # enlarge only detect small face
    if bt > 1:
        index = np.where(np.minimum(
            det_b[:, 2] - det_b[:, 0] + 1, det_b[:, 3] - det_b[:, 1] + 1) < 100)[0]
        det_b = det_b[index, :]
    else:
        index = np.where(np.maximum(
            det_b[:, 2] - det_b[:, 0] + 1, det_b[:, 3] - det_b[:, 1] + 1) > 30)[0]
        det_b = det_b[index, :]

    return det_s, det_b


def flip_test(net, image, shrink):
    image_f = cv2.flip(image, 1)
    det_f = detect_face(net, image_f, shrink)

    det_t = np.zeros(det_f.shape)
    det_t[:, 0] = image.shape[1] - det_f[:, 2]
    det_t[:, 1] = det_f[:, 1]
    det_t[:, 2] = image.shape[1] - det_f[:, 0]
    det_t[:, 3] = det_f[:, 3]
    det_t[:, 4] = det_f[:, 4]
    return det_t


def bbox_vote(det):
    order = det[:, 4].ravel().argsort()[::-1]
    det = det[order, :]
    while det.shape[0] > 0:
        # IOU
        area = (det[:, 2] - det[:, 0] + 1) * (det[:, 3] - det[:, 1] + 1)
        xx1 = np.maximum(det[0, 0], det[:, 0])
        yy1 = np.maximum(det[0, 1], det[:, 1])
        xx2 = np.minimum(det[0, 2], det[:, 2])
        yy2 = np.minimum(det[0, 3], det[:, 3])
        w = np.maximum(0.0, xx2 - xx1 + 1)
        h = np.maximum(0.0, yy2 - yy1 + 1)
        inter = w * h
        o = inter / (area[0] + area[:] - inter)

        # get needed merge det and delete these det
        merge_index = np.where(o >= 0.3)[0]
        det_accu = det[merge_index, :]
        det = np.delete(det, merge_index, 0)

        if merge_index.shape[0] <= 1:
            continue
        det_accu[:, 0:4] = det_accu[:, 0:4] * np.tile(det_accu[:, -1:], (1, 4))
        max_score = np.max(det_accu[:, 4])
        det_accu_sum = np.zeros((1, 5))
        det_accu_sum[:, 0:4] = np.sum(
            det_accu[:, 0:4], axis=0) / np.sum(det_accu[:, -1:])
        det_accu_sum[:, 4] = max_score
        try:
            dets = np.row_stack((dets, det_accu_sum))
        except NameError:  # first merged box: `dets` does not exist yet
            dets = det_accu_sum

    dets = dets[0:750, :]
    return dets
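
# Note on bbox_vote above (our annotation): each pass takes the highest-scoring
# remaining detection, merges every box with IoU >= 0.3 against it into a
# single score-weighted average box (keeping the maximum score), and removes
# the merged boxes. A detection whose only match is itself is skipped by the
# `continue` branch, so isolated boxes do not appear in the returned `dets`.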


def get_data():
    subset = args.subset

    WIDER_ROOT = os.path.join(cfg.HOME, 'WIDER_FACE')
    if subset == 'val':
        wider_face = sio.loadmat(
            os.path.join(WIDER_ROOT, 'wider_face_split',
                         'wider_face_val.mat'))
    else:
        wider_face = sio.loadmat(
            os.path.join(WIDER_ROOT, 'wider_face_split',
                         'wider_face_test.mat'))
    event_list = wider_face['event_list']
    file_list = wider_face['file_list']
    del wider_face

    imgs_path = os.path.join(
        cfg.FACE.WIDER_DIR, 'WIDER_{}'.format(subset), 'images')
    save_path = './{}'.format(args.save_folder)

    return event_list, file_list, imgs_path, save_path


if __name__ == '__main__':
    event_list, file_list, imgs_path, save_path = get_data()
    cfg.USE_NMS = False

    module = import_module('models.' + args.model_arch)
    net = module.build_s3fd('test', cfg.NUM_CLASSES)

    net = torch.nn.DataParallel(net)

    checkpoint_dict = torch.load(args.model)
    model_dict = net.state_dict()

    model_dict.update(checkpoint_dict)
    net.load_state_dict(model_dict)

    net.eval()

    if use_cuda:
        net.cuda()
        cudnn.benchmark = True

    counter = 0

    for index, event in enumerate(event_list):
        filelist = file_list[index][0]
        path = os.path.join(save_path, str(event[0][0]))  # .encode('utf-8')
        if not os.path.exists(path):
            os.makedirs(path)

        for num, file in enumerate(filelist):
            im_name = str(file[0][0])  # .encode('utf-8')
            in_file = os.path.join(imgs_path, event[0][0], im_name[:] + '.jpg')
            img = Image.open(in_file)
            if img.mode == 'L':
                img = img.convert('RGB')
            img = np.array(img)

            max_im_shrink = np.sqrt(
                1700 * 1200 / (img.shape[0] * img.shape[1]))

            shrink = max_im_shrink if max_im_shrink < 1 else 1
            counter += 1

            t1 = time.time()
            det0 = detect_face(net, img, shrink)

            det1 = flip_test(net, img, shrink)  # flip test
            [det2, det3] = multi_scale_test(net, img, max_im_shrink)

            det = np.row_stack((det0, det1, det2, det3))
            dets = bbox_vote(det)

            t2 = time.time()
            print('Detect %04d th image costs %.4f' % (counter, t2 - t1))

            fout = open(osp.join(save_path, str(event[0][
                0]), im_name + '.txt'), 'w')
            fout.write('{:s}\n'.format(str(event[0][0]) + '/' + im_name + '.jpg'))
            fout.write('{:d}\n'.format(dets.shape[0]))
            for i in range(dets.shape[0]):
                xmin = dets[i][0]
                ymin = dets[i][1]
                xmax = dets[i][2]
                ymax = dets[i][3]
                score = dets[i][4]
                fout.write('{:.1f} {:.1f} {:.1f} {:.1f} {:.3f}\n'.
                           format(xmin, ymin, (xmax - xmin + 1), (ymax - ymin + 1), score))
@ -0,0 +1,71 @@
# Code for Visual Wake Words experiments with RNNPool

The Visual Wake Words challenge, introduced by [Chowdhery et al.](https://arxiv.org/abs/1906.05721), is the binary classification problem of detecting whether or not a person is present in an image.

## Dataset
The Visual Wake Words dataset is derived from the publicly available [COCO](http://cocodataset.org/#/home) dataset. The Visual Wake Words challenge evaluates accuracy on the [minival image ids](https://raw.githubusercontent.com/tensorflow/models/master/research/object_detection/data/mscoco_minival_ids.txt), and uses the remaining 115k images of the COCO training/validation set for training. The Visual Wake Words dataset is created from COCO as follows: each image is assigned the label 1 if it contains at least one bounding box of the object of interest (e.g. person) whose area exceeds a given threshold (e.g. 0.5% of the image area), and the label 0 otherwise.
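
As a rough sketch of this labeling rule (our illustration, not one of the released scripts; assumes COCO-style annotations with `bbox` given as `[x, y, w, h]` and `category_id` 1 for 'person'):

```python
def visualwakewords_label(annotations, image_area, threshold=0.005,
                          foreground_class_id=1):
    """Label an image 1 if any foreground box covers > threshold of its area."""
    for ann in annotations:
        box_area = ann['bbox'][2] * ann['bbox'][3]
        if ann['category_id'] == foreground_class_id and \
                box_area > threshold * image_area:
            return 1
    return 0
```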
|
||||||
|
|
||||||
|
To download the COCO dataset use the script `download_coco.sh`
|
||||||
|
```bash
|
||||||
|
bash scripts/download_mscoco.sh path-to-mscoco-dataset
|
||||||
|
```
|
||||||
|
|
||||||
|
To create COCO annotation files that converts to the minival split use:
|
||||||
|
`scripts/create_coco_train_minival_split.py`
|
||||||
|
|
||||||
|
```bash
|
||||||
|
TRAIN_ANNOTATIONS_FILE="path-to-mscoco-dataset/annotations/instances_train2014.json"
|
||||||
|
VAL_ANNOTATIONS_FILE="path-to-mscoco-dataset/annotations/instances_val2014.json"
|
||||||
|
DIR="path-to-mscoco-dataset/annotations/"
|
||||||
|
python scripts/create_coco_train_minival_split.py \
|
||||||
|
--train_annotations_file="${TRAIN_ANNOTATIONS_FILE}" \
|
||||||
|
--val_annotations_file="${VAL_ANNOTATIONS_FILE}" \
|
||||||
|
--output_dir="${DIR}"
|
||||||
|
```
|
||||||
|
|
||||||
|
|
||||||
|
To generate the new annotations, use the script `scripts/create_visualwakewords_annotations.py`.
|
||||||
|
```bash
|
||||||
|
MAXITRAIN_ANNOTATIONS_FILE="path-to-mscoco-dataset/annotations/instances_maxitrain.json"
|
||||||
|
MINIVAL_ANNOTATIONS_FILE="path-to-mscoco-dataset/annotations/instances_minival.json"
|
||||||
|
VWW_OUTPUT_DIR="new-path-to-visualwakewords-dataset/annotations/"
|
||||||
|
python scripts/create_visualwakewords_annotations.py \
|
||||||
|
--train_annotations_file="${MAXITRAIN_ANNOTATIONS_FILE}" \
|
||||||
|
--val_annotations_file="${MINIVAL_ANNOTATIONS_FILE}" \
|
||||||
|
--output_dir="${VWW_OUTPUT_DIR}" \
|
||||||
|
--threshold=0.005 \
|
||||||
|
--foreground_class='person'
|
||||||
|
```
|
||||||
|
|
||||||
|
|
||||||
|
# Training
|
||||||
|
|
||||||
|
```bash
|
||||||
|
python train_visualwakewords.py \
|
||||||
|
--model_arch model_mobilenet_rnnpool \
|
||||||
|
--lr 0.05 \
|
||||||
|
--epochs 900 \
|
||||||
|
--data "path-to-mscoco-dataset" \
|
||||||
|
--ann "new-path-to-visualwakewords-dataset"
|
||||||
|
```
|
||||||
|
Specify the paths used for storing MS COCO dataset and the Visual Wakeword dataset as used in dataset creation steps in --data and --ann respectively. This script should reach a validation accuracy of about 89.57 upon completion.
|
||||||
|
|
||||||
|
# Evaluation
|
||||||
|
|
||||||
|
```bash
|
||||||
|
python eval.py \
|
||||||
|
--weights vww_rnnpool.pth \
|
||||||
|
--model_arch model_mobilenet_rnnpool \
|
||||||
|
--image_folder images \
|
||||||
|
```
|
||||||
|
|
||||||
|
The weights argument is the saved checkpoint of the model trained with architecture which is passed in model_arch argument. The folder with images for evaluation has to be passed in image_folder argument. This script will print 'Person present' or 'No person present' for each image in the folder specified.
|
||||||
|
|
||||||
|
|
||||||
|
Dataset creation code is from https://github.com/Mxbonn/visualwakewords/
|
|
@ -0,0 +1,91 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT license.

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import torch.backends.cudnn as cudnn
import torchvision
import torchvision.transforms as transforms
import os
import argparse
import random
from PIL import Image
import numpy as np
from importlib import import_module
import skimage
from skimage import filters


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

torch.backends.cudnn.benchmark = True
torch.backends.cudnn.enabled = True


# Arg parser
parser = argparse.ArgumentParser(description='PyTorch VisualWakeWords evaluation')
parser.add_argument('--weights', default=None, type=str, help='load from checkpoint')
parser.add_argument('--model_arch',
                    default='model_mobilenet_rnnpool', type=str,
                    choices=['model_mobilenet_rnnpool', 'model_mobilenet_2rnnpool'],
                    help='choose architecture among rpool variants')
parser.add_argument('--image_folder', default=None, type=str, help='folder containing images')

args = parser.parse_args()


normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225])

transform_test = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    normalize
])


if __name__ == '__main__':

    module = import_module(args.model_arch)
    model = module.mobilenetv2_rnnpool(num_classes=2, width_mult=0.35, last_channel=320)
    model = model.to(device)
    model = torch.nn.DataParallel(model)

    checkpoint = torch.load(args.weights)
    checkpoint_dict = checkpoint['model']
    model_dict = model.state_dict()
    model_dict.update(checkpoint_dict)
    model.load_state_dict(model_dict)

    model.eval()
    img_path = args.image_folder
    img_list = [os.path.join(img_path, x)
                for x in os.listdir(img_path) if x.endswith('bmp')]

    for path in sorted(img_list):
        img = Image.open(path).convert('RGB')
        img = transform_test(img)
        img = img.to(device)
        img = img.unsqueeze(0)

        out = model(img)

        print(path)
        print(out)
        if out[0][0] > 0.15:
            print('No person present')
        else:
            print('Person present')
@ -0,0 +1,208 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT license.

import re
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import torch.utils.checkpoint as cp
from collections import OrderedDict
from torchvision.models.utils import load_state_dict_from_url
import sys; sys.path.append('..')
from rnnpool import *

__all__ = ['MobileNetV2', 'mobilenetv2_rnnpool']


model_urls = {
    'mobilenet_v2': 'https://download.pytorch.org/models/mobilenet_v2-b0353104.pth',
}


def _make_divisible(v, divisor, min_value=None):
    """
    This function is taken from the original tf repo.
    It ensures that all layers have a channel number that is divisible by 8.
    It can be seen here:
    https://github.com/tensorflow/models/blob/master/research/slim/nets/mobilenet/mobilenet.py
    :param v:
    :param divisor:
    :param min_value:
    :return:
    """
    if min_value is None:
        min_value = divisor
    new_v = max(min_value, int(v + divisor / 2) // divisor * divisor)
    # Make sure that round down does not go down by more than 10%.
    if new_v < 0.9 * v:
        new_v += divisor
    return new_v
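
# Example (illustrative): with divisor=8, channel counts snap to multiples
# of 8 without shrinking by more than 10%:
#   _make_divisible(64, 8)        -> 64
#   _make_divisible(32 * 0.35, 8) -> 16   (11.2 would round down to 8, which
#                                          is below 0.9 * 11.2, so one more
#                                          divisor is added)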
|
||||||
|
|
||||||
|
|
||||||
|


class ConvBNReLU(nn.Sequential):
    def __init__(self, in_planes, out_planes, kernel_size=3, stride=1, groups=1):
        padding = (kernel_size - 1) // 2
        super(ConvBNReLU, self).__init__(
            nn.Conv2d(in_planes, out_planes, kernel_size, stride, padding, groups=groups, bias=False),
            nn.BatchNorm2d(out_planes, momentum=0.01),
            nn.ReLU6(inplace=True)
        )


class InvertedResidual(nn.Module):
    def __init__(self, inp, oup, stride, expand_ratio):
        super(InvertedResidual, self).__init__()
        self.stride = stride
        assert stride in [1, 2]

        hidden_dim = int(round(inp * expand_ratio))
        self.use_res_connect = self.stride == 1 and inp == oup

        layers = []
        if expand_ratio != 1:
            # pw
            layers.append(ConvBNReLU(inp, hidden_dim, kernel_size=1))
        layers.extend([
            # dw
            ConvBNReLU(hidden_dim, hidden_dim, stride=stride, groups=hidden_dim),
            # pw-linear
            nn.Conv2d(hidden_dim, oup, 1, 1, 0, bias=False),
            nn.BatchNorm2d(oup, momentum=0.01),
        ])
        self.conv = nn.Sequential(*layers)

    def forward(self, x):
        if self.use_res_connect:
            return x + self.conv(x)
        else:
            return self.conv(x)


class MobileNetV2(nn.Module):
    def __init__(self,
                 num_classes=1000,
                 width_mult=0.5,
                 inverted_residual_setting=None,
                 round_nearest=8,
                 block=None,
                 last_channel=1280):
        """
        MobileNet V2 main class
        Args:
            num_classes (int): Number of classes
            width_mult (float): Width multiplier - adjusts number of channels in each layer by this amount
            inverted_residual_setting: Network structure
            round_nearest (int): Round the number of channels in each layer to be a multiple of this number
            Set to 1 to turn off rounding
            block: Module specifying inverted residual building block for mobilenet
            last_channel (int): Number of channels fed to the classifier
        """
        super(MobileNetV2, self).__init__()

        if block is None:
            block = InvertedResidual
        input_channel = 8

        if inverted_residual_setting is None:
            inverted_residual_setting = [
                # t, c, n, s
                [6, 64, 4, 2],
                [6, 96, 3, 1],
                [6, 160, 3, 2],
                [6, 320, 1, 1],
            ]

        # only check the first element, assuming user knows t,c,n,s are required
        if len(inverted_residual_setting) == 0 or len(inverted_residual_setting[0]) != 4:
            raise ValueError("inverted_residual_setting should be non-empty "
                             "or a 4-element list, got {}".format(inverted_residual_setting))

        # building first layer
        input_channel = _make_divisible(input_channel, round_nearest)
        self.last_channel = _make_divisible(last_channel * max(1.0, width_mult), round_nearest)
        self.features_init = ConvBNReLU(3, input_channel, stride=2)

        self.unfold = nn.Unfold(kernel_size=(6, 6), stride=(4, 4))

        self.rnn_model = RNNPool(6, 6, 8, 8, input_channel)
        self.fold = nn.Fold(kernel_size=(1, 1), output_size=(27, 27))

        self.rnn_model_end = RNNPool(7, 7, int(self.last_channel / 4), int(self.last_channel / 4), self.last_channel)

        features = []

        input_channel = 32

        # building inverted residual blocks
        for t, c, n, s in inverted_residual_setting:
            output_channel = _make_divisible(c * width_mult, round_nearest)
            for i in range(n):
                stride = s if i == 0 else 1
                features.append(block(input_channel, output_channel, stride, expand_ratio=t))
                input_channel = output_channel
        # building last several layers
        features.append(ConvBNReLU(input_channel, self.last_channel, kernel_size=1))
        # make it nn.Sequential
        self.features = nn.Sequential(*features)

        # building classifier
        self.classifier = nn.Sequential(
            #nn.Dropout(0.2),
            nn.Linear(self.last_channel, num_classes),
        )

        # weight initialization
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight, mode='fan_out')
                if m.bias is not None:
                    nn.init.zeros_(m.bias)
            elif isinstance(m, nn.BatchNorm2d):
                nn.init.ones_(m.weight)
                nn.init.zeros_(m.bias)
            elif isinstance(m, nn.Linear):
                nn.init.normal_(m.weight, 0, 0.01)
                nn.init.zeros_(m.bias)

    def forward(self, x):
        batch_size = x.shape[0]

        x = self.features_init(x)

        patches = self.unfold(x)
        patches = torch.cat(torch.unbind(patches, dim=2), dim=0)
        patches = torch.reshape(patches, (-1, 8, 6, 6))

        output_x = int((x.shape[2] - 6) / 4 + 1)
        output_y = int((x.shape[3] - 6) / 4 + 1)
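        # Shape sketch (added note; numbers assume a 224x224 RGB input):
        # features_init (stride 2) gives x of shape (B, 8, 112, 112); unfolding
        # with a 6x6 kernel and stride 4 yields output_x = output_y = 27, i.e.
        # 729 patches of shape (8, 6, 6) per image. RNNPool maps each patch to a
        # 4*8 = 32-dim vector, and fold below reassembles a (B, 32, 27, 27) map.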
        rnnX = self.rnn_model(patches, int(batch_size) * output_x * output_y)

        x = torch.stack(torch.split(rnnX, split_size_or_sections=int(batch_size), dim=0), dim=2)

        x = self.fold(x)

        x = F.pad(x, (0, 1, 0, 1), mode='replicate')

        x = self.features(x)
        x = self.rnn_model_end(x, batch_size)
        x = self.classifier(x)
        return x


def mobilenetv2_rnnpool(pretrained=False, progress=True, **kwargs):
    """
    Constructs a MobileNetV2 architecture from
    `"MobileNetV2: Inverted Residuals and Linear Bottlenecks" <https://arxiv.org/abs/1801.04381>`_.
    Args:
        pretrained (bool): If True, returns a model pre-trained on ImageNet
        progress (bool): If True, displays a progress bar of the download to stderr
    """
    model = MobileNetV2(**kwargs)
    if pretrained:
        state_dict = load_state_dict_from_url(model_urls['mobilenet_v2'],
                                              progress=progress)
        model.load_state_dict(state_dict)
    return model
@ -0,0 +1,206 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT license.

import re
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import torch.utils.checkpoint as cp
from collections import OrderedDict
from torchvision.models.utils import load_state_dict_from_url
from edgeml_pytorch.graph.rnnpool import *

__all__ = ['MobileNetV2', 'mobilenetv2_rnnpool']


model_urls = {
    'mobilenet_v2': 'https://download.pytorch.org/models/mobilenet_v2-b0353104.pth',
}


def _make_divisible(v, divisor, min_value=None):
    """
    This function is taken from the original tf repo.
    It ensures that all layers have a channel number that is divisible by 8.
    It can be seen here:
    https://github.com/tensorflow/models/blob/master/research/slim/nets/mobilenet/mobilenet.py
    :param v:
    :param divisor:
    :param min_value:
    :return:
    """
    if min_value is None:
        min_value = divisor
    new_v = max(min_value, int(v + divisor / 2) // divisor * divisor)
    # Make sure that round down does not go down by more than 10%.
    if new_v < 0.9 * v:
        new_v += divisor
    return new_v


class ConvBNReLU(nn.Sequential):
    def __init__(self, in_planes, out_planes, kernel_size=3, stride=1, groups=1):
        padding = (kernel_size - 1) // 2
        super(ConvBNReLU, self).__init__(
            nn.Conv2d(in_planes, out_planes, kernel_size, stride, padding, groups=groups, bias=False),
            nn.BatchNorm2d(out_planes, momentum=0.01),
            nn.ReLU6(inplace=True)
        )


class InvertedResidual(nn.Module):
    def __init__(self, inp, oup, stride, expand_ratio):
        super(InvertedResidual, self).__init__()
        self.stride = stride
        assert stride in [1, 2]

        hidden_dim = int(round(inp * expand_ratio))
        self.use_res_connect = self.stride == 1 and inp == oup

        layers = []
        if expand_ratio != 1:
            # pw
            layers.append(ConvBNReLU(inp, hidden_dim, kernel_size=1))
        layers.extend([
            # dw
            ConvBNReLU(hidden_dim, hidden_dim, stride=stride, groups=hidden_dim),
            # pw-linear
            nn.Conv2d(hidden_dim, oup, 1, 1, 0, bias=False),
            nn.BatchNorm2d(oup, momentum=0.01),
        ])
        self.conv = nn.Sequential(*layers)

    def forward(self, x):
        if self.use_res_connect:
            return x + self.conv(x)
        else:
            return self.conv(x)


class MobileNetV2(nn.Module):
    def __init__(self,
                 num_classes=1000,
                 width_mult=0.5,
                 inverted_residual_setting=None,
                 round_nearest=8,
                 block=None,
                 last_channel=1280):
        """
        MobileNet V2 main class
        Args:
            num_classes (int): Number of classes
            width_mult (float): Width multiplier - adjusts number of channels in each layer by this amount
            inverted_residual_setting: Network structure
            round_nearest (int): Round the number of channels in each layer to be a multiple of this number
            Set to 1 to turn off rounding
            block: Module specifying inverted residual building block for mobilenet
            last_channel (int): Number of channels fed to the classifier
        """
        super(MobileNetV2, self).__init__()

        if block is None:
            block = InvertedResidual
        input_channel = 8

        if inverted_residual_setting is None:
            inverted_residual_setting = [
                # t, c, n, s
                [6, 64, 4, 2],
                [6, 96, 3, 1],
                [6, 160, 3, 2],
                [6, 320, 1, 1],
            ]

        # only check the first element, assuming user knows t,c,n,s are required
        if len(inverted_residual_setting) == 0 or len(inverted_residual_setting[0]) != 4:
            raise ValueError("inverted_residual_setting should be non-empty "
                             "or a 4-element list, got {}".format(inverted_residual_setting))

        # building first layer
        input_channel = _make_divisible(input_channel, round_nearest)
        self.last_channel = _make_divisible(last_channel * max(1.0, width_mult), round_nearest)
        self.features_init = ConvBNReLU(3, input_channel, stride=2)

        self.unfold = nn.Unfold(kernel_size=(6, 6), stride=(4, 4))

        self.rnn_model = RNNPool(6, 6, 8, 8, input_channel)
        self.fold = nn.Fold(kernel_size=(1, 1), output_size=(27, 27))

        features = []

        input_channel = 32

        # building inverted residual blocks
        for t, c, n, s in inverted_residual_setting:
            output_channel = _make_divisible(c * width_mult, round_nearest)
            for i in range(n):
                stride = s if i == 0 else 1
                features.append(block(input_channel, output_channel, stride, expand_ratio=t))
                input_channel = output_channel
        # building last several layers
        features.append(ConvBNReLU(input_channel, self.last_channel, kernel_size=1))
        self.features = nn.Sequential(*features)

        # building classifier
        self.classifier = nn.Sequential(
            nn.Dropout(0.2),
            nn.Linear(self.last_channel, num_classes),
        )

        # weight initialization
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight, mode='fan_out')
                if m.bias is not None:
                    nn.init.zeros_(m.bias)
            elif isinstance(m, nn.BatchNorm2d):
                nn.init.ones_(m.weight)
                nn.init.zeros_(m.bias)
            elif isinstance(m, nn.Linear):
                nn.init.normal_(m.weight, 0, 0.01)
                nn.init.zeros_(m.bias)

    def forward(self, x):
        batch_size = x.shape[0]

        x = self.features_init(x)

        patches = self.unfold(x)
        patches = torch.cat(torch.unbind(patches, dim=2), dim=0)
        patches = torch.reshape(patches, (-1, 8, 6, 6))

        output_x = int((x.shape[2] - 6) / 4 + 1)
        output_y = int((x.shape[3] - 6) / 4 + 1)

        rnnX = self.rnn_model(patches, int(batch_size) * output_x * output_y)

        x = torch.stack(torch.split(rnnX, split_size_or_sections=int(batch_size), dim=0), dim=2)

        x = self.fold(x)

        x = F.pad(x, (0, 1, 0, 1), mode='replicate')

        x = self.features(x)
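        # Added note: this variant pools the final feature map with plain global
        # average pooling; the 2-RNNPool variant uses a second RNNPool here instead.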
        x = x.mean([2, 3])
        x = self.classifier(x)
        return x


def mobilenetv2_rnnpool(pretrained=False, progress=True, **kwargs):
    """
    Constructs a MobileNetV2 architecture from
    `"MobileNetV2: Inverted Residuals and Linear Bottlenecks" <https://arxiv.org/abs/1801.04381>`_.
    Args:
        pretrained (bool): If True, returns a model pre-trained on ImageNet
        progress (bool): If True, displays a progress bar of the download to stderr
    """
    model = MobileNetV2(**kwargs)
    if pretrained:
        state_dict = load_state_dict_from_url(model_urls['mobilenet_v2'],
                                              progress=progress)
        model.load_state_dict(state_dict)
    return model
@ -0,0 +1,9 @@
pycocotools
pyvww
easydict==1.9
importlib-metadata==1.5.0
matplotlib==3.2.1
opencv-python-headless==4.2.0.32
scikit-image==0.15.0
tensorboard==1.14.0
tensorboardX==1.9
@ -0,0 +1,120 @@
## Code from https://github.com/Mxbonn/visualwakewords

"""Create maxitrain and minival annotations.
This script generates a new train validation split with 115k training and 8k validation images.
Based on the split used by Google
(https://raw.githubusercontent.com/tensorflow/models/master/research/object_detection/data/mscoco_minival_ids.txt).

Usage:
From this folder, run the following commands: (2014 can be replaced by 2017 if you downloaded the 2017 dataset)
    TRAIN_ANNOTATIONS_FILE="path-to-mscoco-dataset/annotations/instances_train2014.json"
    VAL_ANNOTATIONS_FILE="path-to-mscoco-dataset/annotations/instances_val2014.json"
    OUTPUT_DIR="path-to-mscoco-dataset/annotations/"
    python create_coco_train_minival_split.py \
      --train_annotations_file="${TRAIN_ANNOTATIONS_FILE}" \
      --val_annotations_file="${VAL_ANNOTATIONS_FILE}" \
      --output_dir="${OUTPUT_DIR}"
"""
import json
import os
from argparse import ArgumentParser


def create_maxitrain_minival(train_file, val_file, output_dir):
    """ Generate maxitrain and minival annotations files.
    Loads COCO 2014/2017 train and validation json files and creates a new split with
    115k training images and 8k validation images.
    Based on the split used by Google
    (https://raw.githubusercontent.com/tensorflow/models/master/research/object_detection/data/mscoco_minival_ids.txt).
    Args:
        train_file: JSON file containing COCO 2014 or 2017 train annotations
        val_file: JSON file containing COCO 2014 or 2017 validation annotations
        output_dir: Directory where the new annotation files will be stored.
    """
    maxitrain_path = os.path.join(output_dir, 'instances_maxitrain.json')
    minival_path = os.path.join(output_dir, 'instances_minival.json')
    with open(train_file, 'r') as f:
        train_json = json.load(f)
    with open(val_file, 'r') as f:
        val_json = json.load(f)

    info = train_json['info']
    categories = train_json['categories']
    licenses = train_json['licenses']

    dir_path = os.path.dirname(os.path.realpath(__file__))
    file_path = os.path.join(dir_path, 'mscoco_minival_ids.txt')
    # Use a set for fast membership tests in the loops below.
    with open(file_path, 'r') as minival_ids_f:
        minival_ids = {int(i) for i in minival_ids_f.readlines()}

    train_images = train_json['images']
    val_images = val_json['images']
    train_annotations = train_json['annotations']
    val_annotations = val_json['annotations']

    maxitrain_images = []
    minival_images = []
    maxitrain_annotations = []
    minival_annotations = []

    for _images in [train_images, val_images]:
        for img in _images:
            img_id = img['id']
            if img_id in minival_ids:
                minival_images.append(img)
            else:
                maxitrain_images.append(img)

    for _annotations in [train_annotations, val_annotations]:
        for ann in _annotations:
            img_id = ann['image_id']
            if img_id in minival_ids:
                minival_annotations.append(ann)
            else:
                maxitrain_annotations.append(ann)

    with open(maxitrain_path, 'w') as fp:
        json.dump(
            {
                "info": info,
                "licenses": licenses,
                'images': maxitrain_images,
                'annotations': maxitrain_annotations,
                'categories': categories,
            }, fp)

    with open(minival_path, 'w') as fp:
        json.dump(
            {
                "info": info,
                "licenses": licenses,
                'images': minival_images,
                'annotations': minival_annotations,
                'categories': categories,
            }, fp)


def main(args):
    output_dir = os.path.realpath(os.path.expanduser(args.output_dir))
    train_annotations_file = os.path.realpath(os.path.expanduser(args.train_annotations_file))
    val_annotations_file = os.path.realpath(os.path.expanduser(args.val_annotations_file))

    if not os.path.isdir(output_dir):
        os.makedirs(output_dir)
    create_maxitrain_minival(train_annotations_file, val_annotations_file, output_dir)


if __name__ == '__main__':
    parser = ArgumentParser(description="Script that takes the 2014/2017 training and validation annotations and "
                                        "creates a train split of 115k images and a minival of 8k.")
    parser.add_argument('--train_annotations_file', type=str, required=True,
                        help='COCO2014/2017 Training annotations JSON file')
    parser.add_argument('--val_annotations_file', type=str, required=True,
                        help='COCO2014/2017 Validation annotations JSON file')
    parser.add_argument('--output_dir', type=str, required=True,
                        help='Output directory where the maxitrain and minival annotations files will be stored')

    args = parser.parse_args()
    main(args)
@ -0,0 +1,217 @@
## Code from https://github.com/Mxbonn/visualwakewords

"""Create Visual Wakewords annotations.
This script generates the Visual WakeWords dataset annotations from the raw COCO dataset.
The resulting annotations can then be used with `pyvww.utils.VisualWakeWords` and
`pyvww.pytorch.VisualWakeWordsClassification`.

Visual WakeWords Dataset is derived from the COCO dataset to design tiny models
classifying two classes, such as person/not-person. The COCO annotations
are filtered to two classes: foreground_class and background
(e.g. person and not-person). Bounding boxes for small objects
with area less than 5% of the image area are filtered out.
The resulting annotations file follows the COCO data format.
    {
        "info": info,
        "images": [image],
        "annotations": [annotation],
        "licenses": [license],
    }

    info{
        "year": int,
        "version": str,
        "description": str,
        "url": str,
    }

    image{
        "id": int,
        "width": int,
        "height": int,
        "file_name": str,
        "license": int,
        "flickr_url": str,
        "coco_url": str,
        "date_captured": datetime,
    }

    license{
        "id": int,
        "name": str,
        "url": str,
    }

    annotation{
        "id": int,
        "image_id": int,
        "category_id": int,
        "area": float,
        "bbox": [x, y, width, height],
        "iscrowd": 0 or 1,
    }

Example usage:
From this folder, run the following commands:
    bash download_mscoco.sh path-to-mscoco-dataset
    TRAIN_ANNOTATIONS_FILE="path-to-mscoco-dataset/annotations/instances_train2014.json"
    VAL_ANNOTATIONS_FILE="path-to-mscoco-dataset/annotations/instances_val2014.json"
    DIR="path-to-mscoco-dataset/annotations/"
    python create_coco_train_minival_split.py \
      --train_annotations_file="${TRAIN_ANNOTATIONS_FILE}" \
      --val_annotations_file="${VAL_ANNOTATIONS_FILE}" \
      --output_dir="${DIR}"
    MAXITRAIN_ANNOTATIONS_FILE="path-to-mscoco-dataset/annotations/instances_maxitrain.json"
    MINIVAL_ANNOTATIONS_FILE="path-to-mscoco-dataset/annotations/instances_minival.json"
    VWW_OUTPUT_DIR="new-path-to-visualwakewords-dataset/annotations/"
    python create_visualwakewords_annotations.py \
      --train_annotations_file="${MAXITRAIN_ANNOTATIONS_FILE}" \
      --val_annotations_file="${MINIVAL_ANNOTATIONS_FILE}" \
      --output_dir="${VWW_OUTPUT_DIR}" \
      --threshold=0.005 \
      --foreground_class='person'
"""

import json
import os
from argparse import ArgumentParser

from pycocotools.coco import COCO


def create_visual_wakeword_annotations(annotations_file,
                                       visualwakewords_annotations_path,
                                       object_area_threshold,
                                       foreground_class_name):
    """Generate visual wake words annotations file.
    Loads COCO annotation json files and filters to foreground_class_name/not-foreground_class_name
    (by default it will be person/not-person) to generate visual wake words annotations file.
    Each image is assigned a label 1 or 0. The label 1 is assigned as long
    as it has at least one foreground_class_name (e.g. person)
    bounding box greater than object_area_threshold (e.g. 5% of the image area).
    Args:
        annotations_file: JSON file containing COCO bounding box annotations
        visualwakewords_annotations_path: output path to annotations file
        object_area_threshold: threshold on fraction of image area below which
            small object bounding boxes are filtered
        foreground_class_name: category from COCO dataset that is filtered by
            the visual wakewords dataset
    """
    print('Processing {}...'.format(annotations_file))
    coco = COCO(annotations_file)

    info = {"description": "Visual Wake Words Dataset",
            "url": "https://arxiv.org/abs/1906.05721",
            "version": "1.0",
            "year": 2019,
            }

    # default object of interest is person
    foreground_class_id = 1
    dataset = coco.dataset
    licenses = dataset['licenses']

    images = dataset['images']
    # Create category index
    foreground_category = None
    background_category = {'supercategory': 'background', 'id': 0, 'name': 'background'}
    for category in dataset['categories']:
        if category['name'] == foreground_class_name:
            foreground_class_id = category['id']
            foreground_category = category
    foreground_category['id'] = 1
    background_category['name'] = "not-{}".format(foreground_category['name'])
    categories = [background_category, foreground_category]

    if 'annotations' not in dataset:
        raise KeyError('Need annotations in json file to build the dataset.')
    new_ann_id = 0
    annotations = []
    positive_img_ids = set()
    foreground_imgs_ids = coco.getImgIds(catIds=foreground_class_id)
    for img_id in foreground_imgs_ids:
        img = coco.imgs[img_id]
        img_area = img['height'] * img['width']
        for ann_id in coco.getAnnIds(imgIds=img_id, catIds=foreground_class_id):
            ann = coco.anns[ann_id]
            if 'area' in ann:
                normalized_ann_area = ann['area'] / img_area
                if normalized_ann_area > object_area_threshold:
                    new_ann = {
                        "id": new_ann_id,
                        "image_id": img_id,
                        "category_id": 1,
                        "area": ann["area"],
                        "bbox": ann["bbox"],
                        "iscrowd": ann["iscrowd"],
                    }
                    annotations.append(new_ann)
                    positive_img_ids.add(img_id)
                    new_ann_id += 1
    print("There are {} images with label {}, out of {} images in total.".format(
        len(positive_img_ids), foreground_class_name, len(coco.imgs)))
    negative_img_ids = list(set(coco.imgs.keys()) - positive_img_ids)
    for img_id in negative_img_ids:
        new_ann = {
            "id": new_ann_id,
            "image_id": img_id,
            "category_id": 0,
            "area": 0.0,
            "bbox": [],
            "iscrowd": 0,
        }
        annotations.append(new_ann)
        new_ann_id += 1

    # Output Visual WakeWords annotations and labels
    with open(visualwakewords_annotations_path, 'w') as fp:
        json.dump(
            {
                "info": info,
                "licenses": licenses,
                'images': images,
                'annotations': annotations,
                'categories': categories,
            }, fp)


def main(args):
    output_dir = os.path.realpath(os.path.expanduser(args.output_dir))
    train_annotations_file = os.path.realpath(os.path.expanduser(args.train_annotations_file))
    val_annotations_file = os.path.realpath(os.path.expanduser(args.val_annotations_file))
    visualwakewords_annotations_train = os.path.join(output_dir, 'instances_train.json')
    visualwakewords_annotations_val = os.path.join(output_dir, 'instances_val.json')
    small_object_area_threshold = args.threshold
    foreground_class_of_interest = args.foreground_class

    # Create the Visual WakeWords annotations from COCO annotations
    if not os.path.isdir(output_dir):
        os.makedirs(output_dir)
    create_visual_wakeword_annotations(
        train_annotations_file, visualwakewords_annotations_train,
        small_object_area_threshold, foreground_class_of_interest)
    create_visual_wakeword_annotations(
        val_annotations_file, visualwakewords_annotations_val,
        small_object_area_threshold, foreground_class_of_interest)


if __name__ == '__main__':
    parser = ArgumentParser()
    parser.add_argument('--train_annotations_file', type=str, required=True,
                        help='(COCO) Training annotations JSON file')
    parser.add_argument('--val_annotations_file', type=str, required=True,
                        help='(COCO) Validation annotations JSON file')
    parser.add_argument('--output_dir', type=str, default='/tmp/visualwakewords/',
                        help='Output directory where the Visual WakeWords annotations files will be stored')
    parser.add_argument('--threshold', type=float, default=0.005,
                        help='Threshold of fraction of image area below which small objects are filtered.')
    parser.add_argument('--foreground_class', type=str, default='person',
                        help='Annotations will have a label indicating if this object is present or absent '
                             'in the scene (default is person/not-person).')

    args = parser.parse_args()
    main(args)
@ -0,0 +1,80 @@
# Copyright 2020 Maxim Bonnaerens. All Rights Reserved.
#
# Copyright 2019 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
# File modified from tensorflow/models/research/slim/datasets/download_mscoco.sh

# Script to download the COCO dataset. See
# http://cocodataset.org/#overview for an overview of the dataset.
#
# usage:
#  bash scripts/download_mscoco.sh path-to-COCO-dataset
#
set -e

YEAR=${2:-2014}
if [ -z "$1" ]; then
  echo "usage: download_mscoco.sh [data dir] (2014|2017)"
  exit 1
fi

if [ "$(uname)" == "Darwin" ]; then
  UNZIP="tar -xf"
else
  UNZIP="unzip -nq"
fi

# Create the output directories.
OUTPUT_DIR="${1%/}"
mkdir -p "${OUTPUT_DIR}"

# Helper function to download and unpack a .zip file.
function download_and_unzip() {
  local BASE_URL=${1}
  local FILENAME=${2}

  if [ ! -f "${FILENAME}" ]; then
    echo "Downloading ${FILENAME} to $(pwd)"
    wget -nd -c "${BASE_URL}/${FILENAME}"
  else
    echo "Skipping download of ${FILENAME}"
  fi
  echo "Unzipping ${FILENAME}"
  ${UNZIP} "${FILENAME}"
  rm "${FILENAME}"
}

cd "${OUTPUT_DIR}"

# Download the images.
BASE_IMAGE_URL="http://images.cocodataset.org/zips"

TRAIN_IMAGE_FILE="train${YEAR}.zip"
download_and_unzip ${BASE_IMAGE_URL} "${TRAIN_IMAGE_FILE}"
TRAIN_IMAGE_DIR="${OUTPUT_DIR}/train${YEAR}"

VAL_IMAGE_FILE="val${YEAR}.zip"
download_and_unzip ${BASE_IMAGE_URL} "${VAL_IMAGE_FILE}"
VAL_IMAGE_DIR="${OUTPUT_DIR}/val${YEAR}"

COMMON_DIR="all$YEAR"
mkdir -p "${COMMON_DIR}"
for i in ${TRAIN_IMAGE_DIR}/*; do cp --symbolic-link "$i" ${COMMON_DIR}/; done
for i in ${VAL_IMAGE_DIR}/*; do cp --symbolic-link "$i" ${COMMON_DIR}/; done

# Download the annotations.
BASE_INSTANCES_URL="http://images.cocodataset.org/annotations"
INSTANCES_FILE="annotations_trainval${YEAR}.zip"
download_and_unzip ${BASE_INSTANCES_URL} "${INSTANCES_FILE}"
(Diff of one file is not shown because of its large size.)
@ -0,0 +1,249 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT license.

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import torch.backends.cudnn as cudnn
import torchvision
import torchvision.models as models
import torchvision.transforms as transforms
import os
import argparse
import random
from PIL import Image
import numpy as np
from torchvision.datasets.vision import VisionDataset
from importlib import import_module
from pyvww.utils import VisualWakeWords


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

torch.backends.cudnn.benchmark = True
torch.backends.cudnn.enabled = True

best_acc = 0  # best test accuracy
start_epoch = 0

# Argument parser
parser = argparse.ArgumentParser(description='PyTorch Visual Wake Words Training')
parser.add_argument('--lr', default=0.05, type=float, help='learning rate')
parser.add_argument('--epochs', default=900, type=int, help='total epochs')
parser.add_argument('--resume', default=None, type=str, help='load from checkpoint')
parser.add_argument('--model_arch',
                    default='model_mobilenet_rnnpool', type=str,
                    choices=['model_mobilenet_rnnpool', 'model_mobilenet_2rnnpool'],
                    help='choose architecture among rpool variants')
parser.add_argument('--ann', default=None, type=str,
                    help='specify new-path-to-visualwakewords-dataset used in dataset creation step')
parser.add_argument('--data', default=None, type=str,
                    help='specify path-to-mscoco-dataset used in dataset creation step')
args = parser.parse_args()


# Data

class VisualWakeWordsClassification(VisionDataset):
    """`Visual Wake Words <https://arxiv.org/abs/1906.05721>`_ Dataset.
    Args:
        root (string): Root directory where COCO images are downloaded to.
        annFile (string): Path to json visual wake words annotation file.
        transform (callable, optional): A function/transform that takes in an PIL image
            and returns a transformed version. E.g, ``transforms.ToTensor``
        target_transform (callable, optional): A function/transform that takes in the
            target and transforms it.
    """
    def __init__(self, root, annFile, transform=None, target_transform=None, split='val'):
        self.vww = VisualWakeWords(annFile)
        self.ids = list(sorted(self.vww.imgs.keys()))
        self.split = split

        self.transform = transform
        self.target_transform = target_transform
        self.root = root

    def __getitem__(self, index):
        """
        Args:
            index (int): Index
        Returns:
            tuple: Tuple (image, target). target is the index of the target class.
        """
        vww = self.vww
        img_id = self.ids[index]
        ann_ids = vww.getAnnIds(imgIds=img_id)
        target = vww.loadAnns(ann_ids)[0]['category_id']

        path = vww.loadImgs(img_id)[0]['file_name']

        img = Image.open(os.path.join(self.root, path)).convert('RGB')

        if self.transform is not None:
            img = self.transform(img)

        if self.target_transform is not None:
            target = self.target_transform(target)

        return img, target

    def __len__(self):
        return len(self.ids)


normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225])

transform_train = transforms.Compose([
    transforms.RandomResizedCrop(size=(224, 224), scale=(0.2, 1.0)),
    transforms.RandomHorizontalFlip(),
    # transforms.RandomAffine(10, translate=None, shear=(5, 5, 5, 5), resample=False, fillcolor=0),
    # transforms.ColorJitter(brightness=(0.6, 1.4), saturation=(0.9, 1.1), hue=(-0.1, 0.1)),
    transforms.ToTensor(),
    normalize
])

transform_test = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    normalize
])


trainset = VisualWakeWordsClassification(root=os.path.join(args.data, 'all2014'),
                                         annFile=os.path.join(args.ann, 'annotations/instances_train.json'),
                                         transform=transform_train, split='train')

trainloader = torch.utils.data.DataLoader(trainset, batch_size=256, shuffle=True,
                                          num_workers=32)

testset = VisualWakeWordsClassification(root=os.path.join(args.data, 'all2014'),
                                        annFile=os.path.join(args.ann, 'annotations/instances_val.json'),
                                        transform=transform_test, split='val')

testloader = torch.utils.data.DataLoader(testset, batch_size=256, shuffle=False,
                                         num_workers=32)


# Model

module = import_module(args.model_arch)
model = module.mobilenetv2_rnnpool(num_classes=2, width_mult=0.35, last_channel=320)
model = model.to(device)
model = torch.nn.DataParallel(model)


if args.resume:
    # Load checkpoint.
    print('==> Resuming from checkpoint..')
    assert os.path.isdir('./checkpoints/'), 'Error: no checkpoint directory found!'
    checkpoint = torch.load('./checkpoints/' + args.resume)
    model.load_state_dict(checkpoint['model'])
    best_acc = checkpoint['acc']
    start_epoch = checkpoint['epoch']

criterion = nn.CrossEntropyLoss().cuda()

optimizer = optim.SGD(model.parameters(), lr=args.lr, momentum=0.9, weight_decay=4e-5)


# Training
def train(epoch):
    print('\nEpoch: %d' % epoch)
    model.train()
    train_loss = 0
    correct = 0
    total = 0
    train_loader_len = len(trainloader)
    for batch_idx, (inputs, targets) in enumerate(trainloader):
        adjust_learning_rate(optimizer, epoch, batch_idx, train_loader_len)

        batch_size = inputs.shape[0]
        inputs, targets = inputs.to(device), targets.to(device)
        optimizer.zero_grad()
        outputs = model(inputs)

        loss = criterion(outputs, targets)
        loss.backward()
        optimizer.step()

        train_loss += loss.item()
        _, predicted = outputs.max(1)
        total += targets.size(0)
        correct += predicted.eq(targets).sum().item()

    print('train_loss: ', train_loss / total, ' acc: ', correct / total)
    print('->>lr:{:.6f}'.format(optimizer.param_groups[0]['lr']))


def test(epoch):
    global best_acc
    model.eval()
    test_loss = 0
    correct = 0
    total = 0
    with torch.no_grad():
        for batch_idx, (inputs, targets) in enumerate(testloader):
            batch_size = inputs.shape[0]
            inputs, targets = inputs.to(device), targets.to(device)
            outputs = model(inputs)

            loss = criterion(outputs, targets)

            test_loss += loss.item()
            _, predicted = outputs.max(1)
            total += targets.size(0)
            correct += predicted.eq(targets).sum().item()

    print('test_loss: ', test_loss / total, ' test_acc: ', correct / total)

    # Save checkpoint.
    print('best acc: ', best_acc)
    acc = 100. * correct / total
    if acc > best_acc:
        print('Saving..')
        state = {
            'model': model.state_dict(),
            'acc': acc,
            'epoch': epoch,
        }
        if not os.path.isdir('./checkpoints/'):
            os.mkdir('./checkpoints/')
        torch.save(state, './checkpoints/model_mobilenet_rnnpool.pth')
        best_acc = acc
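
# Added note: with warmup_epoch = 0, adjust_learning_rate below is plain cosine
# annealing over 150 epochs' worth of iterations:
#   lr(t) = args.lr * (1 + cos(pi * t / max_iter)) / 2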
from math import cos, pi
def adjust_learning_rate(optimizer, epoch, iteration, num_iter):
    lr = optimizer.param_groups[0]['lr']

    warmup_epoch = 0
    warmup_iter = warmup_epoch * num_iter
    current_iter = iteration + epoch * num_iter
    max_iter = 150 * num_iter

    lr = args.lr * (1 + cos(pi * (current_iter - warmup_iter) / (max_iter - warmup_iter))) / 2

    if epoch < warmup_epoch:
        lr = args.lr * current_iter / warmup_iter

    for param_group in optimizer.param_groups:
        param_group['lr'] = lr


for epoch in range(start_epoch, start_epoch + args.epochs):
    train(epoch)
    test(epoch)
@ -0,0 +1,18 @@
# RNNPool quantized sample code

The `rnnpool_quantized.cpp` code takes the activations preceding the RNNPool layer
and produces the output of a quantized RNNPool layer. The input numpy file consists
of all activation patches corresponding to a single image. In `trace_0_input.npy`,
there are 6241 patches of dimensions 8x8 with 4 channels to which RNNPool is applied.
The output is of size 6241*4*8. This can be compared to the floating point output stored in
`trace_0_output.npy`.

```shell
g++ -o rnnpool_quantized rnnpool_quantized.cpp

# Usage: ./rnnpool_quantized <#patches> <input.npy> <output.npy>
./rnnpool_quantized 6241 trace_0_input.npy trace_0_output_quantized.npy
```

Copyright (c) Microsoft Corporation. All rights reserved.
Licensed under the MIT license
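
A quick sanity check is to compare the quantized run against the float trace
directly. This is a minimal sketch (not part of the repository), assuming both
files sit in the current directory and hold the same number of patches:

```python
import numpy as np

# Float reference trace and the quantized output produced by the command above.
expected = np.load('trace_0_output.npy')
quantized = np.load('trace_0_output_quantized.npy')

# Flatten per-patch dimensions in case the shapes differ, e.g. (N, 1, 32) vs (N, 32).
expected = expected.reshape(expected.shape[0], -1)
quantized = quantized.reshape(quantized.shape[0], -1)

# Report worst-case and average absolute deviation from the float output.
err = np.abs(expected - quantized)
print('max abs error :', err.max())
print('mean abs error:', err.mean())
```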
@ -0,0 +1,59 @@
// Copyright (c) Microsoft Corporation. All rights reserved.
// Licensed under the MIT license

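// Added note (inferred, not in the original header): the weights below are
// fixed-point int16 values; the trailing comment on each block (//16384 or
// //32768) appears to record the scale, i.e. real_value = stored_int / scale,
// matching the scale table in rnnpool_quantized.cpp.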
int16_t W1[4][8] = {{7069, 3262, 5513, 4733, -2708, -10109, 5233, 4489}
, {-10390, -38, -17036, -404, -1288, -138, 226, -1100}
, {1562, -1144, -14616, 4106, -18129, 2064, 831, 2845}
, {-1993, -996, -6637, -1105, -1833, 1207, -1910, -1262}
}; //16384

int16_t U1[8][8] = {{15238, -4081, -18973, -1468, 3401, 12650, -911, -1588}
, {-1372, -2625, 23200, 5474, 7390, -3379, 5065, 7849}
, {-931, -10160, -4142, -3773, 1400, 1952, -7027, -4937}
, {-311, 3353, 10395, -410, 2437, 426, 5921, 4664}
, {3195, -2369, -20748, -7006, 5303, -2544, -1009, -11564}
, {-4775, 5477, -4431, 2161, 829, 18282, 1428, 3197}
, {-435, 4946, 11025, 4571, 1986, -2559, -1213, 4943}
, {16, 3484, 10337, 5800, 2855, 549, 5397, 561}
}; //32768

int16_t Bg1[1][8] = {{-18778, -9519, 4055, -7310, 8584, -17258, -5281, -7934}
}; //16384

int16_t Bh1[1][8] = {{9658, 19740, -10058, 19114, 17227, 12226, 19080, 15855}
}; //32768

int16_t zeta1 = 32522; //32768

int16_t nu1 = 235; //32768

int16_t W2[8][8] = {{-850, 359, -9842, 5701, 7390, -4590, -3959, 2759}
, {-1536, -6107, -1978, -5420, -1215, -5065, 77, -4658}
, {10036, -340, 745, -3625, 1684, -1927, 2312, 2028}
, {-3593, -1295, -997, -1, 1441, 2806, -1718, -3687}
, {-287, -221, -1398, 439, -1651, 3409, -19972, -193}
, {-6120, -4338, -1679, -9576, 13070, -12784, -56, -5648}
, {-5623, -2853, -862, -3739, 2595, -285, -673, -5104}
, {-3761, -842, -713, 396, 1405, 3339, -1477, -3670}
}; //16384

int16_t U2[8][8] = {{8755, 2010, -3642, -913, 5998, -2312, -389, -1571}
, {-906, 9661, -1875, -328, 4034, -3910, -355, -5117}
, {-2433, 1688, 1328, -1493, 4122, 769, -177, 9988}
, {-2759, 2240, 1795, 6117, 6542, -6011, 710, 283}
, {-3163, 5634, 15468, -1189, 704, -1739, 483, 3409}
, {-4224, 5383, -15324, -2616, 19957, 2042, -579, -319}
, {181, -1085, 863, 1111, -4614, 4177, 3342, 4059}
, {312, 996, -3600, -867, 2397, -1214, -917, 8633}
}; //16384

int16_t Bg2[1][8] = {{-5411, -15415, -13003, -12122, -18931, -17923, -8693, -12151}
}; //16384

int16_t Bh2[1][8] = {{21417, 6457, 6421, 8970, 6601, 836, 3060, 8468}
}; //16384

int16_t zeta2 = 32520; //32768

int16_t nu2 = 256; //32768
@ -0,0 +1,535 @@
// Copyright (c) Microsoft Corporation. All rights reserved.
// Licensed under the MIT license.

#include <fstream>
#include <iostream>
#include <unordered_map>
#include <string>
#include <cstring>
#include <cstdlib>
#include <cmath>
#include <algorithm>
#include <vector>

using namespace std;

#define MYINT int16_t
#define MYITE int16_t

#include "data.h"

#define SHIFT

#ifdef SHIFT
#define MYSCL int16_t
unordered_map<string, MYSCL> scale = {
    {"X", 12},

    {"one", 14},

    {"W1", 14},
    {"H1", 14},
    {"U1", 15},
    {"Bg1", 14},
    {"Bh1", 15},
    {"zeta1", 15},
    {"nu1", 15},

    {"a1", 11},
    {"b1", 13},
    {"c1", 11},
    {"cBg1", 11},
    {"cBh1", 11},
    {"g1", 14},
    {"h1", 14},
    {"z1", 14},
    {"y1", 14},
    {"w1", 14},
    {"v1", 14},

    {"intermediate", 14},

    {"W2", 14},
    {"H2", 14},
    {"U2", 14},
    {"Bg2", 14},
    {"Bh2", 14},
    {"zeta2", 15},
    {"nu2", 15},

    {"a2", 14},
    {"b2", 13},
    {"c2", 13},
    {"cBg2", 11},
    {"cBh2", 11},
    {"g2", 14},
    {"h2", 14},
    {"z2", 15},
    {"y2", 14},
    {"w2", 14},
    {"v2", 14},

    {"Y", 14},
};
#else
#define MYSCL int32_t
unordered_map<string, MYSCL> scale = {
    {"X", 4096},

    {"one", 16384},

    {"W1", 16384},
    {"H1", 16384},
    {"U1", 32768},
    {"Bg1", 16384},
    {"Bh1", 32768},
    {"zeta1", 32768},
    {"nu1", 32768},

    {"a1", 2048},
    {"b1", 8192},
    {"c1", 2048},
    {"cBg1", 2048},
    {"cBh1", 2048},
    {"g1", 16384},
    {"h1", 16384},
    {"z1", 16384},
    {"y1", 16384},
    {"w1", 16384},
    {"v1", 16384},

    {"intermediate", 16384},

    {"W2", 16384},
    {"H2", 16384},
    {"U2", 16384},
    {"Bg2", 16384},
    {"Bh2", 16384},
    {"zeta2", 32768},
    {"nu2", 32768},

    {"a2", 16384},
    {"b2", 8192},
    {"c2", 8192},
    {"cBg2", 2048},
    {"cBh2", 2048},
    {"g2", 16384},
    {"h2", 16384},
    {"z2", 32768},
    {"y2", 16384},
    {"w2", 16384},
    {"v2", 16384},

    {"Y", 16384},
};
#endif

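
// Added note: in SHIFT mode a quantity with scale s stores round(real * 2^s) as
// int16. MatMul right-shifts each int32 product by addshr (~log2(J)) to keep the
// accumulator in range, then shifts the sum by scA + scB - scC - addshr so the
// result lands on the output scale scC.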
void MatMul(int16_t* A, int16_t* B, int16_t* C, MYINT I, MYINT J, MYINT K, MYSCL scA, MYSCL scB, MYSCL scC) {

#ifdef SHIFT
    MYSCL addshrP = 1, addshr = 0;
    while (addshrP < J) {
        addshrP *= 2;
        addshr += 1;
    }
#else
    MYSCL addshr = 1;
    while (addshr < J)
        addshr *= 2;
#endif

#ifdef SHIFT
    MYSCL shr = scA + scB - scC - addshr;
#else
    MYSCL shr = (scA * scB) / (scC * addshr);
#endif

    for (int i = 0; i < I; i++) {
        for (int k = 0; k < K; k++) {
            int32_t s = 0;
            for (int j = 0; j < J; j++) {
#ifdef SHIFT
                s += ((int32_t)A[i * J + j] * (int32_t)B[j * K + k]) >> addshr;
#else
                s += ((int32_t)A[i * J + j] * (int32_t)B[j * K + k]) / addshr;
#endif
            }
#ifdef SHIFT
            C[i * K + k] = s >> shr;
#else
            C[i * K + k] = s / shr;
#endif
        }
    }
}

inline MYINT min(MYINT a, MYINT b) {
    return a < b ? a : b;
}

inline MYINT max(MYINT a, MYINT b) {
    return a > b ? a : b;
}

void MatAdd(int16_t* A, int16_t* B, int16_t* C, MYINT I, MYINT J, MYSCL scA, MYSCL scB, MYSCL scC) {

    MYSCL shrmin = min(scA, scB);
#ifdef SHIFT
    MYSCL shra = scA - shrmin;
    MYSCL shrb = scB - shrmin;
    MYSCL shrc = shrmin - scC;
#else
    MYSCL shra = scA / shrmin;
    MYSCL shrb = scB / shrmin;
    MYSCL shrc = shrmin / scC;
#endif

    for (int i = 0; i < I; i++) {
        for (int j = 0; j < J; j++) {
#ifdef SHIFT
            C[i * J + j] = ((A[i * J + j] >> (shra + shrc)) + (B[i * J + j] >> (shrb + shrc)));
#else
            C[i * J + j] = ((A[i * J + j] / (shra * shrc)) + (B[i * J + j] / (shrb * shrc)));
#endif
        }
    }
}

void ScalarMatSub(int16_t A, int16_t* B, int16_t* C, MYINT I, MYINT J, MYSCL scA, MYSCL scB, MYSCL scC) {

    MYSCL shrmin = min(scA, scB);
#ifdef SHIFT
    MYSCL shra = scA - shrmin;
    MYSCL shrb = scB - shrmin;
    MYSCL shrc = shrmin - scC;
#else
    MYSCL shra = scA / shrmin;
    MYSCL shrb = scB / shrmin;
    MYSCL shrc = shrmin / scC;
#endif

    for (int i = 0; i < I; i++) {
        for (int j = 0; j < J; j++) {
#ifdef SHIFT
            C[i * J + j] = ((A >> (shra + shrc)) - (B[i * J + j] >> (shrb + shrc)));
#else
            C[i * J + j] = ((A / (shra * shrc)) - (B[i * J + j] / (shrb * shrc)));
#endif
        }
    }
}

void ScalarMatAdd(int16_t A, int16_t* B, int16_t* C, MYINT I, MYINT J, MYSCL scA, MYSCL scB, MYSCL scC) {

    MYSCL shrmin = min(scA, scB);
#ifdef SHIFT
    MYSCL shra = scA - shrmin;
    MYSCL shrb = scB - shrmin;
    MYSCL shrc = shrmin - scC;
#else
    MYSCL shra = scA / shrmin;
    MYSCL shrb = scB / shrmin;
    MYSCL shrc = shrmin / scC;
#endif

    for (int i = 0; i < I; i++) {
        for (int j = 0; j < J; j++) {
#ifdef SHIFT
            C[i * J + j] = ((A >> (shra + shrc)) + (B[i * J + j] >> (shrb + shrc)));
#else
            C[i * J + j] = ((A / (shra * shrc)) + (B[i * J + j] / (shrb * shrc)));
#endif
        }
    }
}

void HadMul(int16_t* A, int16_t* B, int16_t* C, MYINT I, MYINT J, MYSCL scA, MYSCL scB, MYSCL scC) {

#ifdef SHIFT
    MYSCL shr = (scA + scB) - scC;
#else
    MYSCL shr = (scA * scB) / scC;
#endif

    for (int i = 0; i < I; i++) {
        for (int j = 0; j < J; j++) {
#ifdef SHIFT
            C[i * J + j] = (((int32_t)A[i * J + j]) * ((int32_t)B[i * J + j])) >> shr;
#else
            C[i * J + j] = (((int32_t)A[i * J + j]) * ((int32_t)B[i * J + j])) / shr;
#endif
        }
    }
}

void ScalarMul(int16_t A, int16_t* B, int16_t* C, MYINT I, MYINT J, MYSCL scA, MYSCL scB, MYSCL scC) {

#ifdef SHIFT
    MYSCL shr = (scA + scB) - scC;
#else
    MYSCL shr = (scA * scB) / scC;
#endif

    for (int i = 0; i < I; i++) {
        for (int j = 0; j < J; j++) {
#ifdef SHIFT
            C[i * J + j] = ((int32_t)(A) * (int32_t)(B[i * J + j])) >> shr;
#else
            C[i * J + j] = ((int32_t)(A) * (int32_t)(B[i * J + j])) / shr;
#endif
        }
    }
}

void SigmoidNew16(int16_t* A, MYINT I, MYINT J, int16_t* B) {
    for (MYITE i = 0; i < I; i++) {
        for (MYITE j = 0; j < J; j++) {
            int16_t a = A[i * J + j];
            B[i * J + j] = 8 * max(min((a + 2048) / 2, 2048), 0);
        }
    }
    return;
}

void TanHNew16(int16_t* A, MYINT I, MYINT J, int16_t* B) {
    for (MYITE i = 0; i < I; i++) {
        for (MYITE j = 0; j < J; j++) {
            int16_t a = A[i * J + j];
            B[i * J + j] = 8 * max(min(a, 2048), -2048);
        }
    }
    return;
}

void reverse(int16_t* A, int16_t* B, int I, int J) {
    for (int i = 0; i < I; i++) {
        for (int j = 0; j < J; j++) {
            B[i * J + j] = A[(I - i - 1) * J + j];
        }
    }
}


void print(int16_t* var, int I, int J, MYSCL scale) {
    for (int i = 0; i < I; i++) {
        for (int j = 0; j < J; j++) {
            cout << ((float)var[i * J + j]) / scale << " ";
        }
        cout << endl;
    }
}
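
// Added note: FastGRNN1/FastGRNN2 implement the FastGRNN recurrence in fixed
// point. In float terms, per timestep:
//   g_t  = sigmoid(W x_t + U h_{t-1} + b_g)
//   h~_t = tanh(W x_t + U h_{t-1} + b_h)
//   h_t  = g_t .* h_{t-1} + (zeta * (1 - g_t) + nu) .* h~_t
// In the code below, v = zeta*(1 - g) + nu is computed but the candidate state
// is gated with w = zeta*(1 - g) alone, i.e. the small nu term is dropped.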
void FastGRNN1(int16_t X[8][4], int16_t* H, int timestep) {
|
||||||
|
memset(&H[0], 0, 8 * 2);
|
||||||
|
|
||||||
|
for (int i = 0; i < timestep; i++) {
|
||||||
|
int16_t a[1][8];
|
||||||
|
MatMul(&X[i][0], &W1[0][0], &a[0][0], 1, 4, 8, scale["X"], scale["W1"], scale["a1"]);
|
||||||
|
int16_t b[1][8];
|
||||||
|
MatMul(&H[0], &U1[0][0], &b[0][0], 1, 8, 8, scale["H1"], scale["U1"], scale["b1"]);
|
||||||
|
int16_t c[1][8];
|
||||||
|
MatAdd(&a[0][0], &b[0][0], &c[0][0], 1, 8, scale["a1"], scale["b1"], scale["c1"]);
|
||||||
|
int16_t cBg[1][8];
|
||||||
|
MatAdd(&c[0][0], &Bg1[0][0], &cBg[0][0], 1, 8, scale["c1"], scale["Bg1"], scale["cBg1"]);
|
||||||
|
int16_t g[1][8];
|
||||||
|
SigmoidNew16(&cBg[0][0], 1, 8, &g[0][0]);
|
||||||
|
int16_t cBh[1][8];
|
||||||
|
MatAdd(&c[0][0], &Bh1[0][0], &cBh[0][0], 1, 8, scale["c1"], scale["Bh1"], scale["cBh1"]);
|
||||||
|
int16_t h[1][8];
|
||||||
|
TanHNew16(&cBh[0][0], 1, 8, &h[0][0]);
|
||||||
|
int16_t z[1][8];
|
||||||
|
HadMul(&g[0][0], &H[0], &z[0][0], 1, 8, scale["g1"], scale["H1"], scale["z1"]);
|
||||||
|
int16_t y[1][8];
|
||||||
|
ScalarMatSub(16384, &g[0][0], &y[0][0], 1, 8, scale["one"], scale["g1"], scale["y1"]);
|
||||||
|
int16_t w[1][8];
|
||||||
|
ScalarMul(zeta1, &y[0][0], &w[0][0], 1, 8, scale["zeta1"], scale["y1"], scale["w1"]);
|
||||||
|
int16_t v[1][8];
|
||||||
|
ScalarMatAdd(nu1, &w[0][0], &v[0][0], 1, 8, scale["nu1"], scale["w1"], scale["v1"]);
|
||||||
|
int16_t u[1][8];
|
||||||
|
HadMul(&w[0][0], &h[0][0], &u[0][0], 1, 8, scale["w1"], scale["h1"], scale["u1"]);
|
||||||
|
|
||||||
|
MatAdd(&z[0][0], &u[0][0], &H[0], 1, 8, scale["z1"], scale["u1"], scale["H1"]);
|
||||||
|
}
|
||||||
|
}
void FastGRNN2(int16_t X[8][8], int16_t* H, int timestep) {
    memset(&H[0], 0, 8 * 2);

    for (int i = 0; i < timestep; i++) {
        int16_t a[1][8];
        MatMul(&X[i][0], &W2[0][0], &a[0][0], 1, 8, 8, scale["intermediate"], scale["W2"], scale["a2"]);
        int16_t b[1][8];
        MatMul(&H[0], &U2[0][0], &b[0][0], 1, 8, 8, scale["H2"], scale["U2"], scale["b2"]);
        int16_t c[1][8];
        MatAdd(&a[0][0], &b[0][0], &c[0][0], 1, 8, scale["a2"], scale["b2"], scale["c2"]);
        int16_t cBg[1][8];
        MatAdd(&c[0][0], &Bg2[0][0], &cBg[0][0], 1, 8, scale["c2"], scale["Bg2"], scale["cBg2"]);
        int16_t g[1][8];
        SigmoidNew16(&cBg[0][0], 1, 8, &g[0][0]);
        int16_t cBh[1][8];
        MatAdd(&c[0][0], &Bh2[0][0], &cBh[0][0], 1, 8, scale["c2"], scale["Bh2"], scale["cBh2"]);
        int16_t h[1][8];
        TanHNew16(&cBh[0][0], 1, 8, &h[0][0]);
        int16_t z[1][8];
        HadMul(&g[0][0], &H[0], &z[0][0], 1, 8, scale["g2"], scale["H2"], scale["z2"]);
        int16_t y[1][8];
        ScalarMatSub(16384, &g[0][0], &y[0][0], 1, 8, scale["one"], scale["g2"], scale["y2"]);
        int16_t w[1][8];
        ScalarMul(zeta2, &y[0][0], &w[0][0], 1, 8, scale["zeta2"], scale["y2"], scale["w2"]);
        int16_t v[1][8];
        ScalarMatAdd(nu2, &w[0][0], &v[0][0], 1, 8, scale["nu2"], scale["w2"], scale["v2"]);
        int16_t u[1][8];
        // As in FastGRNN1, multiply h_tilde by v = zeta2 * (1 - g) + nu2 so the +nu2 term is applied.
        HadMul(&v[0][0], &h[0][0], &u[0][0], 1, 8, scale["v2"], scale["h2"], scale["u2"]);

        MatAdd(&z[0][0], &u[0][0], &H[0], 1, 8, scale["z2"], scale["u2"], scale["H2"]);
    }
}
void RNNPool(int16_t X[8][8][4], int16_t pred[1][32]) {

    int16_t biinput1[8][8], biinput1r[8][8];
    for (int i = 0; i < 8; i++) {
        int16_t subX[8][4];
        for (int j = 0; j < 8; j++) {
            for (int k = 0; k < 4; k++) {
                subX[j][k] = X[i][j][k];
            }
        }
        int16_t H[1][8];
        FastGRNN1(subX, &H[0][0], 8);

        for (int j = 0; j < 8; j++) {
            biinput1[i][j] = H[0][j];
        }
    }

    int16_t res1[1][8], res2[1][8];
    FastGRNN2(biinput1, &res1[0][0], 8);
    reverse(&biinput1[0][0], &biinput1r[0][0], 8, 8);
    FastGRNN2(biinput1r, &res2[0][0], 8);

    int16_t biinput2[8][8], biinput2r[8][8];
    for (int i = 0; i < 8; i++) {
        int16_t subX[8][4];
        for (int j = 0; j < 8; j++) {
            for (int k = 0; k < 4; k++) {
                subX[j][k] = X[j][i][k];
            }
        }
        int16_t H[1][8];
        FastGRNN1(subX, &H[0][0], 8);

        for (int j = 0; j < 8; j++) {
            biinput2[i][j] = H[0][j];
        }
    }

    int16_t res3[1][8], res4[1][8];
    FastGRNN2(biinput2, &res3[0][0], 8);
    reverse(&biinput2[0][0], &biinput2r[0][0], 8, 8);
    FastGRNN2(biinput2r, &res4[0][0], 8);

    for (int i = 0; i < 8; i++)
        pred[0][i] = res1[0][i];
    for (int i = 0; i < 8; i++)
        pred[0][i + 8] = res2[0][i];
    for (int i = 0; i < 8; i++)
        pred[0][i + 16] = res3[0][i];
    for (int i = 0; i < 8; i++)
        pred[0][i + 24] = res4[0][i];
}
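To make the data flow of `RNNPool` explicit, here is a shape-level sketch of the same computation in NumPy. `fastgrnn1` and `fastgrnn2` are stand-ins for the two fixed-point cells above, passed in as arguments rather than taken from this file:

```python
import numpy as np

def rnnpool_summary(X, fastgrnn1, fastgrnn2):
    """X: (8, 8, 4) patch -> (32,) summary.

    fastgrnn1: maps an (8, 4) sequence to an (8,) final hidden state.
    fastgrnn2: maps an (8, 8) sequence to an (8,) final hidden state.
    """
    rows = np.stack([fastgrnn1(X[i]) for i in range(8)])     # (8, 8): one summary per row
    cols = np.stack([fastgrnn1(X[:, i]) for i in range(8)])  # (8, 8): one summary per column
    return np.concatenate([
        fastgrnn2(rows), fastgrnn2(rows[::-1]),              # row pass, forward and backward
        fastgrnn2(cols), fastgrnn2(cols[::-1]),              # column pass, forward and backward
    ])
```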
int main(int argc, char* argv[]) {
    string inputfile, outputfile;
    int patches;
    if (argc != 4) {
        cerr << "Improper number of arguments" << endl;
        return -1;
    }
    else {
        patches = atoi(argv[1]);
        inputfile = string(argv[2]);
        outputfile = string(argv[3]);
    }

    fstream Xfile, Yfile;
    Xfile.open(inputfile, ios::in | ios::binary);
    Yfile.open(outputfile, ios::out | ios::binary);

    char line[8];
    Xfile.read(line, 8); // .npy magic string "\x93NUMPY" plus the two version bytes
    uint16_t headerSize = 0; // .npy v1.0 stores the header length as a little-endian uint16
    Xfile.read((char*)&headerSize, sizeof(headerSize));

    char* headerLine = new char[headerSize]; // Ignored
    Xfile.read(headerLine, headerSize);
    delete[] headerLine;

    char numpyMagix = (char)147;
    char numpyVersionMajor = 1, numpyVersionMinor = 0;
    string numpyMetaHeader = "";
    numpyMetaHeader += numpyMagix;
    numpyMetaHeader += "NUMPY";
    numpyMetaHeader += numpyVersionMajor;
    numpyMetaHeader += numpyVersionMinor;

    string numpyHeader = "{'descr': '<f4', 'fortran_order': False, 'shape': (" + to_string(patches) + ", 1, 32), }";

    // Pad so that magic + length bytes + header dict total a multiple of 64 bytes.
    for (int i = numpyHeader.size() + numpyMetaHeader.size() + 2; i % 64 != 64 - 1; i++) {
        numpyHeader += ' ';
    }
    numpyHeader += (char)(10);

    char a = numpyHeader.size() / 256, b = numpyHeader.size() % 256; // little-endian header length
    Yfile << numpyMetaHeader;
    Yfile << b << a;
    Yfile << numpyHeader;

    for (int i = 0; i < patches; i++) { // one 4x8x8 patch per iteration
        float Xline[256];
        Xfile.read((char*)&Xline[0], 256 * 4);

        int16_t reshapedX[8][8][4];
        for (int a = 0; a < 4; a++) {
            for (int b = 0; b < 8; b++) {
                for (int c = 0; c < 8; c++) {
#ifdef SHIFT
                    reshapedX[b][c][a] = (int16_t)((Xline[a * 64 + b * 8 + c * 1]) * pow(2, scale["X"]));
#else
                    reshapedX[b][c][a] = (int16_t)((Xline[a * 64 + b * 8 + c * 1]) * scale["X"]);
#endif
                }
            }
        }

        int16_t pred[1][32];
        RNNPool(reshapedX, pred);

        for (int j = 0; j < 32; j++) {
            float val = ((float)pred[0][j]) / pow(2, scale["Y"]);
            Yfile.write((char*)&val, sizeof(float));
        }
    }
    Xfile.close();
    Yfile.close();

    return 0;
}
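A minimal Python driver for this tool, assuming the binary has been built as `./rnnpool_quantized` (the name is illustrative). It writes a float32 `.npy` trace in the layout `main` expects, 256 floats per 4x8x8 patch in channel-major order, and reads back the `(patches, 1, 32)` output:

```python
# Driver sketch for the quantized RNNPool binary above (binary name assumed).
import subprocess
import numpy as np

patches = 16
X = np.random.randn(patches, 4, 8, 8).astype(np.float32)
np.save("trace.npy", X.reshape(patches, 256))  # 256 floats per patch, as main() reads them

subprocess.run(["./rnnpool_quantized", str(patches), "trace.npy", "out.npy"], check=True)

Y = np.load("out.npy")
print(Y.shape)  # (patches, 1, 32)
```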
@@ -144,8 +144,8 @@ class RNNCell(nn.Module):

     def get_model_size(self):
         '''
         Function to get aimed model size
         '''
         mats = self.getVars()
         endW = self._num_W_matrices
         endU = endW + self._num_U_matrices
@@ -261,7 +261,7 @@ class FastGRNNCell(RNNCell):
         self.zeta = nn.Parameter(self._zetaInit * torch.ones([1, 1]))
         self.nu = nn.Parameter(self._nuInit * torch.ones([1, 1]))

-        self.copy_previous_UW()
+        # self.copy_previous_UW()

     @property
     def name(self):
@@ -330,7 +330,7 @@ class FastGRNNCUDACell(RNNCell):
     '''
     def __init__(self, input_size, hidden_size, gate_nonlinearity="sigmoid",
                  update_nonlinearity="tanh", wRank=None, uRank=None, zetaInit=1.0, nuInit=-4.0, wSparsity=1.0, uSparsity=1.0, name="FastGRNNCUDACell"):
-        super(FastGRNNCUDACell, self).__init__(input_size, hidden_size, gate_non_linearity, update_nonlinearity,
+        super(FastGRNNCUDACell, self).__init__(input_size, hidden_size, gate_nonlinearity, update_nonlinearity,
                                                1, 1, 2, wRank, uRank, wSparsity, uSparsity)
         if utils.findCUDA() is None:
             raise Exception('FastGRNNCUDA is supported only on GPU devices.')
@@ -967,63 +967,115 @@ class BaseRNN(nn.Module):
     [batchSize, timeSteps, inputDims]
     '''

-    def __init__(self, cell: RNNCell, batch_first=False):
+    def __init__(self, cell: RNNCell, batch_first=False, cell_reverse: RNNCell=None, bidirectional=False):
         super(BaseRNN, self).__init__()
-        self._RNNCell = cell
+        self.RNNCell = cell
         self._batch_first = batch_first
+        self._bidirectional = bidirectional
+        if cell_reverse is not None:
+            self.RNNCell_reverse = cell_reverse
+        elif self._bidirectional:
+            self.RNNCell_reverse = cell

     def getVars(self):
-        return self._RNNCell.getVars()
+        return self.RNNCell.getVars()

     def forward(self, input, hiddenState=None,
                 cellState=None):
         self.device = input.device
+        self.num_directions = 2 if self._bidirectional else 1
+        # hidden
+        # for i in range(num_directions):
         hiddenStates = torch.zeros(
             [input.shape[0], input.shape[1],
-             self._RNNCell.output_size]).to(self.device)
+             self.RNNCell.output_size]).to(self.device)
+
+        if self._bidirectional:
+            hiddenStates_reverse = torch.zeros(
+                [input.shape[0], input.shape[1],
+                 self.RNNCell_reverse.output_size]).to(self.device)

         if hiddenState is None:
             hiddenState = torch.zeros(
-                [input.shape[0] if self._batch_first else input.shape[1],
-                 self._RNNCell.output_size]).to(self.device)
+                [self.num_directions, input.shape[0] if self._batch_first else input.shape[1],
+                 self.RNNCell.output_size]).to(self.device)

         if self._batch_first is True:
-            if self._RNNCell.cellType == "LSTMLR":
+            if self.RNNCell.cellType == "LSTMLR":
                 cellStates = torch.zeros(
                     [input.shape[0], input.shape[1],
-                     self._RNNCell.output_size]).to(self.device)
+                     self.RNNCell.output_size]).to(self.device)
+                if self._bidirectional:
+                    cellStates_reverse = torch.zeros(
+                        [input.shape[0], input.shape[1],
+                         self.RNNCell_reverse.output_size]).to(self.device)
                 if cellState is None:
                     cellState = torch.zeros(
-                        [input.shape[0], self._RNNCell.output_size]).to(self.device)
+                        [self.num_directions, input.shape[0], self.RNNCell.output_size]).to(self.device)
                 for i in range(0, input.shape[1]):
-                    hiddenState, cellState = self._RNNCell(
-                        input[:, i, :], (hiddenState, cellState))
-                    hiddenStates[:, i, :] = hiddenState
-                    cellStates[:, i, :] = cellState
-                return hiddenStates, cellStates
+                    hiddenState[0], cellState[0] = self.RNNCell(
+                        input[:, i, :], (hiddenState[0].clone(), cellState[0].clone()))
+                    hiddenStates[:, i, :] = hiddenState[0]
+                    cellStates[:, i, :] = cellState[0]
+                    if self._bidirectional:
+                        hiddenState[1], cellState[1] = self.RNNCell_reverse(
+                            input[:, input.shape[1]-i-1, :], (hiddenState[1].clone(), cellState[1].clone()))
+                        hiddenStates_reverse[:, i, :] = hiddenState[1]
+                        cellStates_reverse[:, i, :] = cellState[1]
+                if not self._bidirectional:
+                    return hiddenStates, cellStates
+                else:
+                    return torch.cat([hiddenStates, hiddenStates_reverse], -1), torch.cat([cellStates, cellStates_reverse], -1)
             else:
                 for i in range(0, input.shape[1]):
-                    hiddenState = self._RNNCell(input[:, i, :], hiddenState)
-                    hiddenStates[:, i, :] = hiddenState
-                return hiddenStates
+                    hiddenState[0] = self.RNNCell(input[:, i, :], hiddenState[0].clone())
+                    hiddenStates[:, i, :] = hiddenState[0]
+                    if self._bidirectional:
+                        hiddenState[1] = self.RNNCell_reverse(
+                            input[:, input.shape[1]-i-1, :], hiddenState[1].clone())
+                        hiddenStates_reverse[:, i, :] = hiddenState[1]
+                if not self._bidirectional:
+                    return hiddenStates
+                else:
+                    return torch.cat([hiddenStates, hiddenStates_reverse], -1)
         else:
-            if self._RNNCell.cellType == "LSTMLR":
+            if self.RNNCell.cellType == "LSTMLR":
                 cellStates = torch.zeros(
                     [input.shape[0], input.shape[1],
-                     self._RNNCell.output_size]).to(self.device)
+                     self.RNNCell.output_size]).to(self.device)
+                if self._bidirectional:
+                    cellStates_reverse = torch.zeros(
+                        [input.shape[0], input.shape[1],
+                         self.RNNCell_reverse.output_size]).to(self.device)
                 if cellState is None:
                     cellState = torch.zeros(
-                        [input.shape[1], self._RNNCell.output_size]).to(self.device)
+                        [self.num_directions, input.shape[1], self.RNNCell.output_size]).to(self.device)
                 for i in range(0, input.shape[0]):
-                    hiddenState, cellState = self._RNNCell(
-                        input[i, :, :], (hiddenState, cellState))
-                    hiddenStates[i, :, :] = hiddenState
-                    cellStates[i, :, :] = cellState
-                return hiddenStates, cellStates
+                    hiddenState[0], cellState[0] = self.RNNCell(
+                        input[i, :, :], (hiddenState[0].clone(), cellState[0].clone()))
+                    hiddenStates[i, :, :] = hiddenState[0]
+                    cellStates[i, :, :] = cellState[0]
+                    if self._bidirectional:
+                        hiddenState[1], cellState[1] = self.RNNCell_reverse(
+                            input[input.shape[0]-i-1, :, :], (hiddenState[1].clone(), cellState[1].clone()))
+                        hiddenStates_reverse[i, :, :] = hiddenState[1]
+                        cellStates_reverse[i, :, :] = cellState[1]
+                if not self._bidirectional:
+                    return hiddenStates, cellStates
+                else:
+                    return torch.cat([hiddenStates, hiddenStates_reverse], -1), torch.cat([cellStates, cellStates_reverse], -1)
             else:
                 for i in range(0, input.shape[0]):
-                    hiddenState = self._RNNCell(input[i, :, :], hiddenState)
-                    hiddenStates[i, :, :] = hiddenState
-                return hiddenStates
+                    hiddenState[0] = self.RNNCell(input[i, :, :], hiddenState[0].clone())
+                    hiddenStates[i, :, :] = hiddenState[0]
+                    if self._bidirectional:
+                        hiddenState[1] = self.RNNCell_reverse(
+                            input[input.shape[0]-i-1, :, :], hiddenState[1].clone())
+                        hiddenStates_reverse[i, :, :] = hiddenState[1]
+                if not self._bidirectional:
+                    return hiddenStates
+                else:
+                    return torch.cat([hiddenStates, hiddenStates_reverse], -1)


 class LSTM(nn.Module):
@@ -1031,14 +1083,26 @@ class LSTM(nn.Module):

     def __init__(self, input_size, hidden_size, gate_nonlinearity="sigmoid",
                  update_nonlinearity="tanh", wRank=None, uRank=None,
-                 wSparsity=1.0, uSparsity=1.0, batch_first=False):
+                 wSparsity=1.0, uSparsity=1.0, batch_first=False,
+                 bidirectional=False, is_shared_bidirectional=True):
         super(LSTM, self).__init__()
+        self._bidirectional = bidirectional
+        self._batch_first = batch_first
+        self._is_shared_bidirectional = is_shared_bidirectional
         self.cell = LSTMLRCell(input_size, hidden_size,
                                gate_nonlinearity=gate_nonlinearity,
                                update_nonlinearity=update_nonlinearity,
                                wRank=wRank, uRank=uRank,
                                wSparsity=wSparsity, uSparsity=uSparsity)
-        self.unrollRNN = BaseRNN(self.cell, batch_first=batch_first)
+        self.unrollRNN = BaseRNN(self.cell, batch_first=self._batch_first, bidirectional=self._bidirectional)
+
+        if self._bidirectional is True and self._is_shared_bidirectional is False:
+            self.cell_reverse = LSTMLRCell(input_size, hidden_size,
+                                           gate_nonlinearity=gate_nonlinearity,
+                                           update_nonlinearity=update_nonlinearity,
+                                           wRank=wRank, uRank=uRank,
+                                           wSparsity=wSparsity, uSparsity=uSparsity)
+            self.unrollRNN = BaseRNN(self.cell, cell_reverse=self.cell_reverse, batch_first=self._batch_first, bidirectional=self._bidirectional)

     def forward(self, input, hiddenState=None, cellState=None):
         return self.unrollRNN(input, hiddenState, cellState)
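A hypothetical usage sketch of the new flags introduced above: with `is_shared_bidirectional=False` the wrapper instantiates a second `LSTMLRCell` for the backward direction (note the reverse cell is passed by keyword so it does not collide with `batch_first`), and the forward and backward states are concatenated on the last dimension:

```python
# Low-rank bidirectional LSTM with separate forward/backward weights (sketch).
import torch
from edgeml_pytorch.graph.rnn import LSTM

lstm = LSTM(input_size=4, hidden_size=8, wRank=2, uRank=2, batch_first=True,
            bidirectional=True, is_shared_bidirectional=False)
x = torch.randn(32, 10, 4)      # (batch, timesteps, inputDims)
h, c = lstm(x)                  # hidden and cell states, both directions concatenated
print(h.shape, c.shape)         # torch.Size([32, 10, 16]) torch.Size([32, 10, 16])
```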
@@ -1049,14 +1113,26 @@ class GRU(nn.Module):

     def __init__(self, input_size, hidden_size, gate_nonlinearity="sigmoid",
                  update_nonlinearity="tanh", wRank=None, uRank=None,
-                 wSparsity=1.0, uSparsity=1.0, batch_first=False):
+                 wSparsity=1.0, uSparsity=1.0, batch_first=False,
+                 bidirectional=False, is_shared_bidirectional=True):
         super(GRU, self).__init__()
+        self._bidirectional = bidirectional
+        self._batch_first = batch_first
+        self._is_shared_bidirectional = is_shared_bidirectional
         self.cell = GRULRCell(input_size, hidden_size,
                               gate_nonlinearity=gate_nonlinearity,
                               update_nonlinearity=update_nonlinearity,
                               wRank=wRank, uRank=uRank,
                               wSparsity=wSparsity, uSparsity=uSparsity)
-        self.unrollRNN = BaseRNN(self.cell, batch_first=batch_first)
+        self.unrollRNN = BaseRNN(self.cell, batch_first=self._batch_first, bidirectional=self._bidirectional)
+
+        if self._bidirectional is True and self._is_shared_bidirectional is False:
+            self.cell_reverse = GRULRCell(input_size, hidden_size,
+                                          gate_nonlinearity=gate_nonlinearity,
+                                          update_nonlinearity=update_nonlinearity,
+                                          wRank=wRank, uRank=uRank,
+                                          wSparsity=wSparsity, uSparsity=uSparsity)
+            self.unrollRNN = BaseRNN(self.cell, cell_reverse=self.cell_reverse, batch_first=self._batch_first, bidirectional=self._bidirectional)

     def forward(self, input, hiddenState=None, cellState=None):
         return self.unrollRNN(input, hiddenState, cellState)
@@ -1067,14 +1143,26 @@ class UGRNN(nn.Module):

     def __init__(self, input_size, hidden_size, gate_nonlinearity="sigmoid",
                  update_nonlinearity="tanh", wRank=None, uRank=None,
-                 wSparsity=1.0, uSparsity=1.0, batch_first=False):
+                 wSparsity=1.0, uSparsity=1.0, batch_first=False,
+                 bidirectional=False, is_shared_bidirectional=True):
         super(UGRNN, self).__init__()
+        self._bidirectional = bidirectional
+        self._batch_first = batch_first
+        self._is_shared_bidirectional = is_shared_bidirectional
         self.cell = UGRNNLRCell(input_size, hidden_size,
                                 gate_nonlinearity=gate_nonlinearity,
                                 update_nonlinearity=update_nonlinearity,
                                 wRank=wRank, uRank=uRank,
                                 wSparsity=wSparsity, uSparsity=uSparsity)
-        self.unrollRNN = BaseRNN(self.cell, batch_first=batch_first)
+        self.unrollRNN = BaseRNN(self.cell, batch_first=self._batch_first, bidirectional=self._bidirectional)
+
+        if self._bidirectional is True and self._is_shared_bidirectional is False:
+            self.cell_reverse = UGRNNLRCell(input_size, hidden_size,
+                                            gate_nonlinearity=gate_nonlinearity,
+                                            update_nonlinearity=update_nonlinearity,
+                                            wRank=wRank, uRank=uRank,
+                                            wSparsity=wSparsity, uSparsity=uSparsity)
+            self.unrollRNN = BaseRNN(self.cell, cell_reverse=self.cell_reverse, batch_first=self._batch_first, bidirectional=self._bidirectional)

     def forward(self, input, hiddenState=None, cellState=None):
         return self.unrollRNN(input, hiddenState, cellState)
@@ -1085,15 +1173,28 @@ class FastRNN(nn.Module):

     def __init__(self, input_size, hidden_size, gate_nonlinearity="sigmoid",
                  update_nonlinearity="tanh", wRank=None, uRank=None,
-                 wSparsity=1.0, uSparsity=1.0, alphaInit=-3.0, betaInit=3.0, batch_first=False):
+                 wSparsity=1.0, uSparsity=1.0, alphaInit=-3.0, betaInit=3.0,
+                 batch_first=False, bidirectional=False, is_shared_bidirectional=True):
         super(FastRNN, self).__init__()
+        self._bidirectional = bidirectional
+        self._batch_first = batch_first
+        self._is_shared_bidirectional = is_shared_bidirectional
         self.cell = FastRNNCell(input_size, hidden_size,
                                 gate_nonlinearity=gate_nonlinearity,
                                 update_nonlinearity=update_nonlinearity,
                                 wRank=wRank, uRank=uRank,
                                 wSparsity=wSparsity, uSparsity=uSparsity,
                                 alphaInit=alphaInit, betaInit=betaInit)
-        self.unrollRNN = BaseRNN(self.cell, batch_first=batch_first)
+        self.unrollRNN = BaseRNN(self.cell, batch_first=self._batch_first, bidirectional=self._bidirectional)
+
+        if self._bidirectional is True and self._is_shared_bidirectional is False:
+            self.cell_reverse = FastRNNCell(input_size, hidden_size,
+                                            gate_nonlinearity=gate_nonlinearity,
+                                            update_nonlinearity=update_nonlinearity,
+                                            wRank=wRank, uRank=uRank,
+                                            wSparsity=wSparsity, uSparsity=uSparsity,
+                                            alphaInit=alphaInit, betaInit=betaInit)
+            self.unrollRNN = BaseRNN(self.cell, cell_reverse=self.cell_reverse, batch_first=self._batch_first, bidirectional=self._bidirectional)

     def forward(self, input, hiddenState=None, cellState=None):
         return self.unrollRNN(input, hiddenState, cellState)
@@ -1105,15 +1206,27 @@ class FastGRNN(nn.Module):
     def __init__(self, input_size, hidden_size, gate_nonlinearity="sigmoid",
                  update_nonlinearity="tanh", wRank=None, uRank=None,
                  wSparsity=1.0, uSparsity=1.0, zetaInit=1.0, nuInit=-4.0,
-                 batch_first=False):
+                 batch_first=False, bidirectional=False, is_shared_bidirectional=True):
         super(FastGRNN, self).__init__()
+        self._bidirectional = bidirectional
+        self._batch_first = batch_first
+        self._is_shared_bidirectional = is_shared_bidirectional
         self.cell = FastGRNNCell(input_size, hidden_size,
                                  gate_nonlinearity=gate_nonlinearity,
                                  update_nonlinearity=update_nonlinearity,
                                  wRank=wRank, uRank=uRank,
                                  wSparsity=wSparsity, uSparsity=uSparsity,
                                  zetaInit=zetaInit, nuInit=nuInit)
-        self.unrollRNN = BaseRNN(self.cell, batch_first=batch_first)
+        self.unrollRNN = BaseRNN(self.cell, batch_first=self._batch_first, bidirectional=self._bidirectional)
+
+        if self._bidirectional is True and self._is_shared_bidirectional is False:
+            self.cell_reverse = FastGRNNCell(input_size, hidden_size,
+                                             gate_nonlinearity=gate_nonlinearity,
+                                             update_nonlinearity=update_nonlinearity,
+                                             wRank=wRank, uRank=uRank,
+                                             wSparsity=wSparsity, uSparsity=uSparsity,
+                                             zetaInit=zetaInit, nuInit=nuInit)
+            self.unrollRNN = BaseRNN(self.cell, cell_reverse=self.cell_reverse, batch_first=self._batch_first, bidirectional=self._bidirectional)

     def getVars(self):
         return self.unrollRNN.getVars()
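Correspondingly, a hypothetical sketch of the shared-weight case with `FastGRNN`: with `is_shared_bidirectional=True` (the default), the same cell weights are unrolled in both directions, and the output width doubles:

```python
# Shared-weight bidirectional FastGRNN (sketch).
import torch
from edgeml_pytorch.graph.rnn import FastGRNN

cell = FastGRNN(input_size=4, hidden_size=8, batch_first=True,
                bidirectional=True, is_shared_bidirectional=True)
x = torch.randn(32, 10, 4)      # (batch, timesteps, inputDims)
out = cell(x)                   # forward/backward states concatenated on the last dim
print(out.shape)                # torch.Size([32, 10, 16])
```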
@@ -1222,8 +1335,8 @@ class FastGRNNCUDA(nn.Module):

     def get_model_size(self):
         '''
         Function to get aimed model size
         '''
         mats = self.getVars()
         endW = self._num_W_matrices
         endU = endW + self._num_U_matrices
@@ -0,0 +1,66 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT license.

import torch
import torch.nn as nn
import numpy as np
from edgeml_pytorch.graph.rnn import *

class RNNPool(nn.Module):
    def __init__(self, nRows, nCols, nHiddenDims,
                 nHiddenDimsBiDir, inputDims):
        super(RNNPool, self).__init__()
        self.nRows = nRows
        self.nCols = nCols
        self.inputDims = inputDims
        self.nHiddenDims = nHiddenDims
        self.nHiddenDimsBiDir = nHiddenDimsBiDir

        self._build()

    def _build(self):

        self.cell_rnn = FastGRNN(self.inputDims, self.nHiddenDims, gate_nonlinearity="sigmoid",
                                 update_nonlinearity="tanh", zetaInit=100.0, nuInit=-100.0,
                                 batch_first=False, bidirectional=False)

        self.cell_bidirrnn = FastGRNN(self.nHiddenDims, self.nHiddenDimsBiDir, gate_nonlinearity="sigmoid",
                                      update_nonlinearity="tanh", zetaInit=100.0, nuInit=-100.0,
                                      batch_first=False, bidirectional=True, is_shared_bidirectional=True)

    def static_single(self, inputs, hidden, batch_size):

        outputs = self.cell_rnn(inputs, hidden[0], hidden[1])
        return torch.split(outputs[-1], split_size_or_sections=batch_size, dim=0)

    def forward(self, inputs, batch_size):
        ## across rows
        row_timestack = torch.cat(torch.unbind(inputs, dim=3), dim=0)

        stateList = self.static_single(torch.stack(torch.unbind(row_timestack, dim=2)),
                        (torch.zeros(1, batch_size * self.nRows, self.nHiddenDims).to(torch.device("cuda")),
                         torch.zeros(1, batch_size * self.nRows, self.nHiddenDims).to(torch.device("cuda"))), batch_size)

        outputs_cols = self.cell_bidirrnn(torch.stack(stateList),
                        torch.zeros(2, batch_size, self.nHiddenDimsBiDir).to(torch.device("cuda")),
                        torch.zeros(2, batch_size, self.nHiddenDimsBiDir).to(torch.device("cuda")))

        ## across columns
        col_timestack = torch.cat(torch.unbind(inputs, dim=2), dim=0)

        stateList = self.static_single(torch.stack(torch.unbind(col_timestack, dim=2)),
                        (torch.zeros(1, batch_size * self.nRows, self.nHiddenDims).to(torch.device("cuda")),
                         torch.zeros(1, batch_size * self.nRows, self.nHiddenDims).to(torch.device("cuda"))), batch_size)

        outputs_rows = self.cell_bidirrnn(torch.stack(stateList),
                        torch.zeros(2, batch_size, self.nHiddenDimsBiDir).to(torch.device("cuda")),
                        torch.zeros(2, batch_size, self.nHiddenDimsBiDir).to(torch.device("cuda")))

        output = torch.cat([outputs_rows[-1], outputs_cols[-1]], 1)

        return output
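A hypothetical usage sketch of this module. Shapes are inferred from `forward` above, which consumes `(batch, inputDims, nRows, nCols)` inputs and returns a `4 * nHiddenDimsBiDir`-dimensional summary per example; note the hidden states are hardcoded to CUDA, so a GPU is required, and the import path is assumed from this file's location:

```python
# RNNPool module usage (sketch); shapes and import path are inferred, not confirmed.
import torch
from edgeml_pytorch.graph.rnnpool import RNNPool

rp = RNNPool(nRows=8, nCols=8, nHiddenDims=16,
             nHiddenDimsBiDir=16, inputDims=4).to("cuda")
x = torch.randn(32, 4, 8, 8, device="cuda")   # (batch, inputDims, nRows, nCols)
out = rp(x, batch_size=32)
print(out.shape)                              # torch.Size([32, 64]), i.e. 4 * nHiddenDimsBiDir
```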