This commit is contained in:
Parent
a362bfa1f2
Commit
f926dffb45
@@ -14,8 +14,6 @@ dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/

@@ -102,3 +100,8 @@ venv.bak/

# mypy
.mypy_cache/

/data
/output
/models
/log
@@ -0,0 +1,14 @@
# Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a
Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us
the rights to use your contribution. For details, visit https://cla.microsoft.com.

When you submit a pull request, a CLA-bot will automatically determine whether you need to provide
a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions
provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/).
For more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or
contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments.
README.md
@@ -1,14 +1,188 @@
# Simple Baselines for Pose Estimation and Pose Tracking

## Introduction
This is an official pytorch implementation of [*Simple Baselines for Pose Estimation and Pose Tracking*](https://arxiv.org/abs/1804.06208). This work provides baseline methods that are surprisingly simple and effective, and thus helpful for inspiring and evaluating new ideas for the field. State-of-the-art results are achieved on challenging benchmarks. On the COCO keypoint validation set, our best **single model** achieves **74.3 mAP**. You can reproduce our results using this repo. All models are provided for research purposes.

## Main Results
### Results on MPII val
| arch | Head | Shoulder | Elbow | Wrist | Hip | Knee | Ankle | Mean | Mean@0.1 |
|---|---|---|---|---|---|---|---|---|---|
| 256x256_pose_resnet_50_d256d256d256 | 96.351 | 95.329 | 88.989 | 83.176 | 88.420 | 83.960 | 79.594 | 88.532 | 33.911 |
| 384x384_pose_resnet_50_d256d256d256 | 96.658 | 95.754 | 89.790 | 84.614 | 88.523 | 84.666 | 79.287 | 89.066 | 38.046 |
| 256x256_pose_resnet_101_d256d256d256 | 96.862 | 95.873 | 89.518 | 84.376 | 88.437 | 84.486 | 80.703 | 89.131 | 34.020 |
| 384x384_pose_resnet_101_d256d256d256 | 96.965 | 95.907 | 90.268 | 85.780 | 89.597 | 85.935 | 82.098 | 90.003 | 38.860 |
| 256x256_pose_resnet_152_d256d256d256 | 97.033 | 95.941 | 90.046 | 84.976 | 89.164 | 85.311 | 81.271 | 89.620 | 35.025 |
| 384x384_pose_resnet_152_d256d256d256 | 96.794 | 95.618 | 90.080 | 86.225 | 89.700 | 86.862 | 82.853 | 90.200 | 39.433 |

### Note:
- Flip test is used

### Results on COCO val2017 (using a person detector with 56.4 AP on COCO val2017)
| Arch | AP | AP .5 | AP .75 | AP (M) | AP (L) | AR | AR .5 | AR .75 | AR (M) | AR (L) |
|---|---|---|---|---|---|---|---|---|---|---|
| 256x192_pose_resnet_50_d256d256d256 | 0.704 | 0.886 | 0.783 | 0.671 | 0.772 | 0.763 | 0.929 | 0.834 | 0.721 | 0.824 |
| 384x288_pose_resnet_50_d256d256d256 | 0.722 | 0.893 | 0.789 | 0.681 | 0.797 | 0.776 | 0.932 | 0.838 | 0.728 | 0.846 |
| 256x192_pose_resnet_101_d256d256d256 | 0.714 | 0.893 | 0.793 | 0.681 | 0.781 | 0.771 | 0.934 | 0.840 | 0.730 | 0.832 |
| 384x288_pose_resnet_101_d256d256d256 | 0.736 | 0.896 | 0.803 | 0.699 | 0.811 | 0.791 | 0.936 | 0.851 | 0.745 | 0.858 |
| 256x192_pose_resnet_152_d256d256d256 | 0.720 | 0.893 | 0.798 | 0.687 | 0.789 | 0.778 | 0.934 | 0.846 | 0.736 | 0.839 |
| 384x288_pose_resnet_152_d256d256d256 | 0.743 | 0.896 | 0.811 | 0.705 | 0.816 | 0.797 | 0.937 | 0.858 | 0.751 | 0.863 |

### Note:
- Flip test is used
- Person detector has person AP of 56.4 on COCO val2017 dataset

## Environment
The code is developed using python 3.6 on Ubuntu 16.04. NVIDIA GPUs are needed. The code is developed and tested using 4 NVIDIA P100 GPU cards. Other platforms or GPU cards are not fully tested.

## Quick start
### Installation
1. Install pytorch >= v0.4.0 following the [official instructions](https://pytorch.org/)
2. Disable cudnn for batch_norm:
```
# PYTORCH=/path/to/pytorch
# for pytorch v0.4.0
sed -i "1194s/torch\.backends\.cudnn\.enabled/False/g" ${PYTORCH}/torch/nn/functional.py
# for pytorch v0.4.1
sed -i "1254s/torch\.backends\.cudnn\.enabled/False/g" ${PYTORCH}/torch/nn/functional.py
```
Note that instructions like `# PYTORCH=/path/to/pytorch` indicate that you should pick a path where you'd like to have pytorch installed and then set an environment variable (`PYTORCH` in this case) accordingly.
3. Clone this repo; we'll call the directory that you cloned ${POSE_ROOT}.
4. Install dependencies:
```
pip install -r requirements.txt
```
5. Install [COCOAPI](https://github.com/cocodataset/cocoapi):
```
# COCOAPI=/path/to/clone/cocoapi
git clone https://github.com/cocodataset/cocoapi.git $COCOAPI
cd $COCOAPI/PythonAPI
# Install into global site-packages
make install
# Alternatively, if you do not have permissions or prefer
# not to install the COCO API into global site-packages
python3 setup.py install --user
```
Note that instructions like `# COCOAPI=/path/to/clone/cocoapi` indicate that you should pick a path where you'd like to have the software cloned and then set an environment variable (`COCOAPI` in this case) accordingly.
6. Download pytorch imagenet pretrained models from the [pytorch model zoo](https://pytorch.org/docs/stable/model_zoo.html#module-torch.utils.model_zoo).
7. Download mpii and coco pretrained models from [OneDrive](https://1drv.ms/f/s!AhIXJn_J-blW0D5ZE4ArK9wk_fvw). Please download them under ${POSE_ROOT}/models/pytorch, and make them look like this:

```
${POSE_ROOT}
 `-- models
     `-- pytorch
         |-- imagenet
         |   |-- resnet50-19c8e357.pth
         |   |-- resnet101-5d3b4d8f.pth
         |   `-- resnet152-b121ed2d.pth
         |-- pose_coco
         |   |-- pose_resnet_101_256x192.pth.tar
         |   |-- pose_resnet_101_384x288.pth.tar
         |   |-- pose_resnet_152_256x192.pth.tar
         |   |-- pose_resnet_152_384x288.pth.tar
         |   |-- pose_resnet_50_256x192.pth.tar
         |   `-- pose_resnet_50_384x288.pth.tar
         `-- pose_mpii
             |-- pose_resnet_101_256x256.pth.tar
             |-- pose_resnet_101_384x384.pth.tar
             |-- pose_resnet_152_256x256.pth.tar
             |-- pose_resnet_152_384x384.pth.tar
             |-- pose_resnet_50_256x256.pth.tar
             `-- pose_resnet_50_384x384.pth.tar
```

8. Init output (training model output directory) and log (tensorboard log directory) directories:

```
mkdir output
mkdir log
```

Your directory tree should then look like this:

```
${POSE_ROOT}
├── data
├── experiments
├── lib
├── log
├── models
├── output
├── pose_estimation
├── README.md
└── requirements.txt
```

### Data preparation
**For MPII data**, please download from [MPII Human Pose Dataset](http://human-pose.mpi-inf.mpg.de/). The original annotation files are in matlab format; we have converted them to json format, which you also need to download from [OneDrive](https://1drv.ms/f/s!AhIXJn_J-blW00SqrairNetmeVu4).
Extract them under ${POSE_ROOT}/data, and make them look like this:
```
${POSE_ROOT}
|-- data
`-- |-- mpii
    `-- |-- annot
        |   |-- gt_valid.mat
        |   |-- test.json
        |   |-- train.json
        |   |-- trainval.json
        |   `-- valid.json
        `-- images
            |-- 000001163.jpg
            |-- 000003072.jpg
```

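As a quick sanity check that the converted annotations are in place, you can load one of the json files; this snippet is illustrative only and assumes the converted files are JSON arrays of per-sample records:

```
import json

with open('data/mpii/annot/valid.json') as f:
    annots = json.load(f)
print(len(annots), 'annotation records')
print(sorted(annots[0].keys()))
```
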
**For COCO data**, please download from [COCO download](http://cocodataset.org/#download); 2017 Train/Val is needed for COCO keypoints training and validation. We also provide the person detection results on COCO val2017 needed to reproduce our multi-person pose estimation results. Please download them from [OneDrive](https://1drv.ms/f/s!AhIXJn_J-blWzzA1A-y1AH-pZQdS).
Download and extract them under ${POSE_ROOT}/data, and make them look like this:
```
${POSE_ROOT}
|-- data
`-- |-- coco
    `-- |-- annotations
        |   |-- person_keypoints_train2017.json
        |   `-- person_keypoints_val2017.json
        |-- person_detection_results
        |   |-- COCO_val2017_detections_AP_H_56_person.json
        `-- images
            |-- train2017
            |   |-- 000000000009.jpg
            |   |-- 000000000025.jpg
            |   |-- 000000000030.jpg
            |   |-- ...
            `-- val2017
                |-- 000000000139.jpg
                |-- 000000000285.jpg
                |-- 000000000632.jpg
                |-- ...
```

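Similarly, once the COCO annotations are extracted, they can be sanity-checked with the COCO API installed earlier:

```
from pycocotools.coco import COCO

coco = COCO('data/coco/annotations/person_keypoints_val2017.json')
print(len(coco.getImgIds()), 'val2017 images')
```
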
### Validate on MPII using pretrained models

```
python pose_estimation/valid.py \
    --cfg experiments/mpii/resnet50/256x256_d256x3_adam_lr1e-3.yaml \
    --flip-test \
    --model-file models/pytorch/pose_mpii/pose_resnet_50_256x256.pth.tar
```

### Training on MPII

```
python pose_estimation/train.py \
    --cfg experiments/mpii/resnet50/256x256_d256x3_adam_lr1e-3.yaml
```

### Validate on COCO val2017 using pretrained models

```
python pose_estimation/valid.py \
    --cfg experiments/coco/resnet50/256x192_d256x3_adam_lr1e-3.yaml \
    --flip-test \
    --model-file models/pytorch/pose_coco/pose_resnet_50_256x192.pth.tar
```

### Training on COCO train2017

```
python pose_estimation/train.py \
    --cfg experiments/coco/resnet50/256x192_d256x3_adam_lr1e-3.yaml
```

@@ -0,0 +1,76 @@
GPUS: '0'
DATA_DIR: ''
OUTPUT_DIR: 'output'
LOG_DIR: 'log'
WORKERS: 4
PRINT_FREQ: 100

DATASET:
  DATASET: 'coco'
  ROOT: 'data/coco/'
  TEST_SET: 'val2017'
  TRAIN_SET: 'train2017'
  FLIP: true
  ROT_FACTOR: 40
  SCALE_FACTOR: 0.3
MODEL:
  NAME: 'pose_resnet'
  PRETRAINED: 'models/pytorch/imagenet/resnet101-5d3b4d8f.pth'
  IMAGE_SIZE:
  - 192
  - 256
  NUM_JOINTS: 17
  EXTRA:
    TARGET_TYPE: 'gaussian'
    HEATMAP_SIZE:
    - 48
    - 64
    SIGMA: 2
    FINAL_CONV_KERNEL: 1
    DECONV_WITH_BIAS: false
    NUM_DECONV_LAYERS: 3
    NUM_DECONV_FILTERS:
    - 256
    - 256
    - 256
    NUM_DECONV_KERNELS:
    - 4
    - 4
    - 4
    NUM_LAYERS: 101
LOSS:
  USE_TARGET_WEIGHT: true
TRAIN:
  BATCH_SIZE: 32
  SHUFFLE: true
  BEGIN_EPOCH: 0
  END_EPOCH: 140
  RESUME: false
  OPTIMIZER: 'adam'
  LR: 0.001
  LR_FACTOR: 0.1
  LR_STEP:
  - 90
  - 120
  WD: 0.0001
  GAMMA1: 0.99
  GAMMA2: 0.0
  MOMENTUM: 0.9
  NESTEROV: false
TEST:
  BATCH_SIZE: 32
  COCO_BBOX_FILE: 'data/coco/person_detection_results/COCO_val2017_detections_AP_H_56_person.json'
  BBOX_THRE: 1.0
  FLIP_TEST: false
  IMAGE_THRE: 0.0
  IN_VIS_THRE: 0.2
  MODEL_FILE: ''
  NMS_THRE: 1.0
  OKS_THRE: 0.9
  USE_GT_BBOX: true
DEBUG:
  DEBUG: true
  SAVE_BATCH_IMAGES_GT: true
  SAVE_BATCH_IMAGES_PRED: true
  SAVE_HEATMAPS_GT: true
  SAVE_HEATMAPS_PRED: true
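Each YAML file in this commit is an experiment definition consumed by update_config in lib/core/config.py (added later in this diff). A minimal usage sketch, run from ${POSE_ROOT} so that lib/ is importable; the config path is the MPII example quoted in the README, since the exact file paths of the configs on this page are not shown:

import sys
sys.path.insert(0, 'lib')  # make the repo's core package importable

from core.config import config, update_config

update_config('experiments/mpii/resnet50/256x256_d256x3_adam_lr1e-3.yaml')
print(config.MODEL.EXTRA.NUM_LAYERS)  # 50
print(config.TRAIN.END_EPOCH)         # 140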
@@ -0,0 +1,76 @@
GPUS: '0'
DATA_DIR: ''
OUTPUT_DIR: 'output'
LOG_DIR: 'log'
WORKERS: 4
PRINT_FREQ: 100

DATASET:
  DATASET: 'coco'
  ROOT: 'data/coco/'
  TEST_SET: 'val2017'
  TRAIN_SET: 'train2017'
  FLIP: true
  ROT_FACTOR: 40
  SCALE_FACTOR: 0.3
MODEL:
  NAME: 'pose_resnet'
  PRETRAINED: 'models/pytorch/imagenet/resnet101-5d3b4d8f.pth'
  IMAGE_SIZE:
  - 288
  - 384
  NUM_JOINTS: 17
  EXTRA:
    TARGET_TYPE: 'gaussian'
    HEATMAP_SIZE:
    - 72
    - 96
    SIGMA: 3
    FINAL_CONV_KERNEL: 1
    DECONV_WITH_BIAS: false
    NUM_DECONV_LAYERS: 3
    NUM_DECONV_FILTERS:
    - 256
    - 256
    - 256
    NUM_DECONV_KERNELS:
    - 4
    - 4
    - 4
    NUM_LAYERS: 101
LOSS:
  USE_TARGET_WEIGHT: true
TRAIN:
  BATCH_SIZE: 32
  SHUFFLE: true
  BEGIN_EPOCH: 0
  END_EPOCH: 140
  RESUME: false
  OPTIMIZER: 'adam'
  LR: 0.001
  LR_FACTOR: 0.1
  LR_STEP:
  - 90
  - 120
  WD: 0.0001
  GAMMA1: 0.99
  GAMMA2: 0.0
  MOMENTUM: 0.9
  NESTEROV: false
TEST:
  BATCH_SIZE: 32
  COCO_BBOX_FILE: 'data/coco/person_detection_results/COCO_val2017_detections_AP_H_56_person.json'
  BBOX_THRE: 1.0
  FLIP_TEST: false
  IMAGE_THRE: 0.0
  IN_VIS_THRE: 0.2
  MODEL_FILE: ''
  NMS_THRE: 1.0
  OKS_THRE: 0.9
  USE_GT_BBOX: true
DEBUG:
  DEBUG: true
  SAVE_BATCH_IMAGES_GT: true
  SAVE_BATCH_IMAGES_PRED: true
  SAVE_HEATMAPS_GT: true
  SAVE_HEATMAPS_PRED: true
@@ -0,0 +1,76 @@
GPUS: '0'
DATA_DIR: ''
OUTPUT_DIR: 'output'
LOG_DIR: 'log'
WORKERS: 4
PRINT_FREQ: 100

DATASET:
  DATASET: 'coco'
  ROOT: 'data/coco/'
  TEST_SET: 'val2017'
  TRAIN_SET: 'train2017'
  FLIP: true
  ROT_FACTOR: 40
  SCALE_FACTOR: 0.3
MODEL:
  NAME: 'pose_resnet'
  PRETRAINED: 'models/pytorch/imagenet/resnet152-b121ed2d.pth'
  IMAGE_SIZE:
  - 192
  - 256
  NUM_JOINTS: 17
  EXTRA:
    TARGET_TYPE: 'gaussian'
    HEATMAP_SIZE:
    - 48
    - 64
    SIGMA: 2
    FINAL_CONV_KERNEL: 1
    DECONV_WITH_BIAS: false
    NUM_DECONV_LAYERS: 3
    NUM_DECONV_FILTERS:
    - 256
    - 256
    - 256
    NUM_DECONV_KERNELS:
    - 4
    - 4
    - 4
    NUM_LAYERS: 152
LOSS:
  USE_TARGET_WEIGHT: true
TRAIN:
  BATCH_SIZE: 32
  SHUFFLE: true
  BEGIN_EPOCH: 0
  END_EPOCH: 140
  RESUME: false
  OPTIMIZER: 'adam'
  LR: 0.001
  LR_FACTOR: 0.1
  LR_STEP:
  - 90
  - 120
  WD: 0.0001
  GAMMA1: 0.99
  GAMMA2: 0.0
  MOMENTUM: 0.9
  NESTEROV: false
TEST:
  BATCH_SIZE: 32
  COCO_BBOX_FILE: 'data/coco/person_detection_results/COCO_val2017_detections_AP_H_56_person.json'
  BBOX_THRE: 1.0
  FLIP_TEST: false
  IMAGE_THRE: 0.0
  IN_VIS_THRE: 0.2
  MODEL_FILE: ''
  NMS_THRE: 1.0
  OKS_THRE: 0.9
  USE_GT_BBOX: true
DEBUG:
  DEBUG: true
  SAVE_BATCH_IMAGES_GT: true
  SAVE_BATCH_IMAGES_PRED: true
  SAVE_HEATMAPS_GT: true
  SAVE_HEATMAPS_PRED: true
@@ -0,0 +1,76 @@
GPUS: '0'
DATA_DIR: ''
OUTPUT_DIR: 'output'
LOG_DIR: 'log'
WORKERS: 4
PRINT_FREQ: 100

DATASET:
  DATASET: 'coco'
  ROOT: 'data/coco/'
  TEST_SET: 'val2017'
  TRAIN_SET: 'train2017'
  FLIP: true
  ROT_FACTOR: 40
  SCALE_FACTOR: 0.3
MODEL:
  NAME: 'pose_resnet'
  PRETRAINED: 'models/pytorch/imagenet/resnet152-b121ed2d.pth'
  IMAGE_SIZE:
  - 288
  - 384
  NUM_JOINTS: 17
  EXTRA:
    TARGET_TYPE: 'gaussian'
    HEATMAP_SIZE:
    - 72
    - 96
    SIGMA: 3
    FINAL_CONV_KERNEL: 1
    DECONV_WITH_BIAS: false
    NUM_DECONV_LAYERS: 3
    NUM_DECONV_FILTERS:
    - 256
    - 256
    - 256
    NUM_DECONV_KERNELS:
    - 4
    - 4
    - 4
    NUM_LAYERS: 152
LOSS:
  USE_TARGET_WEIGHT: true
TRAIN:
  BATCH_SIZE: 32
  SHUFFLE: true
  BEGIN_EPOCH: 0
  END_EPOCH: 140
  RESUME: false
  OPTIMIZER: 'adam'
  LR: 0.001
  LR_FACTOR: 0.1
  LR_STEP:
  - 90
  - 120
  WD: 0.0001
  GAMMA1: 0.99
  GAMMA2: 0.0
  MOMENTUM: 0.9
  NESTEROV: false
TEST:
  BATCH_SIZE: 32
  COCO_BBOX_FILE: 'data/coco/person_detection_results/COCO_val2017_detections_AP_H_56_person.json'
  BBOX_THRE: 1.0
  FLIP_TEST: false
  IMAGE_THRE: 0.0
  IN_VIS_THRE: 0.2
  MODEL_FILE: ''
  NMS_THRE: 1.0
  OKS_THRE: 0.9
  USE_GT_BBOX: true
DEBUG:
  DEBUG: true
  SAVE_BATCH_IMAGES_GT: true
  SAVE_BATCH_IMAGES_PRED: true
  SAVE_HEATMAPS_GT: true
  SAVE_HEATMAPS_PRED: true
@@ -0,0 +1,76 @@
GPUS: '0'
DATA_DIR: ''
OUTPUT_DIR: 'output'
LOG_DIR: 'log'
WORKERS: 4
PRINT_FREQ: 100

DATASET:
  DATASET: 'coco'
  ROOT: 'data/coco/'
  TEST_SET: 'val2017'
  TRAIN_SET: 'train2017'
  FLIP: true
  ROT_FACTOR: 40
  SCALE_FACTOR: 0.3
MODEL:
  NAME: 'pose_resnet'
  PRETRAINED: 'models/pytorch/imagenet/resnet50-19c8e357.pth'
  IMAGE_SIZE:
  - 192
  - 256
  NUM_JOINTS: 17
  EXTRA:
    TARGET_TYPE: 'gaussian'
    HEATMAP_SIZE:
    - 48
    - 64
    SIGMA: 2
    FINAL_CONV_KERNEL: 1
    DECONV_WITH_BIAS: false
    NUM_DECONV_LAYERS: 3
    NUM_DECONV_FILTERS:
    - 256
    - 256
    - 256
    NUM_DECONV_KERNELS:
    - 4
    - 4
    - 4
    NUM_LAYERS: 50
LOSS:
  USE_TARGET_WEIGHT: true
TRAIN:
  BATCH_SIZE: 32
  SHUFFLE: true
  BEGIN_EPOCH: 0
  END_EPOCH: 140
  RESUME: false
  OPTIMIZER: 'adam'
  LR: 0.001
  LR_FACTOR: 0.1
  LR_STEP:
  - 90
  - 120
  WD: 0.0001
  GAMMA1: 0.99
  GAMMA2: 0.0
  MOMENTUM: 0.9
  NESTEROV: false
TEST:
  BATCH_SIZE: 32
  COCO_BBOX_FILE: 'data/coco/person_detection_results/COCO_val2017_detections_AP_H_56_person.json'
  BBOX_THRE: 1.0
  FLIP_TEST: false
  IMAGE_THRE: 0.0
  IN_VIS_THRE: 0.2
  MODEL_FILE: ''
  NMS_THRE: 1.0
  OKS_THRE: 0.9
  USE_GT_BBOX: true
DEBUG:
  DEBUG: true
  SAVE_BATCH_IMAGES_GT: true
  SAVE_BATCH_IMAGES_PRED: true
  SAVE_HEATMAPS_GT: true
  SAVE_HEATMAPS_PRED: true
@@ -0,0 +1,76 @@
GPUS: '0'
DATA_DIR: ''
OUTPUT_DIR: 'output'
LOG_DIR: 'log'
WORKERS: 4
PRINT_FREQ: 100

DATASET:
  DATASET: 'coco'
  ROOT: 'data/coco/'
  TEST_SET: 'val2017'
  TRAIN_SET: 'train2017'
  FLIP: true
  ROT_FACTOR: 40
  SCALE_FACTOR: 0.3
MODEL:
  NAME: 'pose_resnet'
  PRETRAINED: 'models/pytorch/imagenet/resnet50-19c8e357.pth'
  IMAGE_SIZE:
  - 288
  - 384
  NUM_JOINTS: 17
  EXTRA:
    TARGET_TYPE: 'gaussian'
    HEATMAP_SIZE:
    - 72
    - 96
    SIGMA: 3
    FINAL_CONV_KERNEL: 1
    DECONV_WITH_BIAS: false
    NUM_DECONV_LAYERS: 3
    NUM_DECONV_FILTERS:
    - 256
    - 256
    - 256
    NUM_DECONV_KERNELS:
    - 4
    - 4
    - 4
    NUM_LAYERS: 50
LOSS:
  USE_TARGET_WEIGHT: true
TRAIN:
  BATCH_SIZE: 32
  SHUFFLE: true
  BEGIN_EPOCH: 0
  END_EPOCH: 140
  RESUME: false
  OPTIMIZER: 'adam'
  LR: 0.001
  LR_FACTOR: 0.1
  LR_STEP:
  - 90
  - 120
  WD: 0.0001
  GAMMA1: 0.99
  GAMMA2: 0.0
  MOMENTUM: 0.9
  NESTEROV: false
TEST:
  BATCH_SIZE: 32
  COCO_BBOX_FILE: 'data/coco/person_detection_results/COCO_val2017_detections_AP_H_56_person.json'
  BBOX_THRE: 1.0
  FLIP_TEST: false
  IMAGE_THRE: 0.0
  IN_VIS_THRE: 0.2
  MODEL_FILE: ''
  NMS_THRE: 1.0
  OKS_THRE: 0.9
  USE_GT_BBOX: true
DEBUG:
  DEBUG: true
  SAVE_BATCH_IMAGES_GT: true
  SAVE_BATCH_IMAGES_PRED: true
  SAVE_HEATMAPS_GT: true
  SAVE_HEATMAPS_PRED: true
@@ -0,0 +1,72 @@
GPUS: '0'
DATA_DIR: ''
OUTPUT_DIR: 'output'
LOG_DIR: 'log'
WORKERS: 4
PRINT_FREQ: 100
CUDNN:
  BENCHMARK: True
  DETERMINISTIC: False
  ENABLED: True
DATASET:
  DATASET: mpii
  ROOT: 'data/mpii/'
  TEST_SET: valid
  TRAIN_SET: train
  FLIP: true
  ROT_FACTOR: 30
  SCALE_FACTOR: 0.25
MODEL:
  NAME: pose_resnet
  PRETRAINED: 'models/pytorch/imagenet/resnet101-5d3b4d8f.pth'
  IMAGE_SIZE:
  - 256
  - 256
  NUM_JOINTS: 16
  EXTRA:
    TARGET_TYPE: gaussian
    SIGMA: 2
    HEATMAP_SIZE:
    - 64
    - 64
    FINAL_CONV_KERNEL: 1
    DECONV_WITH_BIAS: false
    NUM_DECONV_LAYERS: 3
    NUM_DECONV_FILTERS:
    - 256
    - 256
    - 256
    NUM_DECONV_KERNELS:
    - 4
    - 4
    - 4
    NUM_LAYERS: 101
LOSS:
  USE_TARGET_WEIGHT: true
TRAIN:
  BATCH_SIZE: 32
  SHUFFLE: true
  BEGIN_EPOCH: 0
  END_EPOCH: 140
  RESUME: false
  OPTIMIZER: adam
  LR: 0.001
  LR_FACTOR: 0.1
  LR_STEP:
  - 90
  - 120
  WD: 0.0001
  GAMMA1: 0.99
  GAMMA2: 0.0
  MOMENTUM: 0.9
  NESTEROV: false
TEST:
  BATCH_SIZE: 32
  FLIP_TEST: false
  MODEL_FILE: ''
DEBUG:
  DEBUG: false
  SAVE_BATCH_IMAGES_GT: true
  SAVE_BATCH_IMAGES_PRED: true
  SAVE_HEATMAPS_GT: true
  SAVE_HEATMAPS_PRED: true
@@ -0,0 +1,69 @@
GPUS: '0'
DATA_DIR: ''
OUTPUT_DIR: 'output'
LOG_DIR: 'log'
WORKERS: 4
PRINT_FREQ: 100

DATASET:
  DATASET: mpii
  ROOT: 'data/mpii/'
  TEST_SET: valid
  TRAIN_SET: train
  FLIP: true
  ROT_FACTOR: 30
  SCALE_FACTOR: 0.25
MODEL:
  NAME: pose_resnet
  PRETRAINED: 'models/pytorch/imagenet/resnet101-5d3b4d8f.pth'
  IMAGE_SIZE:
  - 384
  - 384
  NUM_JOINTS: 16
  EXTRA:
    TARGET_TYPE: gaussian
    HEATMAP_SIZE:
    - 96
    - 96
    SIGMA: 3
    FINAL_CONV_KERNEL: 1
    DECONV_WITH_BIAS: false
    NUM_DECONV_LAYERS: 3
    NUM_DECONV_FILTERS:
    - 256
    - 256
    - 256
    NUM_DECONV_KERNELS:
    - 4
    - 4
    - 4
    NUM_LAYERS: 101
LOSS:
  USE_TARGET_WEIGHT: true
TRAIN:
  BATCH_SIZE: 32
  SHUFFLE: true
  BEGIN_EPOCH: 0
  END_EPOCH: 140
  RESUME: false
  OPTIMIZER: adam
  LR: 0.001
  LR_FACTOR: 0.1
  LR_STEP:
  - 90
  - 120
  WD: 0.0001
  GAMMA1: 0.99
  GAMMA2: 0.0
  MOMENTUM: 0.9
  NESTEROV: false
TEST:
  BATCH_SIZE: 32
  FLIP_TEST: false
  MODEL_FILE: ''
DEBUG:
  DEBUG: false
  SAVE_BATCH_IMAGES_GT: true
  SAVE_BATCH_IMAGES_PRED: true
  SAVE_HEATMAPS_GT: true
  SAVE_HEATMAPS_PRED: true
@@ -0,0 +1,72 @@
GPUS: '0'
DATA_DIR: ''
OUTPUT_DIR: 'output'
LOG_DIR: 'log'
WORKERS: 4
PRINT_FREQ: 100
CUDNN:
  BENCHMARK: True
  DETERMINISTIC: False
  ENABLED: True
DATASET:
  DATASET: mpii
  ROOT: 'data/mpii/'
  TEST_SET: valid
  TRAIN_SET: train
  FLIP: true
  ROT_FACTOR: 30
  SCALE_FACTOR: 0.25
MODEL:
  NAME: pose_resnet
  PRETRAINED: 'models/pytorch/imagenet/resnet152-b121ed2d.pth'
  IMAGE_SIZE:
  - 256
  - 256
  NUM_JOINTS: 16
  EXTRA:
    TARGET_TYPE: gaussian
    SIGMA: 2
    HEATMAP_SIZE:
    - 64
    - 64
    FINAL_CONV_KERNEL: 1
    DECONV_WITH_BIAS: false
    NUM_DECONV_LAYERS: 3
    NUM_DECONV_FILTERS:
    - 256
    - 256
    - 256
    NUM_DECONV_KERNELS:
    - 4
    - 4
    - 4
    NUM_LAYERS: 152
LOSS:
  USE_TARGET_WEIGHT: true
TRAIN:
  BATCH_SIZE: 32
  SHUFFLE: true
  BEGIN_EPOCH: 0
  END_EPOCH: 140
  RESUME: false
  OPTIMIZER: adam
  LR: 0.001
  LR_FACTOR: 0.1
  LR_STEP:
  - 90
  - 120
  WD: 0.0001
  GAMMA1: 0.99
  GAMMA2: 0.0
  MOMENTUM: 0.9
  NESTEROV: false
TEST:
  BATCH_SIZE: 32
  FLIP_TEST: false
  MODEL_FILE: ''
DEBUG:
  DEBUG: false
  SAVE_BATCH_IMAGES_GT: true
  SAVE_BATCH_IMAGES_PRED: true
  SAVE_HEATMAPS_GT: true
  SAVE_HEATMAPS_PRED: true
@@ -0,0 +1,72 @@
GPUS: '0'
DATA_DIR: ''
OUTPUT_DIR: 'output'
LOG_DIR: 'log'
WORKERS: 4
PRINT_FREQ: 100
CUDNN:
  BENCHMARK: True
  DETERMINISTIC: False
  ENABLED: True
DATASET:
  DATASET: mpii
  ROOT: 'data/mpii/'
  TEST_SET: valid
  TRAIN_SET: train
  FLIP: true
  ROT_FACTOR: 30
  SCALE_FACTOR: 0.25
MODEL:
  NAME: pose_resnet
  PRETRAINED: 'models/pytorch/imagenet/resnet152-b121ed2d.pth'
  IMAGE_SIZE:
  - 256
  - 256
  NUM_JOINTS: 16
  EXTRA:
    TARGET_TYPE: gaussian
    SIGMA: 2
    HEATMAP_SIZE:
    - 64
    - 64
    FINAL_CONV_KERNEL: 1
    DECONV_WITH_BIAS: false
    NUM_DECONV_LAYERS: 3
    NUM_DECONV_FILTERS:
    - 256
    - 256
    - 256
    NUM_DECONV_KERNELS:
    - 4
    - 4
    - 4
    NUM_LAYERS: 152
LOSS:
  USE_TARGET_WEIGHT: true
TRAIN:
  BATCH_SIZE: 24
  SHUFFLE: true
  BEGIN_EPOCH: 0
  END_EPOCH: 140
  RESUME: false
  OPTIMIZER: adam
  LR: 0.001
  LR_FACTOR: 0.1
  LR_STEP:
  - 90
  - 120
  WD: 0.0001
  GAMMA1: 0.99
  GAMMA2: 0.0
  MOMENTUM: 0.9
  NESTEROV: false
TEST:
  BATCH_SIZE: 32
  FLIP_TEST: false
  MODEL_FILE: ''
DEBUG:
  DEBUG: false
  SAVE_BATCH_IMAGES_GT: true
  SAVE_BATCH_IMAGES_PRED: true
  SAVE_HEATMAPS_GT: true
  SAVE_HEATMAPS_PRED: true
@@ -0,0 +1,72 @@
GPUS: '0'
DATA_DIR: ''
OUTPUT_DIR: 'output'
LOG_DIR: 'log'
WORKERS: 4
PRINT_FREQ: 100
CUDNN:
  BENCHMARK: True
  DETERMINISTIC: False
  ENABLED: True
DATASET:
  DATASET: mpii
  ROOT: 'data/mpii/'
  TEST_SET: valid
  TRAIN_SET: train
  FLIP: true
  ROT_FACTOR: 30
  SCALE_FACTOR: 0.25
MODEL:
  NAME: pose_resnet
  PRETRAINED: 'models/pytorch/imagenet/resnet50-19c8e357.pth'
  IMAGE_SIZE:
  - 256
  - 256
  NUM_JOINTS: 16
  EXTRA:
    TARGET_TYPE: gaussian
    SIGMA: 2
    HEATMAP_SIZE:
    - 64
    - 64
    FINAL_CONV_KERNEL: 1
    DECONV_WITH_BIAS: false
    NUM_DECONV_LAYERS: 3
    NUM_DECONV_FILTERS:
    - 256
    - 256
    - 256
    NUM_DECONV_KERNELS:
    - 4
    - 4
    - 4
    NUM_LAYERS: 50
LOSS:
  USE_TARGET_WEIGHT: true
TRAIN:
  BATCH_SIZE: 32
  SHUFFLE: true
  BEGIN_EPOCH: 0
  END_EPOCH: 140
  RESUME: false
  OPTIMIZER: adam
  LR: 0.001
  LR_FACTOR: 0.1
  LR_STEP:
  - 90
  - 120
  WD: 0.0001
  GAMMA1: 0.99
  GAMMA2: 0.0
  MOMENTUM: 0.9
  NESTEROV: false
TEST:
  BATCH_SIZE: 32
  FLIP_TEST: false
  MODEL_FILE: ''
DEBUG:
  DEBUG: false
  SAVE_BATCH_IMAGES_GT: true
  SAVE_BATCH_IMAGES_PRED: true
  SAVE_HEATMAPS_GT: true
  SAVE_HEATMAPS_PRED: true
@@ -0,0 +1,72 @@
GPUS: '0'
DATA_DIR: ''
OUTPUT_DIR: 'output'
LOG_DIR: 'log'
WORKERS: 4
PRINT_FREQ: 100
CUDNN:
  BENCHMARK: True
  DETERMINISTIC: False
  ENABLED: True
DATASET:
  DATASET: mpii
  ROOT: 'data/mpii/'
  TEST_SET: valid
  TRAIN_SET: train
  FLIP: true
  ROT_FACTOR: 30
  SCALE_FACTOR: 0.25
MODEL:
  NAME: pose_resnet
  PRETRAINED: 'models/pytorch/imagenet/resnet50-19c8e357.pth'
  IMAGE_SIZE:
  - 256
  - 256
  NUM_JOINTS: 16
  EXTRA:
    TARGET_TYPE: gaussian
    SIGMA: 2
    HEATMAP_SIZE:
    - 64
    - 64
    FINAL_CONV_KERNEL: 1
    DECONV_WITH_BIAS: false
    NUM_DECONV_LAYERS: 3
    NUM_DECONV_FILTERS:
    - 256
    - 256
    - 256
    NUM_DECONV_KERNELS:
    - 4
    - 4
    - 4
    NUM_LAYERS: 50
LOSS:
  USE_TARGET_WEIGHT: true
TRAIN:
  BATCH_SIZE: 32
  SHUFFLE: true
  BEGIN_EPOCH: 0
  END_EPOCH: 140
  RESUME: false
  OPTIMIZER: adam
  LR: 0.001
  LR_FACTOR: 0.1
  LR_STEP:
  - 90
  - 120
  WD: 0.0001
  GAMMA1: 0.99
  GAMMA2: 0.0
  MOMENTUM: 0.9
  NESTEROV: false
TEST:
  BATCH_SIZE: 32
  FLIP_TEST: false
  MODEL_FILE: ''
DEBUG:
  DEBUG: false
  SAVE_BATCH_IMAGES_GT: true
  SAVE_BATCH_IMAGES_PRED: true
  SAVE_HEATMAPS_GT: true
  SAVE_HEATMAPS_PRED: true
@@ -0,0 +1,4 @@
all:
	cd nms; python setup.py build_ext --inplace; rm -rf build; cd ../../
clean:
	cd nms; rm *.so; cd ../../
@@ -0,0 +1,225 @@
# ------------------------------------------------------------------------------
# Copyright (c) Microsoft
# Licensed under the MIT License.
# Written by Bin Xiao (Bin.Xiao@microsoft.com)
# ------------------------------------------------------------------------------

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import os
import yaml

import numpy as np
from easydict import EasyDict as edict


config = edict()

config.OUTPUT_DIR = ''
config.LOG_DIR = ''
config.DATA_DIR = ''
config.GPUS = '0'
config.WORKERS = 4
config.PRINT_FREQ = 20

# Cudnn related params
config.CUDNN = edict()
config.CUDNN.BENCHMARK = True
config.CUDNN.DETERMINISTIC = False
config.CUDNN.ENABLED = True

# pose_resnet related params
POSE_RESNET = edict()
POSE_RESNET.NUM_LAYERS = 50
POSE_RESNET.DECONV_WITH_BIAS = False
POSE_RESNET.NUM_DECONV_LAYERS = 3
POSE_RESNET.NUM_DECONV_FILTERS = [256, 256, 256]
POSE_RESNET.NUM_DECONV_KERNELS = [4, 4, 4]
POSE_RESNET.FINAL_CONV_KERNEL = 1
POSE_RESNET.TARGET_TYPE = 'gaussian'
POSE_RESNET.HEATMAP_SIZE = [64, 64]  # width * height, ex: 24 * 32
POSE_RESNET.SIGMA = 2

MODEL_EXTRAS = {
    'pose_resnet': POSE_RESNET,
}

# common params for NETWORK
config.MODEL = edict()
config.MODEL.NAME = 'pose_resnet'
config.MODEL.INIT_WEIGHTS = True
config.MODEL.PRETRAINED = ''
config.MODEL.NUM_JOINTS = 16
config.MODEL.IMAGE_SIZE = [256, 256]  # width * height, ex: 192 * 256
config.MODEL.EXTRA = MODEL_EXTRAS[config.MODEL.NAME]

config.LOSS = edict()
config.LOSS.USE_TARGET_WEIGHT = True

# DATASET related params
config.DATASET = edict()
config.DATASET.ROOT = ''
config.DATASET.DATASET = 'mpii'
config.DATASET.TRAIN_SET = 'train'
config.DATASET.TEST_SET = 'valid'
config.DATASET.DATA_FORMAT = 'jpg'
config.DATASET.HYBRID_JOINTS_TYPE = ''
config.DATASET.SELECT_DATA = False

# training data augmentation
config.DATASET.FLIP = True
config.DATASET.SCALE_FACTOR = 0.25
config.DATASET.ROT_FACTOR = 30

# train
config.TRAIN = edict()

config.TRAIN.LR_FACTOR = 0.1
config.TRAIN.LR_STEP = [90, 110]
config.TRAIN.LR = 0.001

config.TRAIN.OPTIMIZER = 'adam'
config.TRAIN.MOMENTUM = 0.9
config.TRAIN.WD = 0.0001
config.TRAIN.NESTEROV = False
config.TRAIN.GAMMA1 = 0.99
config.TRAIN.GAMMA2 = 0.0

config.TRAIN.BEGIN_EPOCH = 0
config.TRAIN.END_EPOCH = 140

config.TRAIN.RESUME = False
config.TRAIN.CHECKPOINT = ''

config.TRAIN.BATCH_SIZE = 32
config.TRAIN.SHUFFLE = True

# testing
config.TEST = edict()

# size of images for each device
config.TEST.BATCH_SIZE = 32
# Test Model Epoch
config.TEST.FLIP_TEST = False
config.TEST.POST_PROCESS = True
config.TEST.SHIFT_HEATMAP = True

config.TEST.USE_GT_BBOX = False
# nms
config.TEST.OKS_THRE = 0.5
config.TEST.IN_VIS_THRE = 0.0
config.TEST.COCO_BBOX_FILE = ''
config.TEST.BBOX_THRE = 1.0
config.TEST.MODEL_FILE = ''

# debug
config.DEBUG = edict()
config.DEBUG.DEBUG = False
config.DEBUG.SAVE_BATCH_IMAGES_GT = False
config.DEBUG.SAVE_BATCH_IMAGES_PRED = False
config.DEBUG.SAVE_HEATMAPS_GT = False
config.DEBUG.SAVE_HEATMAPS_PRED = False


def _update_dict(k, v):
    if k == 'DATASET':
        if 'MEAN' in v and v['MEAN']:
            v['MEAN'] = np.array([eval(x) if isinstance(x, str) else x
                                  for x in v['MEAN']])
        if 'STD' in v and v['STD']:
            v['STD'] = np.array([eval(x) if isinstance(x, str) else x
                                 for x in v['STD']])
    if k == 'MODEL':
        if 'EXTRA' in v and 'HEATMAP_SIZE' in v['EXTRA']:
            if isinstance(v['EXTRA']['HEATMAP_SIZE'], int):
                v['EXTRA']['HEATMAP_SIZE'] = np.array(
                    [v['EXTRA']['HEATMAP_SIZE'], v['EXTRA']['HEATMAP_SIZE']])
            else:
                v['EXTRA']['HEATMAP_SIZE'] = np.array(
                    v['EXTRA']['HEATMAP_SIZE'])
        if 'IMAGE_SIZE' in v:
            if isinstance(v['IMAGE_SIZE'], int):
                v['IMAGE_SIZE'] = np.array([v['IMAGE_SIZE'], v['IMAGE_SIZE']])
            else:
                v['IMAGE_SIZE'] = np.array(v['IMAGE_SIZE'])
    for vk, vv in v.items():
        if vk in config[k]:
            config[k][vk] = vv
        else:
            raise ValueError("{}.{} does not exist in config.py".format(k, vk))


def update_config(config_file):
    exp_config = None
    with open(config_file) as f:
        exp_config = edict(yaml.load(f))
        for k, v in exp_config.items():
            if k in config:
                if isinstance(v, dict):
                    _update_dict(k, v)
                else:
                    if k == 'SCALES':
                        config[k][0] = (tuple(v))
                    else:
                        config[k] = v
            else:
                raise ValueError("{} does not exist in config.py".format(k))


def gen_config(config_file):
    cfg = dict(config)
    for k, v in cfg.items():
        if isinstance(v, edict):
            cfg[k] = dict(v)

    with open(config_file, 'w') as f:
        yaml.dump(dict(cfg), f, default_flow_style=False)


def update_dir(model_dir, log_dir, data_dir):
    if model_dir:
        config.OUTPUT_DIR = model_dir

    if log_dir:
        config.LOG_DIR = log_dir

    if data_dir:
        config.DATA_DIR = data_dir

    config.DATASET.ROOT = os.path.join(
        config.DATA_DIR, config.DATASET.ROOT)

    config.TEST.COCO_BBOX_FILE = os.path.join(
        config.DATA_DIR, config.TEST.COCO_BBOX_FILE)

    config.MODEL.PRETRAINED = os.path.join(
        config.DATA_DIR, config.MODEL.PRETRAINED)


def get_model_name(cfg):
    name = cfg.MODEL.NAME
    full_name = cfg.MODEL.NAME
    extra = cfg.MODEL.EXTRA
    if name in ['pose_resnet']:
        name = '{model}_{num_layers}'.format(
            model=name,
            num_layers=extra.NUM_LAYERS)
        deconv_suffix = ''.join(
            'd{}'.format(num_filters)
            for num_filters in extra.NUM_DECONV_FILTERS)
        full_name = '{height}x{width}_{name}_{deconv_suffix}'.format(
            height=cfg.MODEL.IMAGE_SIZE[1],
            width=cfg.MODEL.IMAGE_SIZE[0],
            name=name,
            deconv_suffix=deconv_suffix)
    else:
        raise ValueError('Unknown model: {}'.format(cfg.MODEL))

    return name, full_name


if __name__ == '__main__':
    import sys
    gen_config(sys.argv[1])
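As a usage sketch, the architecture names in the README result tables come out of get_model_name; assuming lib/ is on the path and a config has been loaded as in the earlier sketch:

from core.config import config, update_config, get_model_name

update_config('experiments/mpii/resnet50/256x256_d256x3_adam_lr1e-3.yaml')
name, full_name = get_model_name(config)
print(name)       # pose_resnet_50
print(full_name)  # 256x256_pose_resnet_50_d256d256d256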
@@ -0,0 +1,71 @@
# ------------------------------------------------------------------------------
# Copyright (c) Microsoft
# Licensed under the MIT License.
# Written by Bin Xiao (Bin.Xiao@microsoft.com)
# ------------------------------------------------------------------------------

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import numpy as np

from core.inference import get_max_preds


def calc_dists(preds, target, normalize):
    preds = preds.astype(np.float32)
    target = target.astype(np.float32)
    dists = np.zeros((preds.shape[1], preds.shape[0]))
    for n in range(preds.shape[0]):
        for c in range(preds.shape[1]):
            if target[n, c, 0] > 1 and target[n, c, 1] > 1:
                normed_preds = preds[n, c, :] / normalize[n]
                normed_targets = target[n, c, :] / normalize[n]
                dists[c, n] = np.linalg.norm(normed_preds - normed_targets)
            else:
                dists[c, n] = -1
    return dists


def dist_acc(dists, thr=0.5):
    ''' Return percentage below threshold while ignoring values with a -1 '''
    dist_cal = np.not_equal(dists, -1)
    num_dist_cal = dist_cal.sum()
    if num_dist_cal > 0:
        return np.less(dists[dist_cal], thr).sum() * 1.0 / num_dist_cal
    else:
        return -1


def accuracy(output, target, hm_type='gaussian', thr=0.5):
    '''
    Calculate accuracy according to PCK,
    but uses ground truth heatmap rather than x,y locations
    First value to be returned is average accuracy across 'idxs',
    followed by individual accuracies
    '''
    idx = list(range(output.shape[1]))
    norm = 1.0
    if hm_type == 'gaussian':
        pred, _ = get_max_preds(output)
        target, _ = get_max_preds(target)
        h = output.shape[2]
        w = output.shape[3]
        norm = np.ones((pred.shape[0], 2)) * np.array([h, w]) / 10
    dists = calc_dists(pred, target, norm)

    acc = np.zeros((len(idx) + 1))
    avg_acc = 0
    cnt = 0

    for i in range(len(idx)):
        acc[i + 1] = dist_acc(dists[idx[i]])
        if acc[i + 1] >= 0:
            avg_acc = avg_acc + acc[i + 1]
            cnt += 1

    avg_acc = avg_acc / cnt if cnt != 0 else 0
    if cnt != 0:
        acc[0] = avg_acc
    return acc, avg_acc, cnt, pred
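A quick standalone check of accuracy, not part of this commit: feeding identical prediction and target heatmaps makes every normalized distance zero, so the PCK score should come out at 1.0 (assumes lib/ is on the path, as above).

import numpy as np

from core.evaluate import accuracy

# a batch of 8 samples, 16 joints, 64x64 heatmaps
heatmaps = np.random.rand(8, 16, 64, 64).astype(np.float32)
acc, avg_acc, cnt, pred = accuracy(heatmaps, heatmaps)
print(avg_acc)  # 1.0 -- all distances are 0, below the default 0.5 threshold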
@@ -0,0 +1,239 @@
# ------------------------------------------------------------------------------
# Copyright (c) Microsoft
# Licensed under the MIT License.
# Written by Bin Xiao (Bin.Xiao@microsoft.com)
# ------------------------------------------------------------------------------

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import logging
import time
import os

import numpy as np
import torch

from core.config import get_model_name
from core.evaluate import accuracy
from core.inference import get_final_preds
from utils.transforms import flip_back
from utils.vis import save_debug_images


logger = logging.getLogger(__name__)


def train(config, train_loader, model, criterion, optimizer, epoch,
          output_dir, tb_log_dir, writer_dict):
    batch_time = AverageMeter()
    data_time = AverageMeter()
    losses = AverageMeter()
    acc = AverageMeter()

    # switch to train mode
    model.train()

    end = time.time()
    for i, (input, target, target_weight, meta) in enumerate(train_loader):
        # measure data loading time
        data_time.update(time.time() - end)

        # compute output
        output = model(input)
        target = target.cuda(non_blocking=True)
        target_weight = target_weight.cuda(non_blocking=True)

        loss = criterion(output, target, target_weight)

        # compute gradient and do update step
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # measure accuracy and record loss
        losses.update(loss.item(), input.size(0))

        _, avg_acc, cnt, pred = accuracy(output.detach().cpu().numpy(),
                                         target.detach().cpu().numpy())
        acc.update(avg_acc, cnt)

        # measure elapsed time
        batch_time.update(time.time() - end)
        end = time.time()

        if i % config.PRINT_FREQ == 0:
            msg = 'Epoch: [{0}][{1}/{2}]\t' \
                  'Time {batch_time.val:.3f}s ({batch_time.avg:.3f}s)\t' \
                  'Speed {speed:.1f} samples/s\t' \
                  'Data {data_time.val:.3f}s ({data_time.avg:.3f}s)\t' \
                  'Loss {loss.val:.5f} ({loss.avg:.5f})\t' \
                  'Accuracy {acc.val:.3f} ({acc.avg:.3f})'.format(
                      epoch, i, len(train_loader), batch_time=batch_time,
                      speed=input.size(0)/batch_time.val,
                      data_time=data_time, loss=losses, acc=acc)
            logger.info(msg)

            writer = writer_dict['writer']
            global_steps = writer_dict['train_global_steps']
            writer.add_scalar('train_loss', losses.val, global_steps)
            writer.add_scalar('train_acc', acc.val, global_steps)
            writer_dict['train_global_steps'] = global_steps + 1

            prefix = '{}_{}'.format(os.path.join(output_dir, 'train'), i)
            save_debug_images(config, input, meta, target, pred*4, output,
                              prefix)


def validate(config, val_loader, val_dataset, model, criterion, output_dir,
             tb_log_dir, writer_dict=None):
    batch_time = AverageMeter()
    losses = AverageMeter()
    acc = AverageMeter()

    # switch to evaluate mode
    model.eval()

    num_samples = len(val_dataset)
    all_preds = np.zeros((num_samples, config.MODEL.NUM_JOINTS, 3),
                         dtype=np.float32)
    all_boxes = np.zeros((num_samples, 6))
    image_path = []
    filenames = []
    imgnums = []
    idx = 0
    with torch.no_grad():
        end = time.time()
        for i, (input, target, target_weight, meta) in enumerate(val_loader):
            # compute output
            output = model(input)
            if config.TEST.FLIP_TEST:
                # this part is ugly, because pytorch does not support negative indexing
                # input_flipped = model(input[:, :, :, ::-1])
                input_flipped = np.flip(input.cpu().numpy(), 3).copy()
                input_flipped = torch.from_numpy(input_flipped).cuda()
                output_flipped = model(input_flipped)
                output_flipped = flip_back(output_flipped.cpu().numpy(),
                                           val_dataset.flip_pairs)
                output_flipped = torch.from_numpy(output_flipped.copy()).cuda()

                # feature is not aligned, shift flipped heatmap for higher accuracy
                if config.TEST.SHIFT_HEATMAP:
                    output_flipped[:, :, :, 1:] = \
                        output_flipped.clone()[:, :, :, 0:-1]
                    # output_flipped[:, :, :, 0] = 0

                output = (output + output_flipped) * 0.5

            target = target.cuda(non_blocking=True)
            target_weight = target_weight.cuda(non_blocking=True)

            loss = criterion(output, target, target_weight)

            num_images = input.size(0)
            # measure accuracy and record loss
            losses.update(loss.item(), num_images)
            _, avg_acc, cnt, pred = accuracy(output.cpu().numpy(),
                                             target.cpu().numpy())

            acc.update(avg_acc, cnt)

            # measure elapsed time
            batch_time.update(time.time() - end)
            end = time.time()

            c = meta['center'].numpy()
            s = meta['scale'].numpy()
            score = meta['score'].numpy()

            preds, maxvals = get_final_preds(
                config, output.clone().cpu().numpy(), c, s)

            all_preds[idx:idx + num_images, :, 0:2] = preds[:, :, 0:2]
            all_preds[idx:idx + num_images, :, 2:3] = maxvals
            # double check this all_boxes parts
            all_boxes[idx:idx + num_images, 0:2] = c[:, 0:2]
            all_boxes[idx:idx + num_images, 2:4] = s[:, 0:2]
            all_boxes[idx:idx + num_images, 4] = np.prod(s*200, 1)
            all_boxes[idx:idx + num_images, 5] = score
            image_path.extend(meta['image'])
            if config.DATASET.DATASET == 'posetrack':
                filenames.extend(meta['filename'])
                imgnums.extend(meta['imgnum'].numpy())

            idx += num_images

            if i % config.PRINT_FREQ == 0:
                msg = 'Test: [{0}/{1}]\t' \
                      'Time {batch_time.val:.3f} ({batch_time.avg:.3f})\t' \
                      'Loss {loss.val:.4f} ({loss.avg:.4f})\t' \
                      'Accuracy {acc.val:.3f} ({acc.avg:.3f})'.format(
                          i, len(val_loader), batch_time=batch_time,
                          loss=losses, acc=acc)
                logger.info(msg)

                prefix = '{}_{}'.format(os.path.join(output_dir, 'val'), i)
                save_debug_images(config, input, meta, target, pred*4, output,
                                  prefix)

        name_values, perf_indicator = val_dataset.evaluate(
            config, all_preds, output_dir, all_boxes, image_path,
            filenames, imgnums)

        _, full_arch_name = get_model_name(config)
        if isinstance(name_values, list):
            for name_value in name_values:
                _print_name_value(name_value, full_arch_name)
        else:
            _print_name_value(name_values, full_arch_name)

        if writer_dict:
            writer = writer_dict['writer']
            global_steps = writer_dict['valid_global_steps']
            writer.add_scalar('valid_loss', losses.avg, global_steps)
            writer.add_scalar('valid_acc', acc.avg, global_steps)
            if isinstance(name_values, list):
                for name_value in name_values:
                    writer.add_scalars('valid', dict(name_value), global_steps)
            else:
                writer.add_scalars('valid', dict(name_values), global_steps)
            writer_dict['valid_global_steps'] = global_steps + 1

    return perf_indicator


# markdown format output
def _print_name_value(name_value, full_arch_name):
    names = name_value.keys()
    values = name_value.values()
    num_values = len(name_value)
    logger.info(
        '| Arch ' +
        ' '.join(['| {}'.format(name) for name in names]) +
        ' |'
    )
    logger.info('|---' * (num_values+1) + '|')
    logger.info(
        '| ' + full_arch_name + ' ' +
        ' '.join(['| {:.3f}'.format(value) for value in values]) +
        ' |'
    )


class AverageMeter(object):
    """Computes and stores the average and current value"""
    def __init__(self):
        self.reset()

    def reset(self):
        self.val = 0
        self.avg = 0
        self.sum = 0
        self.count = 0

    def update(self, val, n=1):
        self.val = val
        self.sum += val * n
        self.count += n
        self.avg = self.sum / self.count if self.count != 0 else 0
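AverageMeter drives all of the running statistics in the log lines above; a quick standalone check of its weighted running mean (assumes lib/ is on the path, as in the earlier sketches):

from core.function import AverageMeter

meter = AverageMeter()
for batch_loss, batch_size in [(0.9, 32), (0.7, 32), (0.5, 16)]:
    meter.update(batch_loss, n=batch_size)
print(meter.val)  # 0.5, the most recent value
print(meter.avg)  # 0.74 = (0.9*32 + 0.7*32 + 0.5*16) / 80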
@@ -0,0 +1,74 @@
# ------------------------------------------------------------------------------
# Copyright (c) Microsoft
# Licensed under the MIT License.
# Written by Bin Xiao (Bin.Xiao@microsoft.com)
# ------------------------------------------------------------------------------

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import math

import numpy as np

from utils.transforms import transform_preds


def get_max_preds(batch_heatmaps):
    '''
    get predictions from score maps
    heatmaps: numpy.ndarray([batch_size, num_joints, height, width])
    '''
    assert isinstance(batch_heatmaps, np.ndarray), \
        'batch_heatmaps should be numpy.ndarray'
    assert batch_heatmaps.ndim == 4, 'batch_images should be 4-ndim'

    batch_size = batch_heatmaps.shape[0]
    num_joints = batch_heatmaps.shape[1]
    width = batch_heatmaps.shape[3]
    heatmaps_reshaped = batch_heatmaps.reshape((batch_size, num_joints, -1))
    idx = np.argmax(heatmaps_reshaped, 2)
    maxvals = np.amax(heatmaps_reshaped, 2)

    maxvals = maxvals.reshape((batch_size, num_joints, 1))
    idx = idx.reshape((batch_size, num_joints, 1))

    preds = np.tile(idx, (1, 1, 2)).astype(np.float32)

    preds[:, :, 0] = (preds[:, :, 0]) % width
    preds[:, :, 1] = np.floor((preds[:, :, 1]) / width)

    pred_mask = np.tile(np.greater(maxvals, 0.0), (1, 1, 2))
    pred_mask = pred_mask.astype(np.float32)

    preds *= pred_mask
    return preds, maxvals


def get_final_preds(config, batch_heatmaps, center, scale):
    coords, maxvals = get_max_preds(batch_heatmaps)

    heatmap_height = batch_heatmaps.shape[2]
    heatmap_width = batch_heatmaps.shape[3]

    # post-processing
    if config.TEST.POST_PROCESS:
        for n in range(coords.shape[0]):
            for p in range(coords.shape[1]):
                hm = batch_heatmaps[n][p]
                px = int(math.floor(coords[n][p][0] + 0.5))
                py = int(math.floor(coords[n][p][1] + 0.5))
                if 1 < px < heatmap_width-1 and 1 < py < heatmap_height-1:
                    diff = np.array([hm[py][px+1] - hm[py][px-1],
                                     hm[py+1][px]-hm[py-1][px]])
                    coords[n][p] += np.sign(diff) * .25

    preds = coords.copy()

    # Transform back
    for i in range(coords.shape[0]):
        preds[i] = transform_preds(coords[i], center[i], scale[i],
                                   [heatmap_width, heatmap_height])

    return preds, maxvals
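The coordinate decoding in get_max_preds can be verified on a synthetic heatmap with a single known peak; a sketch, not part of this commit:

import numpy as np

from core.inference import get_max_preds

heatmaps = np.zeros((1, 1, 64, 64), dtype=np.float32)
heatmaps[0, 0, 20, 30] = 1.0  # peak at row 20 (y), column 30 (x)
preds, maxvals = get_max_preds(heatmaps)
print(preds)    # [[[30. 20.]]] -- (x, y) of the argmax
print(maxvals)  # [[[1.]]]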
@@ -0,0 +1,38 @@
# ------------------------------------------------------------------------------
# Copyright (c) Microsoft
# Licensed under the MIT License.
# Written by Bin Xiao (Bin.Xiao@microsoft.com)
# ------------------------------------------------------------------------------

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import torch.nn as nn


class JointsMSELoss(nn.Module):
    def __init__(self, use_target_weight):
        super(JointsMSELoss, self).__init__()
        self.criterion = nn.MSELoss(size_average=True)
        self.use_target_weight = use_target_weight

    def forward(self, output, target, target_weight):
        batch_size = output.size(0)
        num_joints = output.size(1)
        heatmaps_pred = output.reshape((batch_size, num_joints, -1)).split(1, 1)
        heatmaps_gt = target.reshape((batch_size, num_joints, -1)).split(1, 1)
        loss = 0

        for idx in range(num_joints):
            heatmap_pred = heatmaps_pred[idx].squeeze()
            heatmap_gt = heatmaps_gt[idx].squeeze()
            if self.use_target_weight:
                loss += 0.5 * self.criterion(
                    heatmap_pred.mul(target_weight[:, idx]),
                    heatmap_gt.mul(target_weight[:, idx])
                )
            else:
                loss += 0.5 * self.criterion(heatmap_pred, heatmap_gt)

        return loss / num_joints
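A usage sketch for JointsMSELoss with shapes matching the MPII configs above (16 joints, 64x64 heatmaps); the tensors here are random placeholders, not real model outputs:

import torch

from core.loss import JointsMSELoss

criterion = JointsMSELoss(use_target_weight=True)
output = torch.rand(4, 16, 64, 64)    # predicted heatmaps
target = torch.rand(4, 16, 64, 64)    # ground-truth heatmaps
target_weight = torch.ones(4, 16, 1)  # 1 = joint visible, 0 = ignore
loss = criterion(output, target, target_weight)
print(loss.item())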
@ -0,0 +1,218 @@
|
|||
# ------------------------------------------------------------------------------
|
||||
# Copyright (c) Microsoft
|
||||
# Licensed under the MIT License.
|
||||
# Written by Bin Xiao (Bin.Xiao@microsoft.com)
|
||||
# ------------------------------------------------------------------------------

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import copy
import logging
import random

import cv2
import numpy as np
import torch
from torch.utils.data import Dataset

from utils.transforms import get_affine_transform
from utils.transforms import affine_transform
from utils.transforms import fliplr_joints


logger = logging.getLogger(__name__)


class JointsDataset(Dataset):
    def __init__(self, cfg, root, image_set, is_train, transform=None):
        self.num_joints = 0
        self.pixel_std = 200
        self.flip_pairs = []
        self.parent_ids = []

        self.is_train = is_train
        self.root = root
        self.image_set = image_set

        self.output_path = cfg.OUTPUT_DIR
        self.data_format = cfg.DATASET.DATA_FORMAT

        self.scale_factor = cfg.DATASET.SCALE_FACTOR
        self.rotation_factor = cfg.DATASET.ROT_FACTOR
        self.flip = cfg.DATASET.FLIP

        self.image_size = cfg.MODEL.IMAGE_SIZE
        self.target_type = cfg.MODEL.EXTRA.TARGET_TYPE
        self.heatmap_size = cfg.MODEL.EXTRA.HEATMAP_SIZE
        self.sigma = cfg.MODEL.EXTRA.SIGMA

        self.transform = transform
        self.db = []

    def _get_db(self):
        raise NotImplementedError

    def evaluate(self, cfg, preds, output_dir, *args, **kwargs):
        raise NotImplementedError

    def __len__(self):
        return len(self.db)

    def __getitem__(self, idx):
        db_rec = copy.deepcopy(self.db[idx])

        image_file = db_rec['image']
        filename = db_rec['filename'] if 'filename' in db_rec else ''
        imgnum = db_rec['imgnum'] if 'imgnum' in db_rec else ''

        if self.data_format == 'zip':
            from utils import zipreader
            data_numpy = zipreader.imread(
                image_file, cv2.IMREAD_COLOR | cv2.IMREAD_IGNORE_ORIENTATION)
        else:
            data_numpy = cv2.imread(
                image_file, cv2.IMREAD_COLOR | cv2.IMREAD_IGNORE_ORIENTATION)

        if data_numpy is None:
            # cv2.imread returns None instead of raising on an unreadable file
            logger.error('=> fail to read {}'.format(image_file))
            raise ValueError('fail to read {}'.format(image_file))

        joints = db_rec['joints_3d']
        joints_vis = db_rec['joints_3d_vis']

        c = db_rec['center']
        s = db_rec['scale']
        score = db_rec['score'] if 'score' in db_rec else 1
        r = 0

        if self.is_train:
            sf = self.scale_factor
            rf = self.rotation_factor
            s = s * np.clip(np.random.randn()*sf + 1, 1 - sf, 1 + sf)
            r = np.clip(np.random.randn()*rf, -rf*2, rf*2) \
                if random.random() <= 0.6 else 0

            if self.flip and random.random() <= 0.5:
                data_numpy = data_numpy[:, ::-1, :]
                joints, joints_vis = fliplr_joints(
                    joints, joints_vis, data_numpy.shape[1], self.flip_pairs)
                c[0] = data_numpy.shape[1] - c[0] - 1

        trans = get_affine_transform(c, s, r, self.image_size)
        input = cv2.warpAffine(
            data_numpy,
            trans,
            (int(self.image_size[0]), int(self.image_size[1])),
            flags=cv2.INTER_LINEAR)

        if self.transform:
            input = self.transform(input)

        for i in range(self.num_joints):
            if joints_vis[i, 0] > 0.0:
                joints[i, 0:2] = affine_transform(joints[i, 0:2], trans)

        target, target_weight = self.generate_target(joints, joints_vis)

        target = torch.from_numpy(target)
        target_weight = torch.from_numpy(target_weight)

        meta = {
            'image': image_file,
            'filename': filename,
            'imgnum': imgnum,
            'joints': joints,
            'joints_vis': joints_vis,
            'center': c,
            'scale': s,
            'rotation': r,
            'score': score
        }

        return input, target, target_weight, meta

    def select_data(self, db):
        db_selected = []
        for rec in db:
            num_vis = 0
            joints_x = 0.0
            joints_y = 0.0
            for joint, joint_vis in zip(
                    rec['joints_3d'], rec['joints_3d_vis']):
                if joint_vis[0] <= 0:
                    continue
                num_vis += 1

                joints_x += joint[0]
                joints_y += joint[1]
            if num_vis == 0:
                continue

            joints_x, joints_y = joints_x / num_vis, joints_y / num_vis

            area = rec['scale'][0] * rec['scale'][1] * (self.pixel_std**2)
            joints_center = np.array([joints_x, joints_y])
            bbox_center = np.array(rec['center'])
            diff_norm2 = np.linalg.norm((joints_center-bbox_center), 2)
            ks = np.exp(-1.0*(diff_norm2**2) / ((0.2)**2*2.0*area))

            metric = (0.2 / 16) * num_vis + 0.45 - 0.2 / 16
            if ks > metric:
                db_selected.append(rec)

        logger.info('=> num db: {}'.format(len(db)))
        logger.info('=> num selected db: {}'.format(len(db_selected)))
        return db_selected

    def generate_target(self, joints, joints_vis):
        '''
        :param joints: [num_joints, 3]
        :param joints_vis: [num_joints, 3]
        :return: target, target_weight(1: visible, 0: invisible)
        '''
        target_weight = np.ones((self.num_joints, 1), dtype=np.float32)
        target_weight[:, 0] = joints_vis[:, 0]

        assert self.target_type == 'gaussian', \
            'Only gaussian targets are supported for now!'

        if self.target_type == 'gaussian':
            target = np.zeros((self.num_joints,
                               self.heatmap_size[1],
                               self.heatmap_size[0]),
                              dtype=np.float32)

            tmp_size = self.sigma * 3

            for joint_id in range(self.num_joints):
                feat_stride = self.image_size / self.heatmap_size
                mu_x = int(joints[joint_id][0] / feat_stride[0] + 0.5)
                mu_y = int(joints[joint_id][1] / feat_stride[1] + 0.5)
                # Check that any part of the gaussian is in-bounds
                ul = [int(mu_x - tmp_size), int(mu_y - tmp_size)]
                br = [int(mu_x + tmp_size + 1), int(mu_y + tmp_size + 1)]
                if ul[0] >= self.heatmap_size[0] or ul[1] >= self.heatmap_size[1] \
                        or br[0] < 0 or br[1] < 0:
                    # If not, skip this joint and zero its weight
                    target_weight[joint_id] = 0
                    continue

                # Generate gaussian
                size = 2 * tmp_size + 1
                x = np.arange(0, size, 1, np.float32)
                y = x[:, np.newaxis]
                x0 = y0 = size // 2
                # The gaussian is not normalized; we want the center value to equal 1
                g = np.exp(- ((x - x0) ** 2 + (y - y0) ** 2) / (2 * self.sigma ** 2))

                # Usable gaussian range
                g_x = max(0, -ul[0]), min(br[0], self.heatmap_size[0]) - ul[0]
                g_y = max(0, -ul[1]), min(br[1], self.heatmap_size[1]) - ul[1]
                # Image range
                img_x = max(0, ul[0]), min(br[0], self.heatmap_size[0])
                img_y = max(0, ul[1]), min(br[1], self.heatmap_size[1])

                v = target_weight[joint_id]
                if v > 0.5:
                    target[joint_id][img_y[0]:img_y[1], img_x[0]:img_x[1]] = \
                        g[g_y[0]:g_y[1], g_x[0]:g_x[1]]

        return target, target_weight
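The unnormalized Gaussian written into each heatmap peaks at exactly 1 at the quantized joint location. A standalone sketch of that patch computation (illustrative only; assumes sigma = 2, as in the default configs):

import numpy as np

sigma = 2
tmp_size = sigma * 3
size = 2 * tmp_size + 1                      # 13x13 patch
x = np.arange(0, size, 1, np.float32)
y = x[:, np.newaxis]
x0 = y0 = size // 2
g = np.exp(-((x - x0) ** 2 + (y - y0) ** 2) / (2 * sigma ** 2))
assert g.max() == 1.0                        # peak value is 1 at the joint pixel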
@@ -0,0 +1,12 @@
# ------------------------------------------------------------------------------
# Copyright (c) Microsoft
# Licensed under the MIT License.
# Written by Bin Xiao (Bin.Xiao@microsoft.com)
# ------------------------------------------------------------------------------

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

from .mpii import MPIIDataset as mpii
from .coco import COCODataset as coco
@@ -0,0 +1,404 @@
# ------------------------------------------------------------------------------
# Copyright (c) Microsoft
# Licensed under the MIT License.
# Written by Bin Xiao (Bin.Xiao@microsoft.com)
# ------------------------------------------------------------------------------

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import logging
import os
import pickle
from collections import defaultdict
from collections import OrderedDict

import json_tricks as json
import numpy as np
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

from dataset.JointsDataset import JointsDataset
from nms.nms import oks_nms


logger = logging.getLogger(__name__)


class COCODataset(JointsDataset):
    '''
    "keypoints": {
        0: "nose",
        1: "left_eye",
        2: "right_eye",
        3: "left_ear",
        4: "right_ear",
        5: "left_shoulder",
        6: "right_shoulder",
        7: "left_elbow",
        8: "right_elbow",
        9: "left_wrist",
        10: "right_wrist",
        11: "left_hip",
        12: "right_hip",
        13: "left_knee",
        14: "right_knee",
        15: "left_ankle",
        16: "right_ankle"
    },
    "skeleton": [
        [16,14],[14,12],[17,15],[15,13],[12,13],[6,12],[7,13], [6,7],[6,8],
        [7,9],[8,10],[9,11],[2,3],[1,2],[1,3],[2,4],[3,5],[4,6],[5,7]]
    '''
    def __init__(self, cfg, root, image_set, is_train, transform=None):
        super().__init__(cfg, root, image_set, is_train, transform)
        self.nms_thre = cfg.TEST.NMS_THRE
        self.image_thre = cfg.TEST.IMAGE_THRE
        self.oks_thre = cfg.TEST.OKS_THRE
        self.in_vis_thre = cfg.TEST.IN_VIS_THRE
        self.bbox_file = cfg.TEST.COCO_BBOX_FILE
        self.use_gt_bbox = cfg.TEST.USE_GT_BBOX
        self.image_width = cfg.MODEL.IMAGE_SIZE[0]
        self.image_height = cfg.MODEL.IMAGE_SIZE[1]
        self.aspect_ratio = self.image_width * 1.0 / self.image_height
        self.pixel_std = 200
        self.coco = COCO(self._get_ann_file_keypoint())

        # deal with class names
        cats = [cat['name']
                for cat in self.coco.loadCats(self.coco.getCatIds())]
        self.classes = ['__background__'] + cats
        logger.info('=> classes: {}'.format(self.classes))
        self.num_classes = len(self.classes)
        self._class_to_ind = dict(zip(self.classes, range(self.num_classes)))
        self._class_to_coco_ind = dict(zip(cats, self.coco.getCatIds()))
        self._coco_ind_to_class_ind = dict([(self._class_to_coco_ind[cls],
                                             self._class_to_ind[cls])
                                            for cls in self.classes[1:]])

        # load image file names
        self.image_set_index = self._load_image_set_index()
        self.num_images = len(self.image_set_index)
        logger.info('=> num_images: {}'.format(self.num_images))

        self.num_joints = 17
        self.flip_pairs = [[1, 2], [3, 4], [5, 6], [7, 8],
                           [9, 10], [11, 12], [13, 14], [15, 16]]
        self.parent_ids = None

        self.db = self._get_db()

        if is_train and cfg.DATASET.SELECT_DATA:
            self.db = self.select_data(self.db)

        logger.info('=> load {} samples'.format(len(self.db)))

    def _get_ann_file_keypoint(self):
        """ self.root / annotations / person_keypoints_train2017.json """
        prefix = 'person_keypoints' \
            if 'test' not in self.image_set else 'image_info'
        return os.path.join(self.root, 'annotations',
                            prefix + '_' + self.image_set + '.json')

    def _load_image_set_index(self):
        """ image id: int """
        image_ids = self.coco.getImgIds()
        return image_ids

    def _get_db(self):
        if self.is_train or self.use_gt_bbox:
            # use ground truth bbox
            gt_db = self._load_coco_keypoint_annotations()
        else:
            # use bbox from detection
            gt_db = self._load_coco_person_detection_results()
        return gt_db

    def _load_coco_keypoint_annotations(self):
        """ ground truth bbox and keypoints """
        gt_db = []
        for index in self.image_set_index:
            gt_db.extend(self._load_coco_keypoint_annotation_kernel(index))
        return gt_db

    def _load_coco_keypoint_annotation_kernel(self, index):
        """
        coco ann: [u'segmentation', u'area', u'iscrowd', u'image_id', u'bbox', u'category_id', u'id']
        iscrowd:
            crowd instances are handled by marking their overlaps with all categories to -1
            and later excluded in training
        bbox:
            [x1, y1, w, h]
        :param index: coco image id
        :return: db entry
        """
        im_ann = self.coco.loadImgs(index)[0]
        width = im_ann['width']
        height = im_ann['height']

        annIds = self.coco.getAnnIds(imgIds=index, iscrowd=False)
        objs = self.coco.loadAnns(annIds)

        # sanitize bboxes
        valid_objs = []
        for obj in objs:
            x, y, w, h = obj['bbox']
            x1 = np.max((0, x))
            y1 = np.max((0, y))
            x2 = np.min((width - 1, x1 + np.max((0, w - 1))))
            y2 = np.min((height - 1, y1 + np.max((0, h - 1))))
            if obj['area'] > 0 and x2 >= x1 and y2 >= y1:
                # obj['clean_bbox'] = [x1, y1, x2, y2]
                obj['clean_bbox'] = [x1, y1, x2-x1, y2-y1]
                valid_objs.append(obj)
        objs = valid_objs

        rec = []
        for obj in objs:
            cls = self._coco_ind_to_class_ind[obj['category_id']]
            if cls != 1:
                continue

            # ignore objs without keypoints annotation
            if max(obj['keypoints']) == 0:
                continue

            joints_3d = np.zeros((self.num_joints, 3), dtype=np.float64)
            joints_3d_vis = np.zeros((self.num_joints, 3), dtype=np.float64)
            for ipt in range(self.num_joints):
                joints_3d[ipt, 0] = obj['keypoints'][ipt * 3 + 0]
                joints_3d[ipt, 1] = obj['keypoints'][ipt * 3 + 1]
                joints_3d[ipt, 2] = 0
                t_vis = obj['keypoints'][ipt * 3 + 2]
                if t_vis > 1:
                    t_vis = 1
                joints_3d_vis[ipt, 0] = t_vis
                joints_3d_vis[ipt, 1] = t_vis
                joints_3d_vis[ipt, 2] = 0

            center, scale = self._box2cs(obj['clean_bbox'][:4])
            rec.append({
                'image': self.image_path_from_index(index),
                'center': center,
                'scale': scale,
                'joints_3d': joints_3d,
                'joints_3d_vis': joints_3d_vis,
                'filename': '',
                'imgnum': 0,
            })

        return rec

    def _box2cs(self, box):
        x, y, w, h = box[:4]
        return self._xywh2cs(x, y, w, h)

    def _xywh2cs(self, x, y, w, h):
        center = np.zeros((2), dtype=np.float32)
        center[0] = x + w * 0.5
        center[1] = y + h * 0.5

        if w > self.aspect_ratio * h:
            h = w * 1.0 / self.aspect_ratio
        elif w < self.aspect_ratio * h:
            w = h * self.aspect_ratio
        scale = np.array(
            [w * 1.0 / self.pixel_std, h * 1.0 / self.pixel_std],
            dtype=np.float32)
        if center[0] != -1:
            scale = scale * 1.25

        return center, scale

    def image_path_from_index(self, index):
        """ example: images / train2017 / 000000119993.jpg """
        file_name = '%012d.jpg' % index
        if '2014' in self.image_set:
            file_name = 'COCO_%s_' % self.image_set + file_name

        prefix = 'test2017' if 'test' in self.image_set else self.image_set

        data_name = prefix + '.zip@' if self.data_format == 'zip' else prefix

        image_path = os.path.join(
            self.root, 'images', data_name, file_name)

        return image_path

    def _load_coco_person_detection_results(self):
        all_boxes = None
        with open(self.bbox_file, 'r') as f:
            all_boxes = json.load(f)

        if not all_boxes:
            logger.error('=> Failed to load %s!' % self.bbox_file)
            return None

        logger.info('=> Total boxes: {}'.format(len(all_boxes)))

        kpt_db = []
        num_boxes = 0
        for n_img in range(0, len(all_boxes)):
            det_res = all_boxes[n_img]
            if det_res['category_id'] != 1:
                continue
            img_name = self.image_path_from_index(det_res['image_id'])
            box = det_res['bbox']
            score = det_res['score']

            if score < self.image_thre:
                continue

            num_boxes = num_boxes + 1

            center, scale = self._box2cs(box)
            joints_3d = np.zeros((self.num_joints, 3), dtype=np.float64)
            joints_3d_vis = np.ones(
                (self.num_joints, 3), dtype=np.float64)
            kpt_db.append({
                'image': img_name,
                'center': center,
                'scale': scale,
                'score': score,
                'joints_3d': joints_3d,
                'joints_3d_vis': joints_3d_vis,
            })

        logger.info('=> Total boxes after filtering low score@{}: {}'.format(
            self.image_thre, num_boxes))
        return kpt_db

    # need double check this API and classes field
    def evaluate(self, cfg, preds, output_dir, all_boxes, img_path,
                 *args, **kwargs):
        res_folder = os.path.join(output_dir, 'results')
        if not os.path.exists(res_folder):
            os.makedirs(res_folder)
        res_file = os.path.join(
            res_folder, 'keypoints_%s_results.json' % self.image_set)

        # person x (keypoints)
        _kpts = []
        for idx, kpt in enumerate(preds):
            _kpts.append({
                'keypoints': kpt,
                'center': all_boxes[idx][0:2],
                'scale': all_boxes[idx][2:4],
                'area': all_boxes[idx][4],
                'score': all_boxes[idx][5],
                'image': int(img_path[idx][-16:-4])
            })
        # image x person x (keypoints)
        kpts = defaultdict(list)
        for kpt in _kpts:
            kpts[kpt['image']].append(kpt)

        # rescoring and oks nms
        num_joints = self.num_joints
        in_vis_thre = self.in_vis_thre
        oks_thre = self.oks_thre
        oks_nmsed_kpts = []
        for img in kpts.keys():
            img_kpts = kpts[img]
            for n_p in img_kpts:
                box_score = n_p['score']
                kpt_score = 0
                valid_num = 0
                for n_jt in range(0, num_joints):
                    t_s = n_p['keypoints'][n_jt][2]
                    if t_s > in_vis_thre:
                        kpt_score = kpt_score + t_s
                        valid_num = valid_num + 1
                if valid_num != 0:
                    kpt_score = kpt_score / valid_num
                # rescoring
                n_p['score'] = kpt_score * box_score
            keep = oks_nms([img_kpts[i] for i in range(len(img_kpts))],
                           oks_thre)
            if len(keep) == 0:
                oks_nmsed_kpts.append(img_kpts)
            else:
                oks_nmsed_kpts.append([img_kpts[_keep] for _keep in keep])

        self._write_coco_keypoint_results(
            oks_nmsed_kpts, res_file)
        if 'test' not in self.image_set:
            info_str = self._do_python_keypoint_eval(
                res_file, res_folder)
            name_value = OrderedDict(info_str)
            return name_value, name_value['AP']
        else:
            return {'Null': 0}, 0

    def _write_coco_keypoint_results(self, keypoints, res_file):
        data_pack = [{'cat_id': self._class_to_coco_ind[cls],
                      'cls_ind': cls_ind,
                      'cls': cls,
                      'ann_type': 'keypoints',
                      'keypoints': keypoints
                      }
                     for cls_ind, cls in enumerate(self.classes) if cls != '__background__']

        results = self._coco_keypoint_results_one_category_kernel(data_pack[0])
        logger.info('=> Writing results json to %s' % res_file)
        with open(res_file, 'w') as f:
            json.dump(results, f, sort_keys=True, indent=4)
        try:
            with open(res_file) as f:
                json.load(f)
        except Exception:
            content = []
            with open(res_file, 'r') as f:
                for line in f:
                    content.append(line)
            content[-1] = ']'
            with open(res_file, 'w') as f:
                for c in content:
                    f.write(c)

    def _coco_keypoint_results_one_category_kernel(self, data_pack):
        cat_id = data_pack['cat_id']
        keypoints = data_pack['keypoints']
        cat_results = []

        for img_kpts in keypoints:
            if len(img_kpts) == 0:
                continue

            _key_points = np.array([img_kpts[k]['keypoints']
                                    for k in range(len(img_kpts))])
            key_points = np.zeros(
                (_key_points.shape[0], self.num_joints * 3), dtype=np.float64)

            for ipt in range(self.num_joints):
                key_points[:, ipt * 3 + 0] = _key_points[:, ipt, 0]
                key_points[:, ipt * 3 + 1] = _key_points[:, ipt, 1]
                key_points[:, ipt * 3 + 2] = _key_points[:, ipt, 2]  # keypoints score.

            result = [{'image_id': img_kpts[k]['image'],
                       'category_id': cat_id,
                       'keypoints': list(key_points[k]),
                       'score': img_kpts[k]['score'],
                       'center': list(img_kpts[k]['center']),
                       'scale': list(img_kpts[k]['scale'])
                       } for k in range(len(img_kpts))]
            cat_results.extend(result)

        return cat_results

    def _do_python_keypoint_eval(self, res_file, res_folder):
        coco_dt = self.coco.loadRes(res_file)
        coco_eval = COCOeval(self.coco, coco_dt, 'keypoints')
        coco_eval.params.useSegm = None
        coco_eval.evaluate()
        coco_eval.accumulate()
        coco_eval.summarize()

        # stock pycocotools' summarize() prints the table but returns None,
        # so collect the keypoint stats explicitly
        stats_names = ['AP', 'Ap .5', 'AP .75', 'AP (M)', 'AP (L)',
                       'AR', 'AR .5', 'AR .75', 'AR (M)', 'AR (L)']
        info_str = list(zip(stats_names, coco_eval.stats))

        eval_file = os.path.join(
            res_folder, 'keypoints_%s_results.pkl' % self.image_set)

        with open(eval_file, 'wb') as f:
            pickle.dump(coco_eval, f, pickle.HIGHEST_PROTOCOL)
            logger.info('=> coco eval results saved to %s' % eval_file)

        return info_str
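For reference, a standalone rerun of the _xywh2cs arithmetic above (illustrative values; assumes the default 256x192 input, i.e. aspect_ratio = 192/256 = 0.75 and pixel_std = 200):

import numpy as np

aspect_ratio, pixel_std = 192.0 / 256.0, 200
x, y, w, h = 10, 20, 50, 100                     # hypothetical COCO box
center = np.array([x + w * 0.5, y + h * 0.5])    # -> [35., 70.]
w = h * aspect_ratio if w < aspect_ratio * h else w  # pad the box to 3:4 -> w = 75
scale = np.array([w, h]) / pixel_std * 1.25      # -> [0.46875, 0.625]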
@@ -0,0 +1,176 @@
# ------------------------------------------------------------------------------
# Copyright (c) Microsoft
# Licensed under the MIT License.
# Written by Bin Xiao (Bin.Xiao@microsoft.com)
# ------------------------------------------------------------------------------

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

from collections import OrderedDict
import logging
import os
import json_tricks as json

import numpy as np
from scipy.io import loadmat, savemat

from dataset.JointsDataset import JointsDataset


logger = logging.getLogger(__name__)


class MPIIDataset(JointsDataset):
    def __init__(self, cfg, root, image_set, is_train, transform=None):
        super().__init__(cfg, root, image_set, is_train, transform)

        self.num_joints = 16
        self.flip_pairs = [[0, 5], [1, 4], [2, 3], [10, 15], [11, 14], [12, 13]]
        self.parent_ids = [1, 2, 6, 6, 3, 4, 6, 6, 7, 8, 11, 12, 7, 7, 13, 14]

        self.db = self._get_db()

        if is_train and cfg.DATASET.SELECT_DATA:
            self.db = self.select_data(self.db)

        logger.info('=> load {} samples'.format(len(self.db)))

    def _get_db(self):
        # create train/val split
        file_name = os.path.join(self.root,
                                 'annot',
                                 self.image_set+'.json')
        with open(file_name) as anno_file:
            anno = json.load(anno_file)

        gt_db = []
        for a in anno:
            image_name = a['image']

            c = np.array(a['center'], dtype=np.float64)
            s = np.array([a['scale'], a['scale']], dtype=np.float64)

            # Adjust center/scale slightly to avoid cropping limbs
            if c[0] != -1:
                c[1] = c[1] + 15 * s[1]
                s = s * 1.25

            # MPII annotations come from MATLAB and are 1-based,
            # so convert them to 0-based indices first
            c = c - 1

            joints_3d = np.zeros((self.num_joints, 3), dtype=np.float64)
            joints_3d_vis = np.zeros((self.num_joints, 3), dtype=np.float64)
            if self.image_set != 'test':
                joints = np.array(a['joints'])
                joints[:, 0:2] = joints[:, 0:2] - 1
                joints_vis = np.array(a['joints_vis'])
                assert len(joints) == self.num_joints, \
                    'joint num diff: {} vs {}'.format(len(joints),
                                                      self.num_joints)

                joints_3d[:, 0:2] = joints[:, 0:2]
                joints_3d_vis[:, 0] = joints_vis[:]
                joints_3d_vis[:, 1] = joints_vis[:]

            image_dir = 'images.zip@' if self.data_format == 'zip' else 'images'
            gt_db.append({
                'image': os.path.join(self.root, image_dir, image_name),
                'center': c,
                'scale': s,
                'joints_3d': joints_3d,
                'joints_3d_vis': joints_3d_vis,
                'filename': '',
                'imgnum': 0,
            })

        return gt_db

    def evaluate(self, cfg, preds, output_dir, *args, **kwargs):
        # convert 0-based index to 1-based index
        preds = preds[:, :, 0:2] + 1.0

        if output_dir:
            pred_file = os.path.join(output_dir, 'pred.mat')
            savemat(pred_file, mdict={'preds': preds})

        if 'test' in cfg.DATASET.TEST_SET:
            return {'Null': 0.0}, 0.0

        SC_BIAS = 0.6
        threshold = 0.5

        gt_file = os.path.join(cfg.DATASET.ROOT,
                               'annot',
                               'gt_{}.mat'.format(cfg.DATASET.TEST_SET))
        gt_dict = loadmat(gt_file)
        dataset_joints = gt_dict['dataset_joints']
        jnt_missing = gt_dict['jnt_missing']
        pos_gt_src = gt_dict['pos_gt_src']
        headboxes_src = gt_dict['headboxes_src']

        pos_pred_src = np.transpose(preds, [1, 2, 0])

        head = np.where(dataset_joints == 'head')[1][0]
        lsho = np.where(dataset_joints == 'lsho')[1][0]
        lelb = np.where(dataset_joints == 'lelb')[1][0]
        lwri = np.where(dataset_joints == 'lwri')[1][0]
        lhip = np.where(dataset_joints == 'lhip')[1][0]
        lkne = np.where(dataset_joints == 'lkne')[1][0]
        lank = np.where(dataset_joints == 'lank')[1][0]

        rsho = np.where(dataset_joints == 'rsho')[1][0]
        relb = np.where(dataset_joints == 'relb')[1][0]
        rwri = np.where(dataset_joints == 'rwri')[1][0]
        rkne = np.where(dataset_joints == 'rkne')[1][0]
        rank = np.where(dataset_joints == 'rank')[1][0]
        rhip = np.where(dataset_joints == 'rhip')[1][0]

        jnt_visible = 1 - jnt_missing
        uv_error = pos_pred_src - pos_gt_src
        uv_err = np.linalg.norm(uv_error, axis=1)
        headsizes = headboxes_src[1, :, :] - headboxes_src[0, :, :]
        headsizes = np.linalg.norm(headsizes, axis=0)
        headsizes *= SC_BIAS
        scale = np.multiply(headsizes, np.ones((len(uv_err), 1)))
        scaled_uv_err = np.divide(uv_err, scale)
        scaled_uv_err = np.multiply(scaled_uv_err, jnt_visible)
        jnt_count = np.sum(jnt_visible, axis=1)
        less_than_threshold = np.multiply((scaled_uv_err <= threshold),
                                          jnt_visible)
        PCKh = np.divide(100.*np.sum(less_than_threshold, axis=1), jnt_count)

        # sweep thresholds from 0 to 0.5 so Mean@0.1 can be reported
        rng = np.arange(0, 0.5+0.01, 0.01)
        pckAll = np.zeros((len(rng), 16))

        for r in range(len(rng)):
            threshold = rng[r]
            less_than_threshold = np.multiply(scaled_uv_err <= threshold,
                                              jnt_visible)
            pckAll[r, :] = np.divide(100.*np.sum(less_than_threshold, axis=1),
                                     jnt_count)

        PCKh = np.ma.array(PCKh, mask=False)
        PCKh.mask[6:8] = True

        jnt_count = np.ma.array(jnt_count, mask=False)
        jnt_count.mask[6:8] = True
        jnt_ratio = jnt_count / np.sum(jnt_count).astype(np.float64)

        name_value = [
            ('Head', PCKh[head]),
            ('Shoulder', 0.5 * (PCKh[lsho] + PCKh[rsho])),
            ('Elbow', 0.5 * (PCKh[lelb] + PCKh[relb])),
            ('Wrist', 0.5 * (PCKh[lwri] + PCKh[rwri])),
            ('Hip', 0.5 * (PCKh[lhip] + PCKh[rhip])),
            ('Knee', 0.5 * (PCKh[lkne] + PCKh[rkne])),
            ('Ankle', 0.5 * (PCKh[lank] + PCKh[rank])),
            ('Mean', np.sum(PCKh * jnt_ratio)),
            ('Mean@0.1', np.sum(pckAll[11, :] * jnt_ratio))
        ]
        name_value = OrderedDict(name_value)

        return name_value, name_value['Mean']
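A toy rerun of the PCKh normalization used above (values are made up): each pixel error is divided by SC_BIAS times the head-box diagonal, then compared against the threshold.

import numpy as np

headsizes = np.array([30.0, 25.0, 40.0]) * 0.6   # SC_BIAS-scaled head sizes
uv_err = np.array([[5.0, 20.0, 8.0]])            # one joint across three images
scaled = uv_err / headsizes                      # -> [[0.278, 1.333, 0.333]]
pckh = 100.0 * np.mean(scaled <= 0.5, axis=1)    # -> [66.67], PCKh@0.5 per joint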
@@ -0,0 +1,11 @@
# ------------------------------------------------------------------------------
# Copyright (c) Microsoft
# Licensed under the MIT License.
# Written by Bin Xiao (Bin.Xiao@microsoft.com)
# ------------------------------------------------------------------------------

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import models.pose_resnet
@@ -0,0 +1,257 @@
# ------------------------------------------------------------------------------
# Copyright (c) Microsoft
# Licensed under the MIT License.
# Written by Bin Xiao (Bin.Xiao@microsoft.com)
# ------------------------------------------------------------------------------

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import os
import logging

import torch
import torch.nn as nn


BN_MOMENTUM = 0.1
logger = logging.getLogger(__name__)


def conv3x3(in_planes, out_planes, stride=1):
    """3x3 convolution with padding"""
    return nn.Conv2d(in_planes, out_planes, kernel_size=3, stride=stride,
                     padding=1, bias=False)


class BasicBlock(nn.Module):
    expansion = 1

    def __init__(self, inplanes, planes, stride=1, downsample=None):
        super(BasicBlock, self).__init__()
        self.conv1 = conv3x3(inplanes, planes, stride)
        self.bn1 = nn.BatchNorm2d(planes, momentum=BN_MOMENTUM)
        self.relu = nn.ReLU(inplace=True)
        self.conv2 = conv3x3(planes, planes)
        self.bn2 = nn.BatchNorm2d(planes, momentum=BN_MOMENTUM)
        self.downsample = downsample
        self.stride = stride

    def forward(self, x):
        residual = x

        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)

        out = self.conv2(out)
        out = self.bn2(out)

        if self.downsample is not None:
            residual = self.downsample(x)

        out += residual
        out = self.relu(out)

        return out


class Bottleneck(nn.Module):
    expansion = 4

    def __init__(self, inplanes, planes, stride=1, downsample=None):
        super(Bottleneck, self).__init__()
        self.conv1 = nn.Conv2d(inplanes, planes, kernel_size=1, bias=False)
        self.bn1 = nn.BatchNorm2d(planes, momentum=BN_MOMENTUM)
        self.conv2 = nn.Conv2d(planes, planes, kernel_size=3, stride=stride,
                               padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(planes, momentum=BN_MOMENTUM)
        self.conv3 = nn.Conv2d(planes, planes * self.expansion, kernel_size=1,
                               bias=False)
        self.bn3 = nn.BatchNorm2d(planes * self.expansion,
                                  momentum=BN_MOMENTUM)
        self.relu = nn.ReLU(inplace=True)
        self.downsample = downsample
        self.stride = stride

    def forward(self, x):
        residual = x

        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)

        out = self.conv2(out)
        out = self.bn2(out)
        out = self.relu(out)

        out = self.conv3(out)
        out = self.bn3(out)

        if self.downsample is not None:
            residual = self.downsample(x)

        out += residual
        out = self.relu(out)

        return out


class PoseResNet(nn.Module):

    def __init__(self, block, layers, cfg, **kwargs):
        self.inplanes = 64
        extra = cfg.MODEL.EXTRA
        self.deconv_with_bias = extra.DECONV_WITH_BIAS

        super(PoseResNet, self).__init__()
        self.conv1 = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3,
                               bias=False)
        self.bn1 = nn.BatchNorm2d(64, momentum=BN_MOMENTUM)
        self.relu = nn.ReLU(inplace=True)
        self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
        self.layer1 = self._make_layer(block, 64, layers[0])
        self.layer2 = self._make_layer(block, 128, layers[1], stride=2)
        self.layer3 = self._make_layer(block, 256, layers[2], stride=2)
        self.layer4 = self._make_layer(block, 512, layers[3], stride=2)

        # used for deconv layers
        self.deconv_layers = self._make_deconv_layer(
            extra.NUM_DECONV_LAYERS,
            extra.NUM_DECONV_FILTERS,
            extra.NUM_DECONV_KERNELS,
        )

        self.final_layer = nn.Conv2d(
            in_channels=extra.NUM_DECONV_FILTERS[-1],
            out_channels=cfg.MODEL.NUM_JOINTS,
            kernel_size=extra.FINAL_CONV_KERNEL,
            stride=1,
            padding=1 if extra.FINAL_CONV_KERNEL == 3 else 0
        )

    def _make_layer(self, block, planes, blocks, stride=1):
        downsample = None
        if stride != 1 or self.inplanes != planes * block.expansion:
            downsample = nn.Sequential(
                nn.Conv2d(self.inplanes, planes * block.expansion,
                          kernel_size=1, stride=stride, bias=False),
                nn.BatchNorm2d(planes * block.expansion, momentum=BN_MOMENTUM),
            )

        layers = []
        layers.append(block(self.inplanes, planes, stride, downsample))
        self.inplanes = planes * block.expansion
        for i in range(1, blocks):
            layers.append(block(self.inplanes, planes))

        return nn.Sequential(*layers)

    def _get_deconv_cfg(self, deconv_kernel, index):
        if deconv_kernel == 4:
            padding = 1
            output_padding = 0
        elif deconv_kernel == 3:
            padding = 1
            output_padding = 1
        elif deconv_kernel == 2:
            padding = 0
            output_padding = 0
        else:
            raise ValueError(
                'deconv kernel size {} is not supported'.format(deconv_kernel))

        return deconv_kernel, padding, output_padding

    def _make_deconv_layer(self, num_layers, num_filters, num_kernels):
        assert num_layers == len(num_filters), \
            'ERROR: num_deconv_layers is different from len(num_deconv_filters)'
        assert num_layers == len(num_kernels), \
            'ERROR: num_deconv_layers is different from len(num_deconv_kernels)'

        layers = []
        for i in range(num_layers):
            kernel, padding, output_padding = \
                self._get_deconv_cfg(num_kernels[i], i)

            planes = num_filters[i]
            layers.append(
                nn.ConvTranspose2d(
                    in_channels=self.inplanes,
                    out_channels=planes,
                    kernel_size=kernel,
                    stride=2,
                    padding=padding,
                    output_padding=output_padding,
                    bias=self.deconv_with_bias))
            layers.append(nn.BatchNorm2d(planes, momentum=BN_MOMENTUM))
            layers.append(nn.ReLU(inplace=True))
            self.inplanes = planes

        return nn.Sequential(*layers)

    def forward(self, x):
        x = self.conv1(x)
        x = self.bn1(x)
        x = self.relu(x)
        x = self.maxpool(x)

        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.layer4(x)

        x = self.deconv_layers(x)
        x = self.final_layer(x)

        return x

    def init_weights(self, pretrained=''):
        if os.path.isfile(pretrained):
            logger.info('=> init deconv weights from normal distribution')
            for name, m in self.deconv_layers.named_modules():
                if isinstance(m, nn.ConvTranspose2d):
                    logger.info('=> init {}.weight as normal(0, 0.001)'.format(name))
                    logger.info('=> init {}.bias as 0'.format(name))
                    nn.init.normal_(m.weight, std=0.001)
                    if self.deconv_with_bias:
                        nn.init.constant_(m.bias, 0)
                elif isinstance(m, nn.BatchNorm2d):
                    logger.info('=> init {}.weight as 1'.format(name))
                    logger.info('=> init {}.bias as 0'.format(name))
                    nn.init.constant_(m.weight, 1)
                    nn.init.constant_(m.bias, 0)
            logger.info('=> init final conv weights from normal distribution')
            for m in self.final_layer.modules():
                if isinstance(m, nn.Conv2d):
                    # nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
                    logger.info('=> init final_layer weight as normal(0, 0.001), bias as 0')
                    nn.init.normal_(m.weight, std=0.001)
                    nn.init.constant_(m.bias, 0)

            pretrained_state_dict = torch.load(pretrained)
            logger.info('=> loading pretrained model {}'.format(pretrained))
            self.load_state_dict(pretrained_state_dict, strict=False)
        else:
            logger.error('=> imagenet pretrained model does not exist')
            logger.error('=> please download it first')
            raise ValueError('imagenet pretrained model does not exist')


resnet_spec = {18: (BasicBlock, [2, 2, 2, 2]),
               34: (BasicBlock, [3, 4, 6, 3]),
               50: (Bottleneck, [3, 4, 6, 3]),
               101: (Bottleneck, [3, 4, 23, 3]),
               152: (Bottleneck, [3, 8, 36, 3])}


def get_pose_net(cfg, is_train, **kwargs):
    num_layers = cfg.MODEL.EXTRA.NUM_LAYERS

    block_class, layers = resnet_spec[num_layers]

    model = PoseResNet(block_class, layers, cfg, **kwargs)

    if is_train and cfg.MODEL.INIT_WEIGHTS:
        model.init_weights(cfg.MODEL.PRETRAINED)

    return model
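A quick shape check for get_pose_net (a sketch: the SimpleNamespace config below only mimics the fields this file reads and is an assumption, not the project's real config object):

from types import SimpleNamespace
import torch

extra = SimpleNamespace(DECONV_WITH_BIAS=False, NUM_DECONV_LAYERS=3,
                        NUM_DECONV_FILTERS=[256, 256, 256],
                        NUM_DECONV_KERNELS=[4, 4, 4],
                        FINAL_CONV_KERNEL=1, NUM_LAYERS=50)
cfg = SimpleNamespace(MODEL=SimpleNamespace(EXTRA=extra, NUM_JOINTS=17,
                                            INIT_WEIGHTS=False, PRETRAINED=''))
model = get_pose_net(cfg, is_train=False)
model.eval()
out = model(torch.randn(1, 3, 256, 192))
# the stem downsamples by 32 and the three stride-2 deconvs upsample by 8,
# so a 256x192 input yields out.shape == torch.Size([1, 17, 64, 48])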
@@ -0,0 +1,67 @@
# ------------------------------------------------------------------------------
# Copyright (c) Microsoft
# Licensed under the MIT License.
# Modified from py-faster-rcnn (https://github.com/rbgirshick/py-faster-rcnn)
# ------------------------------------------------------------------------------

import numpy as np
cimport numpy as np

cdef inline np.float32_t max(np.float32_t a, np.float32_t b):
    return a if a >= b else b

cdef inline np.float32_t min(np.float32_t a, np.float32_t b):
    return a if a <= b else b

def cpu_nms(np.ndarray[np.float32_t, ndim=2] dets, float thresh):
    cdef np.ndarray[np.float32_t, ndim=1] x1 = dets[:, 0]
    cdef np.ndarray[np.float32_t, ndim=1] y1 = dets[:, 1]
    cdef np.ndarray[np.float32_t, ndim=1] x2 = dets[:, 2]
    cdef np.ndarray[np.float32_t, ndim=1] y2 = dets[:, 3]
    cdef np.ndarray[np.float32_t, ndim=1] scores = dets[:, 4]

    cdef np.ndarray[np.float32_t, ndim=1] areas = (x2 - x1 + 1) * (y2 - y1 + 1)
    # cast to np.int_ so the dtype matches the np.int_t buffer declaration
    cdef np.ndarray[np.int_t, ndim=1] order = scores.argsort()[::-1].astype(np.int_)

    cdef int ndets = dets.shape[0]
    cdef np.ndarray[np.int_t, ndim=1] suppressed = \
        np.zeros((ndets), dtype=np.int_)

    # nominal indices
    cdef int _i, _j
    # sorted indices
    cdef int i, j
    # temp variables for box i's (the box currently under consideration)
    cdef np.float32_t ix1, iy1, ix2, iy2, iarea
    # variables for computing overlap with box j (lower scoring box)
    cdef np.float32_t xx1, yy1, xx2, yy2
    cdef np.float32_t w, h
    cdef np.float32_t inter, ovr

    keep = []
    for _i in range(ndets):
        i = order[_i]
        if suppressed[i] == 1:
            continue
        keep.append(i)
        ix1 = x1[i]
        iy1 = y1[i]
        ix2 = x2[i]
        iy2 = y2[i]
        iarea = areas[i]
        for _j in range(_i + 1, ndets):
            j = order[_j]
            if suppressed[j] == 1:
                continue
            xx1 = max(ix1, x1[j])
            yy1 = max(iy1, y1[j])
            xx2 = min(ix2, x2[j])
            yy2 = min(iy2, y2[j])
            w = max(0.0, xx2 - xx1 + 1)
            h = max(0.0, yy2 - yy1 + 1)
            inter = w * h
            ovr = inter / (iarea + areas[j] - inter)
            if ovr >= thresh:
                suppressed[j] = 1

    return keep
@@ -0,0 +1,2 @@
void _nms(int* keep_out, int* num_out, const float* boxes_host, int boxes_num,
          int boxes_dim, float nms_overlap_thresh, int device_id);
@@ -0,0 +1,30 @@
# ------------------------------------------------------------------------------
# Copyright (c) Microsoft
# Licensed under the MIT License.
# Modified from py-faster-rcnn (https://github.com/rbgirshick/py-faster-rcnn)
# ------------------------------------------------------------------------------

import numpy as np
cimport numpy as np

assert sizeof(int) == sizeof(np.int32_t)

cdef extern from "gpu_nms.hpp":
    void _nms(np.int32_t*, int*, np.float32_t*, int, int, float, int)

def gpu_nms(np.ndarray[np.float32_t, ndim=2] dets, float thresh,
            np.int32_t device_id=0):
    cdef int boxes_num = dets.shape[0]
    cdef int boxes_dim = dets.shape[1]
    cdef int num_out
    cdef np.ndarray[np.int32_t, ndim=1] \
        keep = np.zeros(boxes_num, dtype=np.int32)
    cdef np.ndarray[np.float32_t, ndim=1] \
        scores = dets[:, 4]
    cdef np.ndarray[np.int32_t, ndim=1] \
        order = scores.argsort()[::-1].astype(np.int32)
    cdef np.ndarray[np.float32_t, ndim=2] \
        sorted_dets = dets[order, :]
    _nms(&keep[0], &num_out, &sorted_dets[0, 0], boxes_num, boxes_dim, thresh, device_id)
    keep = keep[:num_out]
    return list(order[keep])
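A hedged usage sketch (assumes the extension has been built with the setup.py later in this commit):

# dets: an (N, 5) float32 array of [x1, y1, x2, y2, score]; the return value
# indexes into the original (unsorted) array, highest score first.
#   from nms.gpu_nms import gpu_nms
#   keep = gpu_nms(dets, 0.3, device_id=0)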
@@ -0,0 +1,123 @@
# ------------------------------------------------------------------------------
# Copyright (c) Microsoft
# Licensed under the MIT License.
# Modified from py-faster-rcnn (https://github.com/rbgirshick/py-faster-rcnn)
# ------------------------------------------------------------------------------

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import numpy as np

from .cpu_nms import cpu_nms
from .gpu_nms import gpu_nms


def py_nms_wrapper(thresh):
    def _nms(dets):
        return nms(dets, thresh)
    return _nms


def cpu_nms_wrapper(thresh):
    def _nms(dets):
        return cpu_nms(dets, thresh)
    return _nms


def gpu_nms_wrapper(thresh, device_id):
    def _nms(dets):
        return gpu_nms(dets, thresh, device_id)
    return _nms


def nms(dets, thresh):
    """
    Greedily select high-scoring boxes, suppressing any box whose overlap
    with an already selected box is >= thresh.
    :param dets: [[x1, y1, x2, y2, score]]
    :param thresh: retain overlap < thresh
    :return: indexes to keep
    """
    if dets.shape[0] == 0:
        return []

    x1 = dets[:, 0]
    y1 = dets[:, 1]
    x2 = dets[:, 2]
    y2 = dets[:, 3]
    scores = dets[:, 4]

    areas = (x2 - x1 + 1) * (y2 - y1 + 1)
    order = scores.argsort()[::-1]

    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])

        w = np.maximum(0.0, xx2 - xx1 + 1)
        h = np.maximum(0.0, yy2 - yy1 + 1)
        inter = w * h
        ovr = inter / (areas[i] + areas[order[1:]] - inter)

        inds = np.where(ovr <= thresh)[0]
        order = order[inds + 1]

    return keep

def oks_iou(g, d, a_g, a_d, sigmas=None, in_vis_thre=None):
    if not isinstance(sigmas, np.ndarray):
        sigmas = np.array([.26, .25, .25, .35, .35, .79, .79, .72, .72, .62, .62, 1.07, 1.07, .87, .87, .89, .89]) / 10.0
    variances = (sigmas * 2) ** 2
    xg = g[0::3]
    yg = g[1::3]
    vg = g[2::3]
    ious = np.zeros((d.shape[0]))
    for n_d in range(0, d.shape[0]):
        xd = d[n_d, 0::3]
        yd = d[n_d, 1::3]
        vd = d[n_d, 2::3]
        dx = xd - xg
        dy = yd - yg
        e = (dx ** 2 + dy ** 2) / variances / ((a_g + a_d[n_d]) / 2 + np.spacing(1)) / 2
        if in_vis_thre is not None:
            # element-wise AND: keep only joints visible in both detections
            ind = np.logical_and(vg > in_vis_thre, vd > in_vis_thre)
            e = e[ind]
        ious[n_d] = np.sum(np.exp(-e)) / e.shape[0] if e.shape[0] != 0 else 0.0
    return ious

def oks_nms(kpts_db, thresh, sigmas=None, in_vis_thre=None):
    """
    Greedily select high-confidence keypoint sets, suppressing any whose OKS
    overlap with an already selected set is >= thresh.
    :param kpts_db: list of dicts with 'score', 'keypoints' and 'area'
    :param thresh: retain overlap < thresh
    :return: indexes to keep
    """
    if len(kpts_db) == 0:
        return []

    scores = np.array([kpts_db[i]['score'] for i in range(len(kpts_db))])
    kpts = np.array([kpts_db[i]['keypoints'].flatten() for i in range(len(kpts_db))])
    areas = np.array([kpts_db[i]['area'] for i in range(len(kpts_db))])

    order = scores.argsort()[::-1]

    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)

        oks_ovr = oks_iou(kpts[i], kpts[order[1:]], areas[i], areas[order[1:]], sigmas, in_vis_thre)

        inds = np.where(oks_ovr <= thresh)[0]
        order = order[inds + 1]

    return keep
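A toy run of the greedy IoU NMS above (illustrative numbers; note the +1 in the area and intersection arithmetic, which treats coordinates as inclusive pixel indices):

import numpy as np

dets = np.array([[0, 0, 10, 10, 0.9],
                 [1, 1, 11, 11, 0.8],    # IoU with box 0 is 100/142, about 0.70
                 [20, 20, 30, 30, 0.7]], dtype=np.float32)
print(nms(dets, 0.5))                    # -> [0, 2]; box 1 is suppressed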
@@ -0,0 +1,143 @@
// ------------------------------------------------------------------
// Copyright (c) Microsoft
// Licensed under The MIT License
// Modified from MATLAB Faster R-CNN (https://github.com/shaoqingren/faster_rcnn)
// ------------------------------------------------------------------

#include "gpu_nms.hpp"
#include <vector>
#include <iostream>
#include <cstring>  // for memset

#define CUDA_CHECK(condition) \
  /* Code block avoids redefinition of cudaError_t error */ \
  do { \
    cudaError_t error = condition; \
    if (error != cudaSuccess) { \
      std::cout << cudaGetErrorString(error) << std::endl; \
    } \
  } while (0)

#define DIVUP(m,n) ((m) / (n) + ((m) % (n) > 0))
int const threadsPerBlock = sizeof(unsigned long long) * 8;

__device__ inline float devIoU(float const * const a, float const * const b) {
  float left = max(a[0], b[0]), right = min(a[2], b[2]);
  float top = max(a[1], b[1]), bottom = min(a[3], b[3]);
  float width = max(right - left + 1, 0.f), height = max(bottom - top + 1, 0.f);
  float interS = width * height;
  float Sa = (a[2] - a[0] + 1) * (a[3] - a[1] + 1);
  float Sb = (b[2] - b[0] + 1) * (b[3] - b[1] + 1);
  return interS / (Sa + Sb - interS);
}

__global__ void nms_kernel(const int n_boxes, const float nms_overlap_thresh,
                           const float *dev_boxes, unsigned long long *dev_mask) {
  const int row_start = blockIdx.y;
  const int col_start = blockIdx.x;

  // if (row_start > col_start) return;

  const int row_size =
        min(n_boxes - row_start * threadsPerBlock, threadsPerBlock);
  const int col_size =
        min(n_boxes - col_start * threadsPerBlock, threadsPerBlock);

  __shared__ float block_boxes[threadsPerBlock * 5];
  if (threadIdx.x < col_size) {
    block_boxes[threadIdx.x * 5 + 0] =
        dev_boxes[(threadsPerBlock * col_start + threadIdx.x) * 5 + 0];
    block_boxes[threadIdx.x * 5 + 1] =
        dev_boxes[(threadsPerBlock * col_start + threadIdx.x) * 5 + 1];
    block_boxes[threadIdx.x * 5 + 2] =
        dev_boxes[(threadsPerBlock * col_start + threadIdx.x) * 5 + 2];
    block_boxes[threadIdx.x * 5 + 3] =
        dev_boxes[(threadsPerBlock * col_start + threadIdx.x) * 5 + 3];
    block_boxes[threadIdx.x * 5 + 4] =
        dev_boxes[(threadsPerBlock * col_start + threadIdx.x) * 5 + 4];
  }
  __syncthreads();

  if (threadIdx.x < row_size) {
    const int cur_box_idx = threadsPerBlock * row_start + threadIdx.x;
    const float *cur_box = dev_boxes + cur_box_idx * 5;
    int i = 0;
    unsigned long long t = 0;
    int start = 0;
    if (row_start == col_start) {
      start = threadIdx.x + 1;
    }
    for (i = start; i < col_size; i++) {
      if (devIoU(cur_box, block_boxes + i * 5) > nms_overlap_thresh) {
        t |= 1ULL << i;
      }
    }
    const int col_blocks = DIVUP(n_boxes, threadsPerBlock);
    dev_mask[cur_box_idx * col_blocks + col_start] = t;
  }
}

void _set_device(int device_id) {
  int current_device;
  CUDA_CHECK(cudaGetDevice(&current_device));
  if (current_device == device_id) {
    return;
  }
  // The call to cudaSetDevice must come before any calls to Get, which
  // may perform initialization using the GPU.
  CUDA_CHECK(cudaSetDevice(device_id));
}

void _nms(int* keep_out, int* num_out, const float* boxes_host, int boxes_num,
          int boxes_dim, float nms_overlap_thresh, int device_id) {
  _set_device(device_id);

  float* boxes_dev = NULL;
  unsigned long long* mask_dev = NULL;

  const int col_blocks = DIVUP(boxes_num, threadsPerBlock);

  CUDA_CHECK(cudaMalloc(&boxes_dev,
                        boxes_num * boxes_dim * sizeof(float)));
  CUDA_CHECK(cudaMemcpy(boxes_dev,
                        boxes_host,
                        boxes_num * boxes_dim * sizeof(float),
                        cudaMemcpyHostToDevice));

  CUDA_CHECK(cudaMalloc(&mask_dev,
                        boxes_num * col_blocks * sizeof(unsigned long long)));

  dim3 blocks(DIVUP(boxes_num, threadsPerBlock),
              DIVUP(boxes_num, threadsPerBlock));
  dim3 threads(threadsPerBlock);
  nms_kernel<<<blocks, threads>>>(boxes_num,
                                  nms_overlap_thresh,
                                  boxes_dev,
                                  mask_dev);

  std::vector<unsigned long long> mask_host(boxes_num * col_blocks);
  CUDA_CHECK(cudaMemcpy(&mask_host[0],
                        mask_dev,
                        sizeof(unsigned long long) * boxes_num * col_blocks,
                        cudaMemcpyDeviceToHost));

  std::vector<unsigned long long> remv(col_blocks);
  memset(&remv[0], 0, sizeof(unsigned long long) * col_blocks);

  int num_to_keep = 0;
  for (int i = 0; i < boxes_num; i++) {
    int nblock = i / threadsPerBlock;
    int inblock = i % threadsPerBlock;

    if (!(remv[nblock] & (1ULL << inblock))) {
      keep_out[num_to_keep++] = i;
      unsigned long long *p = &mask_host[0] + i * col_blocks;
      for (int j = nblock; j < col_blocks; j++) {
        remv[j] |= p[j];
      }
    }
  }
  *num_out = num_to_keep;

  CUDA_CHECK(cudaFree(boxes_dev));
  CUDA_CHECK(cudaFree(mask_dev));
}
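The kernel writes one 64-bit suppression mask per (box, column-block) pair; the host loop then keeps a box only if no higher-scoring kept box has flagged it. A pure-Python sketch of that host-side reduction (illustrative, not part of the commit):

def keep_from_mask(mask_host, boxes_num, threads_per_block=64):
    col_blocks = (boxes_num + threads_per_block - 1) // threads_per_block
    remv = [0] * col_blocks          # accumulated suppression bits
    keep = []
    for i in range(boxes_num):       # boxes arrive sorted by score
        nblock, inblock = divmod(i, threads_per_block)
        if not (remv[nblock] >> inblock) & 1:
            keep.append(i)
            for j in range(nblock, col_blocks):
                remv[j] |= mask_host[i * col_blocks + j]
    return keep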
@@ -0,0 +1,140 @@
# --------------------------------------------------------
# Copyright (c) Microsoft
# Licensed under The MIT License [see LICENSE for details]
# Modified from py-faster-rcnn (https://github.com/rbgirshick/py-faster-rcnn)
# --------------------------------------------------------

import os
from os.path import join as pjoin
from setuptools import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext
import numpy as np


def find_in_path(name, path):
    "Find a file in a search path"
    # Adapted from
    # http://code.activestate.com/recipes/52224-find-a-file-given-a-search-path/
    for dir in path.split(os.pathsep):
        binpath = pjoin(dir, name)
        if os.path.exists(binpath):
            return os.path.abspath(binpath)
    return None


def locate_cuda():
    """Locate the CUDA environment on the system

    Returns a dict with keys 'home', 'nvcc', 'include', and 'lib64'
    and values giving the absolute path to each directory.

    Starts by looking for the CUDAHOME env variable. If not found, everything
    is based on finding 'nvcc' in the PATH.
    """

    # first check if the CUDAHOME env variable is in use
    if 'CUDAHOME' in os.environ:
        home = os.environ['CUDAHOME']
        nvcc = pjoin(home, 'bin', 'nvcc')
    else:
        # otherwise, search the PATH for NVCC
        default_path = pjoin(os.sep, 'usr', 'local', 'cuda', 'bin')
        nvcc = find_in_path('nvcc', os.environ['PATH'] + os.pathsep + default_path)
        if nvcc is None:
            raise EnvironmentError('The nvcc binary could not be '
                'located in your $PATH. Either add it to your path, or set $CUDAHOME')
        home = os.path.dirname(os.path.dirname(nvcc))

    cudaconfig = {'home':home, 'nvcc':nvcc,
                  'include': pjoin(home, 'include'),
                  'lib64': pjoin(home, 'lib64')}
    for k, v in cudaconfig.items():
        if not os.path.exists(v):
            raise EnvironmentError('The CUDA %s path could not be located in %s' % (k, v))

    return cudaconfig


CUDA = locate_cuda()


# Obtain the numpy include directory. This logic works across numpy versions.
try:
    numpy_include = np.get_include()
except AttributeError:
    numpy_include = np.get_numpy_include()


def customize_compiler_for_nvcc(self):
    """inject deep into distutils to customize how the dispatch
    to gcc/nvcc works.

    If you subclass UnixCCompiler, it's not trivial to get your subclass
    injected in, and still have the right customizations (i.e.
    distutils.sysconfig.customize_compiler) run on it. So instead of going
    the OO route, I have this. Note, it's kind of like a weird functional
    subclassing going on."""

    # tell the compiler it can process .cu
    self.src_extensions.append('.cu')

    # save references to the default compiler_so and _compile methods
    default_compiler_so = self.compiler_so
    default_compile = self._compile

    # now redefine the _compile method. This gets executed for each
    # object but distutils doesn't have the ability to change compilers
    # based on source extension: we add it.
    def _compile(obj, src, ext, cc_args, extra_postargs, pp_opts):
        if os.path.splitext(src)[1] == '.cu':
            # use cuda for .cu files
            self.set_executable('compiler_so', CUDA['nvcc'])
            # use only a subset of the extra_postargs, which are 1-1 translated
            # from the extra_compile_args in the Extension class
            postargs = extra_postargs['nvcc']
        else:
            postargs = extra_postargs['gcc']

        default_compile(obj, src, ext, cc_args, postargs, pp_opts)
        # reset the default compiler_so, which we might have changed for cuda
        self.compiler_so = default_compiler_so

    # inject our redefined _compile method into the class
    self._compile = _compile


# run the customize_compiler
class custom_build_ext(build_ext):
    def build_extensions(self):
        customize_compiler_for_nvcc(self.compiler)
        build_ext.build_extensions(self)


ext_modules = [
    Extension(
        "cpu_nms",
        ["cpu_nms.pyx"],
        extra_compile_args={'gcc': ["-Wno-cpp", "-Wno-unused-function"]},
        include_dirs=[numpy_include]
    ),
    Extension('gpu_nms',
              ['nms_kernel.cu', 'gpu_nms.pyx'],
              library_dirs=[CUDA['lib64']],
              libraries=['cudart'],
              language='c++',
              runtime_library_dirs=[CUDA['lib64']],
              # this syntax is specific to this build system
              # we're only going to use certain compiler args with nvcc and not with
              # gcc; the implementation of this trick is in
              # customize_compiler_for_nvcc() above
              extra_compile_args={'gcc': ["-Wno-unused-function"],
                                  'nvcc': ['-arch=sm_35',
                                           '--ptxas-options=-v',
                                           '-c',
                                           '--compiler-options',
                                           "'-fPIC'"]},
              include_dirs=[numpy_include, CUDA['include']]
              ),
]

setup(
    name='nms',
    ext_modules=ext_modules,
    # inject our custom trigger
    cmdclass={'build_ext': custom_build_ext},
)
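A conventional in-place build invocation, assuming this setup.py sits next to the .pyx/.cu sources (the exact directory layout is not shown in this diff); it cythonizes cpu_nms.pyx with gcc and compiles nms_kernel.cu with nvcc before linking against cudart:

    python setup.py build_ext --inplace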
@@ -0,0 +1,123 @@
# ------------------------------------------------------------------------------
# Copyright (c) Microsoft
# Licensed under the MIT License.
# Written by Bin Xiao (Bin.Xiao@microsoft.com)
# ------------------------------------------------------------------------------

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import numpy as np
import cv2


def flip_back(output_flipped, matched_parts):
    '''
    output_flipped: numpy.ndarray(batch_size, num_joints, height, width)
    '''
    assert output_flipped.ndim == 4,\
        'output_flipped should be [batch_size, num_joints, height, width]'

    output_flipped = output_flipped[:, :, :, ::-1]

    for pair in matched_parts:
        tmp = output_flipped[:, pair[0], :, :].copy()
        output_flipped[:, pair[0], :, :] = output_flipped[:, pair[1], :, :]
        output_flipped[:, pair[1], :, :] = tmp

    return output_flipped


def fliplr_joints(joints, joints_vis, width, matched_parts):
    """
    flip coords
    """
    # Flip horizontal
    joints[:, 0] = width - joints[:, 0] - 1

    # Change left-right parts
    for pair in matched_parts:
        joints[pair[0], :], joints[pair[1], :] = \
            joints[pair[1], :], joints[pair[0], :].copy()
        joints_vis[pair[0], :], joints_vis[pair[1], :] = \
            joints_vis[pair[1], :], joints_vis[pair[0], :].copy()

    return joints*joints_vis, joints_vis


def transform_preds(coords, center, scale, output_size):
    target_coords = np.zeros(coords.shape)
    trans = get_affine_transform(center, scale, 0, output_size, inv=1)
    for p in range(coords.shape[0]):
        target_coords[p, 0:2] = affine_transform(coords[p, 0:2], trans)
    return target_coords


def get_affine_transform(center,
                         scale,
                         rot,
                         output_size,
                         shift=np.array([0, 0], dtype=np.float32),
                         inv=0):
    if not isinstance(scale, np.ndarray) and not isinstance(scale, list):
        scale = np.array([scale, scale])

    scale_tmp = scale * 200.0
    src_w = scale_tmp[0]
    dst_w = output_size[0]
    dst_h = output_size[1]

    rot_rad = np.pi * rot / 180
    src_dir = get_dir([0, src_w * -0.5], rot_rad)
    dst_dir = np.array([0, dst_w * -0.5], np.float32)

    src = np.zeros((3, 2), dtype=np.float32)
    dst = np.zeros((3, 2), dtype=np.float32)
    src[0, :] = center + scale_tmp * shift
    src[1, :] = center + src_dir + scale_tmp * shift
    dst[0, :] = [dst_w * 0.5, dst_h * 0.5]
    dst[1, :] = np.array([dst_w * 0.5, dst_h * 0.5]) + dst_dir

    src[2:, :] = get_3rd_point(src[0, :], src[1, :])
    dst[2:, :] = get_3rd_point(dst[0, :], dst[1, :])

    if inv:
        trans = cv2.getAffineTransform(np.float32(dst), np.float32(src))
    else:
        trans = cv2.getAffineTransform(np.float32(src), np.float32(dst))

    return trans


def affine_transform(pt, t):
    new_pt = np.array([pt[0], pt[1], 1.]).T
    new_pt = np.dot(t, new_pt)
    return new_pt[:2]


def get_3rd_point(a, b):
    direct = a - b
    return b + np.array([-direct[1], direct[0]], dtype=np.float32)


def get_dir(src_point, rot_rad):
    sn, cs = np.sin(rot_rad), np.cos(rot_rad)

    src_result = [0, 0]
    src_result[0] = src_point[0] * cs - src_point[1] * sn
    src_result[1] = src_point[0] * sn + src_point[1] * cs

    return src_result


def crop(img, center, scale, output_size, rot=0):
    trans = get_affine_transform(center, scale, rot, output_size)

    dst_img = cv2.warpAffine(img,
                             trans,
                             (int(output_size[0]), int(output_size[1])),
                             flags=cv2.INTER_LINEAR)

    return dst_img
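A quick sanity check of the affine helpers above (illustrative values; the scale follows the pixel_std = 200 convention used throughout this commit):

import numpy as np

center = np.array([320.0, 240.0])
scale = np.array([1.28, 1.28])          # 1.28 * 200 = a 256-pixel source box
trans = get_affine_transform(center, scale, 0, [192, 256])
pt = affine_transform(np.array([320.0, 240.0]), trans)
# the crop center lands on the output center: pt is approximately [96., 128.]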
@@ -0,0 +1,83 @@
# ------------------------------------------------------------------------------
# Copyright (c) Microsoft
# Licensed under the MIT License.
# Written by Bin Xiao (Bin.Xiao@microsoft.com)
# ------------------------------------------------------------------------------

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import os
import logging
import time
from pathlib import Path

import torch
import torch.optim as optim

from core.config import get_model_name


def create_logger(cfg, cfg_name, phase='train'):
    root_output_dir = Path(cfg.OUTPUT_DIR)
    # set up logger
    if not root_output_dir.exists():
        print('=> creating {}'.format(root_output_dir))
        root_output_dir.mkdir()

    dataset = cfg.DATASET.DATASET + '_' + cfg.DATASET.HYBRID_JOINTS_TYPE \
        if cfg.DATASET.HYBRID_JOINTS_TYPE else cfg.DATASET.DATASET
    dataset = dataset.replace(':', '_')
    model, _ = get_model_name(cfg)
    cfg_name = os.path.basename(cfg_name).split('.')[0]

    final_output_dir = root_output_dir / dataset / model / cfg_name

    print('=> creating {}'.format(final_output_dir))
    final_output_dir.mkdir(parents=True, exist_ok=True)

    time_str = time.strftime('%Y-%m-%d-%H-%M')
    log_file = '{}_{}_{}.log'.format(cfg_name, time_str, phase)
    final_log_file = final_output_dir / log_file
    head = '%(asctime)-15s %(message)s'
    logging.basicConfig(filename=str(final_log_file),
                        format=head)
    logger = logging.getLogger()
    logger.setLevel(logging.INFO)
    console = logging.StreamHandler()
    logging.getLogger('').addHandler(console)

    tensorboard_log_dir = Path(cfg.LOG_DIR) / dataset / model / \
        (cfg_name + '_' + time_str)
    print('=> creating {}'.format(tensorboard_log_dir))
    tensorboard_log_dir.mkdir(parents=True, exist_ok=True)

    return logger, str(final_output_dir), str(tensorboard_log_dir)


def get_optimizer(cfg, model):
    optimizer = None
    if cfg.TRAIN.OPTIMIZER == 'sgd':
        optimizer = optim.SGD(
            model.parameters(),
            lr=cfg.TRAIN.LR,
            momentum=cfg.TRAIN.MOMENTUM,
            weight_decay=cfg.TRAIN.WD,
            nesterov=cfg.TRAIN.NESTEROV
        )
    elif cfg.TRAIN.OPTIMIZER == 'adam':
        optimizer = optim.Adam(
            model.parameters(),
            lr=cfg.TRAIN.LR
        )

    return optimizer


def save_checkpoint(states, is_best, output_dir,
                    filename='checkpoint.pth.tar'):
    torch.save(states, os.path.join(output_dir, filename))
    if is_best and 'state_dict' in states:
        torch.save(states['state_dict'],
                   os.path.join(output_dir, 'model_best.pth.tar'))
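A minimal sketch of wiring these helpers together — assumed names only (`config` stands for the EasyDict exported by core.config, `model` for an already-built pose network, and the cfg file name is hypothetical; none of this is part of the commit):

```python
# Assumes: from core.config import config; model = <a pose network>
logger, final_output_dir, tb_log_dir = create_logger(
    config, 'my_experiment.yaml', 'train')   # hypothetical cfg name
optimizer = get_optimizer(config, model)     # SGD or Adam per config.TRAIN.OPTIMIZER

save_checkpoint({
    'epoch': 1,
    'state_dict': model.state_dict(),
    'perf': 0.0,
    'optimizer': optimizer.state_dict(),
}, is_best=True, output_dir=final_output_dir)  # is_best also writes model_best.pth.tar
```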
@@ -0,0 +1,141 @@
# ------------------------------------------------------------------------------
# Copyright (c) Microsoft
# Licensed under the MIT License.
# Written by Bin Xiao (Bin.Xiao@microsoft.com)
# ------------------------------------------------------------------------------

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import math

import numpy as np
import torchvision
import cv2

from core.inference import get_max_preds


def save_batch_image_with_joints(batch_image, batch_joints, batch_joints_vis,
                                 file_name, nrow=8, padding=2):
    '''
    batch_image: [batch_size, channel, height, width]
    batch_joints: [batch_size, num_joints, 3]
    batch_joints_vis: [batch_size, num_joints, 1]
    '''
    grid = torchvision.utils.make_grid(batch_image, nrow, padding, True)
    ndarr = grid.mul(255).clamp(0, 255).byte().permute(1, 2, 0).cpu().numpy()
    ndarr = ndarr.copy()

    nmaps = batch_image.size(0)
    xmaps = min(nrow, nmaps)
    ymaps = int(math.ceil(float(nmaps) / xmaps))
    height = int(batch_image.size(2) + padding)
    width = int(batch_image.size(3) + padding)
    k = 0
    for y in range(ymaps):
        for x in range(xmaps):
            if k >= nmaps:
                break
            joints = batch_joints[k]
            joints_vis = batch_joints_vis[k]

            for joint, joint_vis in zip(joints, joints_vis):
                # shift joint coordinates into this image's cell of the grid
                joint[0] = x * width + padding + joint[0]
                joint[1] = y * height + padding + joint[1]
                if joint_vis[0]:
                    cv2.circle(ndarr, (int(joint[0]), int(joint[1])), 2, [255, 0, 0], 2)
            k = k + 1
    cv2.imwrite(file_name, ndarr)


def save_batch_heatmaps(batch_image, batch_heatmaps, file_name,
                        normalize=True):
    '''
    batch_image: [batch_size, channel, height, width]
    batch_heatmaps: [batch_size, num_joints, height, width]
    file_name: saved file name
    '''
    if normalize:
        batch_image = batch_image.clone()
        min_val = float(batch_image.min())
        max_val = float(batch_image.max())

        batch_image.add_(-min_val).div_(max_val - min_val + 1e-5)

    batch_size = batch_heatmaps.size(0)
    num_joints = batch_heatmaps.size(1)
    heatmap_height = batch_heatmaps.size(2)
    heatmap_width = batch_heatmaps.size(3)

    # one row per sample: the resized input image followed by one
    # heatmap overlay per joint
    grid_image = np.zeros((batch_size*heatmap_height,
                           (num_joints+1)*heatmap_width,
                           3),
                          dtype=np.uint8)

    preds, maxvals = get_max_preds(batch_heatmaps.detach().cpu().numpy())

    for i in range(batch_size):
        image = batch_image[i].mul(255)\
                              .clamp(0, 255)\
                              .byte()\
                              .permute(1, 2, 0)\
                              .cpu().numpy()
        heatmaps = batch_heatmaps[i].mul(255)\
                                    .clamp(0, 255)\
                                    .byte()\
                                    .cpu().numpy()

        resized_image = cv2.resize(image,
                                   (int(heatmap_width), int(heatmap_height)))

        height_begin = heatmap_height * i
        height_end = heatmap_height * (i + 1)
        for j in range(num_joints):
            cv2.circle(resized_image,
                       (int(preds[i][j][0]), int(preds[i][j][1])),
                       1, [0, 0, 255], 1)
            heatmap = heatmaps[j, :, :]
            colored_heatmap = cv2.applyColorMap(heatmap, cv2.COLORMAP_JET)
            masked_image = colored_heatmap*0.7 + resized_image*0.3
            cv2.circle(masked_image,
                       (int(preds[i][j][0]), int(preds[i][j][1])),
                       1, [0, 0, 255], 1)

            width_begin = heatmap_width * (j+1)
            width_end = heatmap_width * (j+2)
            grid_image[height_begin:height_end, width_begin:width_end, :] = \
                masked_image
            # grid_image[height_begin:height_end, width_begin:width_end, :] = \
            #     colored_heatmap*0.7 + resized_image*0.3

        grid_image[height_begin:height_end, 0:heatmap_width, :] = resized_image

    cv2.imwrite(file_name, grid_image)


def save_debug_images(config, input, meta, target, joints_pred, output,
                      prefix):
    if not config.DEBUG.DEBUG:
        return

    if config.DEBUG.SAVE_BATCH_IMAGES_GT:
        save_batch_image_with_joints(
            input, meta['joints'], meta['joints_vis'],
            '{}_gt.jpg'.format(prefix)
        )
    if config.DEBUG.SAVE_BATCH_IMAGES_PRED:
        save_batch_image_with_joints(
            input, joints_pred, meta['joints_vis'],
            '{}_pred.jpg'.format(prefix)
        )
    if config.DEBUG.SAVE_HEATMAPS_GT:
        save_batch_heatmaps(
            input, target, '{}_hm_gt.jpg'.format(prefix)
        )
    if config.DEBUG.SAVE_HEATMAPS_PRED:
        save_batch_heatmaps(
            input, output, '{}_hm_pred.jpg'.format(prefix)
        )
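For reference, a hedged example of exercising the heatmap visualizer with random tensors, shaped per the docstrings above (16 joints, as in MPII, is an assumption; the file name is invented):

```python
import torch

batch_image = torch.rand(4, 3, 256, 192)     # [batch, channel, height, width]
batch_heatmaps = torch.rand(4, 16, 64, 48)   # [batch, num_joints, height, width]
save_batch_heatmaps(batch_image, batch_heatmaps, 'debug_heatmaps.jpg')
```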
@@ -0,0 +1,70 @@
# ------------------------------------------------------------------------------
# Copyright (c) Microsoft
# Licensed under the MIT License.
# Written by Bin Xiao (Bin.Xiao@microsoft.com)
# ------------------------------------------------------------------------------

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import os
import zipfile
import xml.etree.ElementTree as ET

import cv2
import numpy as np

_im_zfile = []
_xml_path_zip = []
_xml_zfile = []


def imread(filename, flags=cv2.IMREAD_COLOR):
    global _im_zfile
    path = filename
    # str.find returns -1 when '@' is absent; str.index would raise instead
    pos_at = path.find('@')
    if pos_at == -1:
        print("character '@' is not found from the given path '%s'" % (path))
        assert 0
    path_zip = path[0: pos_at]
    path_img = path[pos_at + 2:]
    if not os.path.isfile(path_zip):
        print("zip file '%s' is not found" % (path_zip))
        assert 0
    # reuse an already-opened archive when possible
    for i in range(len(_im_zfile)):
        if _im_zfile[i]['path'] == path_zip:
            data = _im_zfile[i]['zipfile'].read(path_img)
            return cv2.imdecode(np.frombuffer(data, np.uint8), flags)

    _im_zfile.append({
        'path': path_zip,
        'zipfile': zipfile.ZipFile(path_zip, 'r')
    })
    data = _im_zfile[-1]['zipfile'].read(path_img)

    return cv2.imdecode(np.frombuffer(data, np.uint8), flags)


def xmlread(filename):
    global _xml_path_zip
    global _xml_zfile
    path = filename
    pos_at = path.find('@')
    if pos_at == -1:
        print("character '@' is not found from the given path '%s'" % (path))
        assert 0
    path_zip = path[0: pos_at]
    path_xml = path[pos_at + 2:]
    if not os.path.isfile(path_zip):
        print("zip file '%s' is not found" % (path_zip))
        assert 0
    for i in range(len(_xml_path_zip)):
        if _xml_path_zip[i] == path_zip:
            data = _xml_zfile[i].open(path_xml)
            return ET.fromstring(data.read())
    _xml_path_zip.append(path_zip)
    print("read new xml file '%s'" % (path_zip))
    _xml_zfile.append(zipfile.ZipFile(path_zip, 'r'))
    data = _xml_zfile[-1].open(path_xml)
    return ET.fromstring(data.read())
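The path convention deserves a note: everything before '@' names the archive, and the member path starts two characters after it, so the slash right after '@' is skipped. A hypothetical call (paths invented):

```python
# 'images.zip@/train/0001.jpg' -> archive 'images.zip', member 'train/0001.jpg'
img = imread('data/images.zip@/train/0001.jpg')
```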
@@ -0,0 +1,23 @@
# ------------------------------------------------------------------------------
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.
# Written by Bin Xiao (Bin.Xiao@microsoft.com)
# ------------------------------------------------------------------------------

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import os.path as osp
import sys


def add_path(path):
    if path not in sys.path:
        sys.path.insert(0, path)


this_dir = osp.dirname(__file__)

lib_path = osp.join(this_dir, '..', 'lib')
add_path(lib_path)
@@ -0,0 +1,206 @@
# ------------------------------------------------------------------------------
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.
# Written by Bin Xiao (Bin.Xiao@microsoft.com)
# ------------------------------------------------------------------------------

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import argparse
import os
import pprint
import shutil

import torch
import torch.nn.parallel
import torch.backends.cudnn as cudnn
import torch.optim
import torch.utils.data
import torch.utils.data.distributed
import torchvision.transforms as transforms
from tensorboardX import SummaryWriter

import _init_paths
from core.config import config
from core.config import update_config
from core.config import update_dir
from core.config import get_model_name
from core.loss import JointsMSELoss
from core.function import train
from core.function import validate
from utils.utils import get_optimizer
from utils.utils import save_checkpoint
from utils.utils import create_logger

import dataset
import models


def parse_args():
    parser = argparse.ArgumentParser(description='Train keypoints network')
    # general
    parser.add_argument('--cfg',
                        help='experiment configure file name',
                        required=True,
                        type=str)

    args, rest = parser.parse_known_args()
    # update config
    update_config(args.cfg)

    # training
    parser.add_argument('--frequent',
                        help='frequency of logging',
                        default=config.PRINT_FREQ,
                        type=int)
    parser.add_argument('--gpus',
                        help='gpus',
                        type=str)
    parser.add_argument('--workers',
                        help='num of dataloader workers',
                        type=int)

    args = parser.parse_args()

    return args


def reset_config(config, args):
    if args.gpus:
        config.GPUS = args.gpus
    if args.workers:
        config.WORKERS = args.workers


def main():
    args = parse_args()
    reset_config(config, args)

    logger, final_output_dir, tb_log_dir = create_logger(
        config, args.cfg, 'train')

    logger.info(pprint.pformat(args))
    logger.info(pprint.pformat(config))

    # cudnn related setting
    cudnn.benchmark = config.CUDNN.BENCHMARK
    torch.backends.cudnn.deterministic = config.CUDNN.DETERMINISTIC
    torch.backends.cudnn.enabled = config.CUDNN.ENABLED

    model = eval('models.'+config.MODEL.NAME+'.get_pose_net')(
        config, is_train=True
    )

    # copy model file
    this_dir = os.path.dirname(__file__)
    shutil.copy2(
        os.path.join(this_dir, '../lib/models', config.MODEL.NAME + '.py'),
        final_output_dir)

    writer_dict = {
        'writer': SummaryWriter(log_dir=tb_log_dir),
        'train_global_steps': 0,
        'valid_global_steps': 0,
    }

    # the network consumes 3-channel images, so the dummy input for the
    # graph writer must be (batch, 3, height, width)
    dump_input = torch.rand((config.TRAIN.BATCH_SIZE,
                             3,
                             config.MODEL.IMAGE_SIZE[1],
                             config.MODEL.IMAGE_SIZE[0]))
    writer_dict['writer'].add_graph(model, (dump_input, ))

    gpus = [int(i) for i in config.GPUS.split(',')]
    model = torch.nn.DataParallel(model, device_ids=gpus).cuda()

    # define loss function (criterion) and optimizer
    criterion = JointsMSELoss(
        use_target_weight=config.LOSS.USE_TARGET_WEIGHT
    ).cuda()

    optimizer = get_optimizer(config, model)

    lr_scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, config.TRAIN.LR_STEP, config.TRAIN.LR_FACTOR
    )

    # Data loading code
    normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                     std=[0.229, 0.224, 0.225])
    train_dataset = eval('dataset.'+config.DATASET.DATASET)(
        config,
        config.DATASET.ROOT,
        config.DATASET.TRAIN_SET,
        True,
        transforms.Compose([
            transforms.ToTensor(),
            normalize,
        ])
    )
    valid_dataset = eval('dataset.'+config.DATASET.DATASET)(
        config,
        config.DATASET.ROOT,
        config.DATASET.TEST_SET,
        False,
        transforms.Compose([
            transforms.ToTensor(),
            normalize,
        ])
    )

    train_loader = torch.utils.data.DataLoader(
        train_dataset,
        batch_size=config.TRAIN.BATCH_SIZE*len(gpus),
        shuffle=config.TRAIN.SHUFFLE,
        num_workers=config.WORKERS,
        pin_memory=True
    )
    valid_loader = torch.utils.data.DataLoader(
        valid_dataset,
        batch_size=config.TEST.BATCH_SIZE*len(gpus),
        shuffle=False,
        num_workers=config.WORKERS,
        pin_memory=True
    )

    best_perf = 0.0
    best_model = False
    for epoch in range(config.TRAIN.BEGIN_EPOCH, config.TRAIN.END_EPOCH):
        lr_scheduler.step()

        # train for one epoch
        train(config, train_loader, model, criterion, optimizer, epoch,
              final_output_dir, tb_log_dir, writer_dict)

        # evaluate on validation set
        perf_indicator = validate(config, valid_loader, valid_dataset, model,
                                  criterion, final_output_dir, tb_log_dir,
                                  writer_dict)

        if perf_indicator > best_perf:
            best_perf = perf_indicator
            best_model = True
        else:
            best_model = False

        logger.info('=> saving checkpoint to {}'.format(final_output_dir))
        save_checkpoint({
            'epoch': epoch + 1,
            'model': get_model_name(config),
            'state_dict': model.state_dict(),
            'perf': perf_indicator,
            'optimizer': optimizer.state_dict(),
        }, best_model, final_output_dir)

    final_model_state_file = os.path.join(final_output_dir,
                                          'final_state.pth.tar')
    logger.info('saving final model state to {}'.format(
        final_model_state_file))
    torch.save(model.module.state_dict(), final_model_state_file)
    writer_dict['writer'].close()


if __name__ == '__main__':
    main()
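Training is launched by pointing this script at an experiment config, e.g. `python pose_estimation/train.py --cfg experiments/mpii/resnet50/256x256_d256x3_adam_lr1e-3.yaml --gpus 0,1` (script and config paths are illustrative); `--gpus` and `--workers` override `config.GPUS` and `config.WORKERS` via `reset_config` above.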
@@ -0,0 +1,165 @@
# ------------------------------------------------------------------------------
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.
# Written by Bin Xiao (Bin.Xiao@microsoft.com)
# ------------------------------------------------------------------------------

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import argparse
import os
import pprint

import torch
import torch.nn.parallel
import torch.backends.cudnn as cudnn
import torch.optim
import torch.utils.data
import torch.utils.data.distributed
import torchvision.transforms as transforms

import _init_paths
from core.config import config
from core.config import update_config
from core.config import update_dir
from core.loss import JointsMSELoss
from core.function import validate
from utils.utils import create_logger

import dataset
import models


def parse_args():
    parser = argparse.ArgumentParser(description='Validate keypoints network')
    # general
    parser.add_argument('--cfg',
                        help='experiment configure file name',
                        required=True,
                        type=str)

    args, rest = parser.parse_known_args()
    # update config
    update_config(args.cfg)

    # validation
    parser.add_argument('--frequent',
                        help='frequency of logging',
                        default=config.PRINT_FREQ,
                        type=int)
    parser.add_argument('--gpus',
                        help='gpus',
                        type=str)
    parser.add_argument('--workers',
                        help='num of dataloader workers',
                        type=int)
    parser.add_argument('--model-file',
                        help='model state file',
                        type=str)
    parser.add_argument('--use-detect-bbox',
                        help='use detect bbox',
                        action='store_true')
    parser.add_argument('--flip-test',
                        help='use flip test',
                        action='store_true')
    parser.add_argument('--post-process',
                        help='use post process',
                        action='store_true')
    parser.add_argument('--shift-heatmap',
                        help='shift heatmap',
                        action='store_true')
    parser.add_argument('--coco-bbox-file',
                        help='coco detection bbox file',
                        type=str)

    args = parser.parse_args()

    return args


def reset_config(config, args):
    if args.gpus:
        config.GPUS = args.gpus
    if args.workers:
        config.WORKERS = args.workers
    if args.use_detect_bbox:
        config.TEST.USE_GT_BBOX = not args.use_detect_bbox
    if args.flip_test:
        config.TEST.FLIP_TEST = args.flip_test
    if args.post_process:
        config.TEST.POST_PROCESS = args.post_process
    if args.shift_heatmap:
        config.TEST.SHIFT_HEATMAP = args.shift_heatmap
    if args.model_file:
        config.TEST.MODEL_FILE = args.model_file
    if args.coco_bbox_file:
        config.TEST.COCO_BBOX_FILE = args.coco_bbox_file


def main():
    args = parse_args()
    reset_config(config, args)

    logger, final_output_dir, tb_log_dir = create_logger(
        config, args.cfg, 'valid')

    logger.info(pprint.pformat(args))
    logger.info(pprint.pformat(config))

    # cudnn related setting
    cudnn.benchmark = config.CUDNN.BENCHMARK
    torch.backends.cudnn.deterministic = config.CUDNN.DETERMINISTIC
    torch.backends.cudnn.enabled = config.CUDNN.ENABLED

    model = eval('models.'+config.MODEL.NAME+'.get_pose_net')(
        config, is_train=False
    )

    if config.TEST.MODEL_FILE:
        logger.info('=> loading model from {}'.format(config.TEST.MODEL_FILE))
        model.load_state_dict(torch.load(config.TEST.MODEL_FILE))
    else:
        model_state_file = os.path.join(final_output_dir,
                                        'final_state.pth.tar')
        logger.info('=> loading model from {}'.format(model_state_file))
        model.load_state_dict(torch.load(model_state_file))

    gpus = [int(i) for i in config.GPUS.split(',')]
    model = torch.nn.DataParallel(model, device_ids=gpus).cuda()

    # define loss function (criterion) and optimizer
    criterion = JointsMSELoss(
        use_target_weight=config.LOSS.USE_TARGET_WEIGHT
    ).cuda()

    # Data loading code
    normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                     std=[0.229, 0.224, 0.225])
    valid_dataset = eval('dataset.'+config.DATASET.DATASET)(
        config,
        config.DATASET.ROOT,
        config.DATASET.TEST_SET,
        False,
        transforms.Compose([
            transforms.ToTensor(),
            normalize,
        ])
    )
    valid_loader = torch.utils.data.DataLoader(
        valid_dataset,
        batch_size=config.TEST.BATCH_SIZE*len(gpus),
        shuffle=False,
        num_workers=config.WORKERS,
        pin_memory=True
    )

    # evaluate on validation set
    validate(config, valid_loader, valid_dataset, model, criterion,
             final_output_dir, tb_log_dir)


if __name__ == '__main__':
    main()
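Evaluation mirrors training: e.g. `python pose_estimation/valid.py --cfg <experiment.yaml> --flip-test --model-file <model_best.pth.tar>` (paths illustrative). Without `--model-file`, the script falls back to `final_state.pth.tar` in the experiment's output directory, per the `else` branch above.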
@@ -0,0 +1,9 @@
EasyDict==1.7
opencv-python==3.4.1.15
Cython
scipy
pandas
pyyaml
json_tricks
scikit-image
tensorboardX
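These dependencies install with `pip install -r requirements.txt`; the pins on `EasyDict` and `opencv-python` match the dotted config access and `cv2` calls used throughout the code above.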