Mirror of https://github.com/microsoft/EdgeML.git
Shiftry Release Code (#213)
* Copied new seedot files
* Added batching for compiling large programs without blowing up memory
* Updated README.md with a FastGRNN example
* Fixed copyright messages, included m3 codegen files
* Removed unnecessary commented code
* Added high-level comments and added dlx executables
* Fixed formatting of C/C++ files
* Added architecture diagram
* Added tutorial on extending the compiler
* Fixed typo
* Fixed scale nomenclature and removed redundant function arguments in MatVec code
* Consolidated c_reference/ and m3/ library codes
* Revised C and C++ code to the agreed convention
* Removed redundant M3 library and added TODO for memory management
* Re-indexed sparse model according to optimal strategy
* Fixed redundant array issue
* Fixed stray spaces and grammatical issues
* Made multi-dimensional arrays uni-dimensional for M3
* Hacky fix for Widx and Uidx
* Fixed grammar and removed unused code
* Made SeeDot directly runnable
* Removed stray spaces
* Fixed grammar
* Fixed comments
* Incorporated changes from Aayan
* Fixed x86 pipeline on FirstFitPriority
* Optimized x86 pipeline on FirstFitPriority
* Added Wno-narrowing flag to x86 codegen

Prettier Printing (#1)
* Replaced print statements with logging (error, warning, info and debug levels used). Added a progress bar using the tqdm package in the performSearch function for the 4 stages of exploration. Changed the ANTLR check version to 4.9.1. Added requirements.txt.
* Added more information to the print statement that reports the stage of exploration.
* Added log level as a command-line argument. Added descriptive error messages.
* Reverted parameters to old state.
* Slight changes to the strings printed by logging.
* Resolved formatting comments.
* Updated README.md with the command to change the logging level.
* Converted the log-level input arguments to lowercase strings.
* Minor improvements.

Changes to improve README.md in SeeDot and respective modifications (#2)
* Updated requirements and README.
* Replaced the version argument with an encoding argument; removed the dependency on Results.csv.
* README formatting change.
* Changed the help strings in the argument parser; renamed rnn to fastgrnn.
* Some more changes to SeeDot-dev.py for the argument parser. Added a line showing that the available fastgrnn model belongs to the usps10 dataset.
* Added descriptive comments to Python files except the ONNX and TF folders.
* Removed an unnecessary comma in launch.json.
* Corrected default SeeDot values in README.md.
* Minor changes to improve README.md's readability.
* Added legacy_scales.csv and restored the legacy_scales processor in SeeDot-dev.py.
* Replaced the rnn keyword with fastgrnn in launch.json.
* Changed all comments to a single format.
* Added face detection run instructions. Removed the lsf flag from the README and suppressed it in help.
* Replaced maximisingMetric with metric in main.py, config.py and test.py, with the respective changes in README.md.
* Removed instructions to run face detection.
* Started architecture.md and completed iteration 1; added points and included suggestions from the reviewer.
* Improved line spacing; removed an incorrect line.
* Replaced architecture.svg.
* Replaced version with encoding.
* Made FastGRNN on usps10 with reduced disagreements the default for SeeDot.

Added instructions to fetch face detection dataset and licenses to newly added files (#3)
* Added rnnpool.sd to the folder of seedot models.
* Added faceDetection code fetcher.
* Added face detection run instructions and licenses.
* Updated face detection instructions and readme.md; minor corrections to readmes.
* Added rnnpool as an option in SeeDot-dev.py.
* Added support for the face-2 dataset.
* Fixed an error in cleanup in fetchFDDataset.py.
* Added information about the input to Predictor in faceDetection.md.
* Incorporated reviewer comments and fixed typos in faceDetection.md.
* Added code to run quantized face detection on images, and instructions to run it on extra images in faceDetection.md.
* Added argument support to scale_image.py.
* Added info about the output of eval_fromquant in faceDetection.md.
* Added instructions to fetch an image in faceDetection.md.
* Added a load_sf condition to printing PASS or FAIL.
* Added instructions to install libraries for detecting bounding boxes.
* Minor changes to faceDetection.md and main.cpp.
* Removed duplicate files from the repository and added instructions to copy them for running image-to-text converters and vice versa.

Changes to face detection readme (#4)
* Updated faceDetection.md.
* Added trace to the seedot conversion scripts; fixed trace dimensions in the convert_to_seedot files.
* Removed torch from the requirements.txt file.
* Updated the face detection readme and faceDetection.md with corrected instructions to install edgeml_pytorch.
* Reset the changes made to requirements-cpu.txt and requirements-gpu.txt in EdgeML/pytorch.
* Added multi-line commands to faceDetection.md.
* Added instructions for SeeDot installation as well.
* Some more refactoring.
* Merged the two convert_to_seedot scripts into one; fixed a typo in convert_RPool_Face_to_SeeDot.
* Minor additions to faceDetection.md; minor nitpicks.

Fixed copy instructions in faceDetection.md (#5)
* Fixed the location of copying from SCUT_Head_Part_B.
* Minor addition to faceDetection.md.

GCC version check added to build.py (#6)
* Added a GCC version restriction to seedot/Predictor/Makefile.
* Added a GCC version check to SeeDot.
* Removed an extra newline.

Data processing (#7)
* Added a data preprocessing line.

The script to fix ranges (#8)
* Added CPU conversion in convert_RPool_Face_to_SeeDot.py.
* Added a script to process seedot input files.
* Increased the precision of floating-point printing in fixSeeDotInput.py.
* Updated instructions to update ranges via the script.
* Added a newline at the end of fixSeeDotInput.py.

ONNX dependency added to README (#9)
* Added ONNX as a dependency in the README.
* Minor fixes; improved the README.

Co-authored-by: ShikharJ <jaiswalshikhar87@gmail.com>
Co-authored-by: G Rahul Kranti Kiran <krantikiran.68@gmail.com>
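The GCC version check added to build.py in PR (#6) could look roughly like the sketch below. This is an illustration only — the function names and the minimum-version threshold are assumptions, not the actual build.py code:

```python
import re

def parse_gcc_major(version_output):
    """Extract the major version from `gcc -dumpversion` output, e.g. '9.4.0' -> 9."""
    match = re.match(r"(\d+)", version_output.strip())
    if match is None:
        raise ValueError("unrecognized gcc version string: %r" % version_output)
    return int(match.group(1))

def check_gcc(version_output, min_major=8):
    """Return True if the reported gcc meets the (assumed) minimum major version."""
    return parse_gcc_major(version_output) >= min_major
```

In practice the version string would come from the compiler itself, e.g. `subprocess.run(["gcc", "-dumpversion"], capture_output=True, text=True).stdout`, with the build aborting when `check_gcc` returns False.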
Parent: c7349dd67e
Commit: ef9f8a77f0
@@ -58,34 +58,30 @@ typedef struct Q15_FastGRNN_LR_Scales {
  SCALE_T meanSub;
  SCALE_T stdDev;
  SCALE_T normFeaturesHDStdDev;
  SCALE_T W1;
  SCALE_T w1;
  SCALE_T normFeaturesMVW1;
  SCALE_T H1W1;
  SCALE_T H2W1;
  SCALE_T W2;
  SCALE_T mVW1Out;
  SCALE_T w2;
  SCALE_T tempLRW;
  SCALE_T H1W2;
  SCALE_T H2W2;
  SCALE_T U1;
  SCALE_T mVW2Out;
  SCALE_T u1;
  SCALE_T hiddenStateMVU1;
  SCALE_T H1U1;
  SCALE_T H2U1;
  SCALE_T U2;
  SCALE_T mVU1Out;
  SCALE_T u2;
  SCALE_T tempLRU;
  SCALE_T H1U2;
  SCALE_T H2U2;
  SCALE_T mVU2Out;
  SCALE_T mV2AddMV4;
  SCALE_T mV4AddMV2;
  SCALE_T mV2AddMV4Out;
  SCALE_T mV2AddMV4Demote;
  SCALE_T pC1AddBg;
  SCALE_T Bg;
  SCALE_T bg;
  SCALE_T pC1AddBgOut;
  SCALE_T pC1AddBgDemote;
  SCALE_T sigmoidScaleIn;
  SCALE_T sigmoidScaleOut;
  SCALE_T pC1AddBh;
  SCALE_T Bh;
  SCALE_T bh;
  SCALE_T pC1AddBhOut;
  SCALE_T pC1AddBhDemote;
  SCALE_T tanhScaleIn;
@@ -220,26 +216,24 @@ typedef struct Q15_FastGRNN_Scales {
   SCALE_T meanSub;
   SCALE_T stdDev;
   SCALE_T normFeaturesHDStdDev;
-  SCALE_T W;
+  SCALE_T w;
   SCALE_T normFeaturesMVW;
-  SCALE_T H1W;
-  SCALE_T H2W;
-  SCALE_T U;
+  SCALE_T mVWOut;
+  SCALE_T u;
   SCALE_T hiddenStateMVU;
-  SCALE_T H1U;
-  SCALE_T H2U;
+  SCALE_T mVUOut;
   SCALE_T mV1AddMV2;
   SCALE_T mV2AddMV1;
   SCALE_T mV1AddMV2Out;
   SCALE_T mV1AddMV2Demote;
   SCALE_T pC1AddBg;
-  SCALE_T Bg;
+  SCALE_T bg;
   SCALE_T pC1AddBgOut;
   SCALE_T pC1AddBgDemote;
   SCALE_T sigmoidScaleIn;
   SCALE_T sigmoidScaleOut;
   SCALE_T pC1AddBh;
-  SCALE_T Bh;
+  SCALE_T bh;
   SCALE_T pC1AddBhOut;
   SCALE_T pC1AddBhDemote;
   SCALE_T tanhScaleIn;
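The structs above carry one SCALE_T member per operand of each fixed-point operation. As a purely illustrative assumption (not the library's definition), the sketch below shows scale factors acting as divisors that bring two Q15 operands to a common range before an element-wise add, mirroring the per-operand scale arguments that q15_v_add takes in the code further down; `v_add_sketch` is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

typedef int16_t Q15_T;
typedef uint16_t SCALE_T;

/* Illustrative assumption: each scale factor divides its operand so both
 * addends land in the same fixed-point range before the add. */
void v_add_sketch(const Q15_T* a, const Q15_T* b, uint16_t len, Q15_T* out,
                  SCALE_T sca, SCALE_T scb) {
  for (uint16_t i = 0; i < len; i++) {
    out[i] = (Q15_T)(a[i] / sca + b[i] / scb);
  }
}
```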
@@ -268,6 +268,24 @@ void q15_v_scale_up(const Q15_T* vec, ITER_T len, Q15_T* ret, SCALE_T scvec);
  */
 void q15_v_scale_down(const Q15_T* vec, ITER_T len, Q15_T* ret, SCALE_T scvec);

+/**
+ * @brief Performs the row-order or the column-order reversal of the 2-D input matrix.
+ * @param[in] mat pointer to the (row / column-major) input matrix on which reversal is to be performed
+ * @param[in] nrows number of rows of the input matrix
+ * @param[in] ncols number of columns of the input matrix
+ * @param[in] axis axis of reversal; 0 for reversal along rows and 1 for reversal along columns
+ * @param[out] ret pointer to the output matrix
+ * @return none
+ * @example mat = { {1, 2},
+ *                  {4, 5} }
+ *          nrows = 2
+ *          ncols = 2
+ *          axis = 0
+ *          ret = { {4, 5},
+ *                  {1, 2} }
+ */
+void q15_m_reverse(const Q15_T* const mat, ITER_T nrows, ITER_T ncols,
+                   ITER_T axis, Q15_T* const ret);
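The `@example` above pins down the reversal semantics; for clarity, here is a minimal plain-C sketch of that contract. The typedefs mirror the library's 16-bit types as an assumption, and `m_reverse_sketch` is a hypothetical name, not the library kernel.

```c
#include <assert.h>
#include <stdint.h>

typedef int16_t Q15_T;
typedef uint16_t ITER_T;

/* Sketch of the q15_m_reverse contract: axis = 0 reverses the order of the
 * rows, axis = 1 reverses the order of the columns of a row-major matrix. */
void m_reverse_sketch(const Q15_T* mat, ITER_T nrows, ITER_T ncols,
                      ITER_T axis, Q15_T* ret) {
  for (ITER_T i = 0; i < nrows; i++) {
    for (ITER_T j = 0; j < ncols; j++) {
      ITER_T src_i = (axis == 0) ? (ITER_T)(nrows - 1 - i) : i;
      ITER_T src_j = (axis == 1) ? (ITER_T)(ncols - 1 - j) : j;
      ret[i * ncols + j] = mat[src_i * ncols + src_j];
    }
  }
}
```

On the documented example (a 2x2 matrix with axis = 0), this swaps the two rows, matching the `ret` shown above.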
 /**
  * @brief Performs the matrix multiplication of a matrix and a vector.
  * @param[in] mat pointer to input matrix in row-major order

@@ -277,9 +295,7 @@ void q15_v_scale_down(const Q15_T* vec, ITER_T len, Q15_T* ret, SCALE_T scvec);
  * @param[out] ret pointer to the output vector
  * @param[in] scmat scale factor of the input matrix
  * @param[in] scvec scale factor of the input vector
- * @param[in] H1 depth parameter for division-by-two used in TreeSum
- * @param[in] H2 depth parameter for direct sum used in TreeSum
+ * @param[in] scret scale factor of the output vector
  * @return none
  * @example mat = { {7069, -10389, 1562, -1992},
  *                  {3262, -37, -1143, -995},
@@ -294,16 +310,15 @@ void q15_v_scale_down(const Q15_T* vec, ITER_T len, Q15_T* ret, SCALE_T scvec);
  *          ncols = 4
  *          scmat = 128
  *          scvec = 64
- *          H1 = 2
- *          H2 = 0
+ *          scret = 2
  *          ret = {-425, -169, -3534, 524, -2739, 87, 52, 292}
  */
 void q15xq7_q15_m_mulvec(const Q15_T* mat, const Q7_T* const vec, ITER_T nrows,
                          ITER_T ncols, Q15_T* ret, SCALE_T scmat,
-                         SCALE_T scvec, SCALE_T H1, SCALE_T H2);
+                         SCALE_T scvec, SCALE_T scret);
 void q15_m_mulvec(const Q15_T* mat, const Q15_T* const vec, ITER_T nrows,
                   ITER_T ncols, Q15_T* ret, SCALE_T scmat, SCALE_T scvec,
-                  SCALE_T H1, SCALE_T H2);
+                  SCALE_T scret);
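The signature change above replaces the TreeSum depth parameters H1/H2 with a single output scale scret. As a hedged sketch only: assuming the scale factors act as divisors on the Q15 values, a plain-C version of the scaled matrix-vector product looks like the following (the real kernel accumulates with TreeSum, so its rounding behavior differs; `m_mulvec_sketch` is a hypothetical name).

```c
#include <assert.h>
#include <stdint.h>

typedef int16_t Q15_T;
typedef uint16_t ITER_T;
typedef uint16_t SCALE_T;

/* Sketch: each product is down-scaled by scmat * scvec and each row sum by
 * scret; a 32-bit accumulator stands in for the library's TreeSum. */
void m_mulvec_sketch(const Q15_T* mat, const Q15_T* vec, ITER_T nrows,
                     ITER_T ncols, Q15_T* ret, SCALE_T scmat, SCALE_T scvec,
                     SCALE_T scret) {
  for (ITER_T i = 0; i < nrows; i++) {
    int32_t sum = 0;
    for (ITER_T j = 0; j < ncols; j++) {
      sum += ((int32_t)mat[i * ncols + j] * (int32_t)vec[j])
             / ((int32_t)scmat * (int32_t)scvec);
    }
    ret[i] = (Q15_T)(sum / (int32_t)scret);
  }
}
```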
 /**
  * @brief Performs sparse matrix multiplication of a matrix and a vector.
  * row_indices and mat_values combined are a sparse representation; dim(vec) = [ncols].

@@ -312,13 +327,11 @@ void q15_m_mulvec(const Q15_T* mat, const Q15_T* const vec, ITER_T nrows,
  * @param[in] row_indices pointer to input matrix which stores the row indices of non-zero values of matrix A
  * @param[in] mat_values pointer to input matrix which stores the non-zero values of matrix A
  * @param[in] vec pointer to the input vector
- * @param[in] nrows number of rows of the input matrix
- * @param[in] ncols number of columns of the input matrix
+ * @param[in] nelem number of elements in the input vector
  * @param[out] ret pointer to the output vector
  * @param[in] scmat scale factor of the input matrix
  * @param[in] scvec scale factor of the input vector
- * @param[in] H1 depth parameter for division-by-two used in TreeSum
- * @param[in] H2 depth parameter for direct sum used in TreeSum
+ * @param[in] scret scale factor of the output vector
  * @return none
  * @example mat = { {23, 32, 0},
  *                  {0, 0, 1},
@@ -326,23 +339,19 @@ void q15_m_mulvec(const Q15_T* mat, const Q15_T* const vec, ITER_T nrows,
  *          row_indices = {1, 3, 0, 1, 0, 2, 0}
  *          mat_values = {23, 48, 32, 1}
  *          vec = {1, 2, 3}
- *          nrows = 3
- *          ncols = 3
+ *          nelem = 3
  *          scmat = 1
  *          scvec = 1
- *          H1 = 1
- *          H2 = 0
+ *          scret = 1
  *          ret = {87, 3, 48}
  */
 void q15xq7_q15_m_sparse_mulvec(const ITER_T* row_indices,
                                 const Q15_T* mat_values, const Q7_T* vec,
-                                ITER_T nrows, ITER_T ncols, Q15_T* ret,
-                                SCALE_T scmat, SCALE_T scvec, SCALE_T H1,
-                                SCALE_T H2);
+                                ITER_T nelem, Q15_T* ret, SCALE_T scmat,
+                                SCALE_T scvec, SCALE_T scret);
 void q15_m_sparse_mulvec(const ITER_T* row_indices, const Q15_T* mat_values,
-                         const Q15_T* vec, ITER_T nrows, ITER_T ncols,
-                         Q15_T* ret, SCALE_T scmat, SCALE_T scvec, SCALE_T H1,
-                         SCALE_T H2);
+                         const Q15_T* vec, ITER_T nelem, Q15_T* ret,
+                         SCALE_T scmat, SCALE_T scvec, SCALE_T scret);
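The `@example` above fully determines the sparse encoding ("Re-Index Sparse Model According to Optimal Strategy" in the commit list): for each of the nelem input-vector entries (i.e. matrix columns), row_indices lists the 1-based row positions of that column's non-zeros terminated by a 0, and mat_values holds the matching values. A minimal plain-C sketch of that walk, with all scales taken as 1 and ret zero-initialized by the caller, as the memset calls in quantized_fastgrnn.c do before invoking these kernels; `sparse_mulvec_sketch` is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

typedef int16_t Q15_T;
typedef uint16_t ITER_T;

/* Column-major sparse mat-vec: one 0-terminated run of 1-based row indices
 * per vector element; mat_values aligns with the non-zero entries only. */
void sparse_mulvec_sketch(const ITER_T* row_indices, const Q15_T* mat_values,
                          const Q15_T* vec, ITER_T nelem, Q15_T* ret) {
  ITER_T k = 0, v = 0;  /* cursors into row_indices and mat_values */
  for (ITER_T j = 0; j < nelem; j++) {
    while (row_indices[k] != 0) {
      ret[row_indices[k] - 1] += (Q15_T)(mat_values[v] * vec[j]);
      k++; v++;
    }
    k++;  /* skip the 0 terminator for this column */
  }
}
```

Running this on the documented example (row_indices = {1, 3, 0, 1, 0, 2, 0}, mat_values = {23, 48, 32, 1}, vec = {1, 2, 3}) reproduces ret = {87, 3, 48}.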
 /**
  * @brief Performs the element-wise addition of two input tensors.
@@ -4,7 +4,7 @@
 include ../config.mk

 INCLUDE_DIR=../include
-IFLAGS = -I $(INCLUDE_DIR)
+IFLAGS=-I $(INCLUDE_DIR)

 all: quantized_face_detection.o quantized_face_detection_fast.o quantized_face_detection_sparse.o
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:455faca056eaec8069e17ed4829c5f6f4e513c972fd569e659c08e5a95b03d2e
-size 5908
+oid sha256:40a14a8e0692894848e97826a965dbcdce270ce9233e178e82106a2178662fcd
+size 5864

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:2132f05705188ae5008d6f43120c625c5995bb4133aff4327e4232edeb3baac1
-size 7073
+oid sha256:eb84c467c95720afed79bbcb211edb846508ac45463dc7155b794056819f2191
+size 7029

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:fbe2ceaa3e3cce2a7fc844311b0be170fd703f9ed0cb9ac4f41ecfc600bb0f53
-size 5928
+oid sha256:7572877a2e8e46ab9a9cba9ed70d162169cd7890acc9e86b93bc8659c99adc5d
+size 5884

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:ff0b4a711a06ed71e8bbf1682c4d9f7ee17d25cca7f291c9a4b3347daae52e6f
-size 7017
+oid sha256:ca6d9233acda744fe862df42c4ecb56c96770d215c1654b1964efe7c04ec3a5d
+size 6973

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:f8144562a7d7cf9bba5f7275695ea5a29fa8b9b58e58afa88c0842009375352d
-size 7161
+oid sha256:c2f905707060b677d647588826f979be5615fbdd76e4140dceb3a601e9506f0b
+size 7117

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:d246659add6478ae9f4daf6fb314867231ef1d2dd5c80251f3a02d9b8525d232
-size 8352
+oid sha256:0db39a84fc660f0b9f0411d0c47ee64e2edeec7bc16cac8de33186471c67135d
+size 8308
@@ -23,36 +23,36 @@

 void q_face_detection_sparse(char* const mem_buf) {
   // Conv2D Sub-Pipeline
-  q7xq15_q7_convolution((Q7_T*)mem_buf, CBR1F, (Q7_T*)(mem_buf + 76800),
+  q7xq15_q7_convolution((Q7_T*)(mem_buf + 76800), CBR1F, (Q7_T*)mem_buf,
     CONV2D_N, CBR1F_H, CBR1F_W, CBR1F_CIN, CBR1F_HF, CBR1F_WF, CBR1F_CF,
     CONV2D_COUT, CONV2D_HOUT, CONV2D_WOUT, CBR1F_G, CBR1F_HPADL, CBR1F_HPADR,
     CBR1F_WPADL, CBR1F_WPADR, CBR1F_HSTRIDE, CBR1F_WSTRIDE, CBR1F_HDILATION,
     CBR1F_WDILATION, CBR1F_Scinput, CBR1F_Scoutput, CBR1F_Demote);

-  q7xq15_q7_t_add_vec((Q7_T*)(mem_buf + 76800), CBR1B, CONV2D_N, CONV2D_HOUT,
-    CONV2D_WOUT, CONV2D_COUT, (Q7_T*)(mem_buf + 76800), CBR1B_Scten, CBR1B_Scvec,
+  q7xq15_q7_t_add_vec((Q7_T*)mem_buf, CBR1B, CONV2D_N, CONV2D_HOUT,
+    CONV2D_WOUT, CONV2D_COUT, (Q7_T*)mem_buf, CBR1B_Scten, CBR1B_Scvec,
     CBR1B_Scret);

-  q7xq15_q7_convolution((Q7_T*)(mem_buf + 76800), CBR1W, (Q7_T*)mem_buf, CONV2D_N,
+  q7xq15_q7_convolution((Q7_T*)mem_buf, CBR1W, (Q7_T*)mem_buf, CONV2D_N,
     CONV2D_HOUT, CONV2D_WOUT, CONV2D_COUT, CBR1W_HF, CBR1W_WF, CBR1W_CF,
     CBR1W_COUT, CONV2D_HOUT, CONV2D_WOUT, CBR1W_G, CBR1W_HPADL, CBR1W_HPADR,
     CBR1W_WPADL, CBR1W_WPADR, CBR1W_HSTRIDE, CBR1W_WSTRIDE, CBR1W_HDILATION,
     CBR1W_WDILATION, CBR1W_Scinput, CBR1W_Scoutput, CBR1W_Demote);

   q7_t_relu((Q7_T*)mem_buf, CONV2D_N, CONV2D_HOUT, CONV2D_WOUT, CONV2D_COUT,
-    (Q7_T*)mem_buf, CONV2D_Limit, CONV2D_Div);
+    (Q7_T*)(mem_buf + 76800), CONV2D_Limit, CONV2D_Div);

   Q7_T* mem_buf_offset_q7 = (Q7_T*)mem_buf;
   Q15_T* mem_buf_offset_q15 = (Q15_T*)mem_buf;

   // RNNPool Sub-Pipeline
-  memset((mem_buf + 76800), 0, sizeof(Q7_T) * 76800);
-  memset((mem_buf + 153600), 0, sizeof(Q15_T));
-  memset((mem_buf + 153602), 0, sizeof(Q15_T));
+  memset(mem_buf, 0, sizeof(Q7_T) * 76800);
+  memset((mem_buf + 155392), 0, sizeof(Q15_T));
+  memset((mem_buf + 155648), 0, sizeof(Q15_T));

   for (ITER_T patch_x = 0; (patch_x < 29); patch_x++) {
     for (ITER_T patch_y = 0; (patch_y < 39); patch_y++) {
-      q7xq15_q15_rnnpool_block((Q7_T*)(mem_buf + ((2560 * patch_x) + (16 * patch_y))),
+      q7xq15_q15_rnnpool_block((Q7_T*)(mem_buf + 76800 + ((2560 * patch_x) + (16 * patch_y))),
         INPUT_CHANNELS, PATCH_DIM, CONV2D_WOUT, q7xq15_q15_fastgrnn, HIDDEN_DIM1,
         (const void*)(&RNN1_PARAMS), (void*)(&RNN1_BUFFERS),
         (const void*)(&RNN1_SCALES), q15_fastgrnn, HIDDEN_DIM2,
@@ -61,39 +61,39 @@ void q_face_detection_sparse(char* const mem_buf) {
         (Q15_T*)(mem_buf + 153900), ShR1, ShL1, ShR2, ShL2);

       for (ITER_T i = 0; i < 64; i++) {
-        mem_buf_offset_q7[76800 + patch_x * 2560 + patch_y * 64 + i] = (Q7_T)(mem_buf_offset_q15[76875 + i]);
+        mem_buf_offset_q7[patch_x * 2560 + patch_y * 64 + i] = (Q7_T)(mem_buf_offset_q15[76875 + i]);
       }
     }
   }

-  memcpy(&mem_buf_offset_q7[76800 + 29 * 2560], &mem_buf_offset_q7[76800 + 28 * 2560],
+  memcpy(&mem_buf_offset_q7[29 * 2560], &mem_buf_offset_q7[28 * 2560],
     39 * 64 * sizeof(Q7_T));
   for (ITER_T i = 0; i < 30; i++) {
-    memcpy(&mem_buf_offset_q7[76800 + 39 * 64 + i * 2560],
-      &mem_buf_offset_q7[76800 + 38 * 64 + i * 2560], 64 * sizeof(Q7_T));
+    memcpy(&mem_buf_offset_q7[39 * 64 + i * 2560],
+      &mem_buf_offset_q7[38 * 64 + i * 2560], 64 * sizeof(Q7_T));
   }

   // MBConv Sub-Pipeline
   // MBConv Layer 1
-  q7xq15_q15_mbconv_block((Q7_T*)(mem_buf + 76800), L1_F1, L1_W1, L1_B1, L1_F2,
-    L1_W2, L1_B2, L1_F3, L1_W3, L1_B3, (Q15_T*)mem_buf,
-    (Q15_T*)(mem_buf + 153600), (Q15_T*)(mem_buf + 184832), L1_N, L1_H, L1_W,
+  q7xq15_q15_mbconv_block((Q7_T*)mem_buf, L1_F1, L1_W1, L1_B1, L1_F2,
+    L1_W2, L1_B2, L1_F3, L1_W3, L1_B3, (Q15_T*)(mem_buf + 76800),
+    (Q15_T*)(mem_buf + 153600), (Q15_T*)(mem_buf + 184320), L1_N, L1_H, L1_W,
     L1_CIN, L1_CTEMP, L1_HF, L1_WF, L1_COUT, L1_HOUT, L1_WOUT, L1_HPADL,
     L1_HPADR, L1_WPADL, L1_WPADR, L1_HSTRIDE, L1_WSTRIDE, L1_Limit1, L1_Limit2,
     L1_ShRU1, L1_ShRX1, L1_ShRU2, L1_ShRX2, L1_ShRU3, L1_ShRW3, L1_ShLU1,
     L1_ShLX1, L1_ShLU2, L1_ShLX2, L1_ShLU3, L1_ShLW3);

   // Detection Layer 1 Sub-Pipeline
-  q15_t_l2_norm((Q15_T*)mem_buf, L1_N, L1_HOUT, L1_WOUT, L1_COUT,
-    (Q15_T*)(mem_buf + 76800), D1_ScaleIn, D1_ScaleOut);
+  q15_t_l2_norm((Q15_T*)(mem_buf + 76800), L1_N, L1_HOUT, L1_WOUT, L1_COUT,
+    (Q15_T*)mem_buf, D1_ScaleIn, D1_ScaleOut);

-  q15_convolution((Q15_T*)(mem_buf + 76800), D1NW, (Q15_T*)(mem_buf + 76800),
+  q15_convolution((Q15_T*)mem_buf, D1NW, (Q15_T*)mem_buf,
     L1_N, L1_HOUT, L1_WOUT, L1_COUT, D1NW_HF, D1NW_WF, D1NW_CF, D1NW_COUT,
     L1_HOUT, L1_WOUT, D1NW_G, D1NW_HPADL, D1NW_HPADR, D1NW_WPADL, D1NW_WPADR,
     D1NW_HSTRIDE, D1NW_WSTRIDE, D1NW_HDILATION, D1NW_WDILATION, D1NW_Scinput,
     D1NW_Scoutput, D1NW_Demote);

-  q15_convolution((Q15_T*)(mem_buf + 76800), D1CW, (Q15_T*)(mem_buf + 153600), L1_N,
+  q15_convolution((Q15_T*)mem_buf, D1CW, (Q15_T*)(mem_buf + 153600), L1_N,
     L1_HOUT, L1_WOUT, D1NW_COUT * D1NW_G, D1CW_HF, D1CW_WF, D1CW_CF, D1CW_COUT,
     L1_HOUT, L1_WOUT, D1CW_G, D1CW_HPADL, D1CW_HPADR, D1CW_WPADL, D1CW_WPADR,
     D1CW_HSTRIDE, D1CW_WSTRIDE, D1CW_HDILATION, D1CW_WDILATION, D1CW_Scinput,
@@ -102,158 +102,158 @@ void q_face_detection_sparse(char* const mem_buf) {
   q15_t_add_vec((Q15_T*)(mem_buf + 153600), D1CB, L1_N, L1_HOUT, L1_WOUT,
     D1CW_COUT, (Q15_T*)(mem_buf + 153600), D1CB_Scten, D1CB_Scvec, D1CB_Scret);

-  q15_convolution((Q15_T*)(mem_buf + 76800), D1LW, (Q15_T*)(mem_buf + 163200),
+  q15_convolution((Q15_T*)mem_buf, D1LW, (Q15_T*)(mem_buf + 168960),
     L1_N, L1_HOUT, L1_WOUT, D1NW_COUT * D1NW_G, D1LW_HF, D1LW_WF, D1LW_CF,
     D1LW_COUT, L1_HOUT, L1_WOUT, D1LW_G, D1LW_HPADL, D1LW_HPADR, D1LW_WPADL,
     D1LW_WPADR, D1LW_HSTRIDE, D1LW_WSTRIDE, D1LW_HDILATION, D1LW_WDILATION,
     D1LW_Scinput, D1LW_Scoutput, D1LW_Demote);

-  q15_t_add_vec((Q15_T*)(mem_buf + 163200), D1LB, L1_N, L1_HOUT, L1_WOUT,
-    D1LW_COUT, (Q15_T*)(mem_buf + 163200), D1LB_Scten, D1LB_Scvec, D1LB_Scret);
+  q15_t_add_vec((Q15_T*)(mem_buf + 168960), D1LB, L1_N, L1_HOUT, L1_WOUT,
+    D1LW_COUT, (Q15_T*)(mem_buf + 168960), D1LB_Scten, D1LB_Scvec, D1LB_Scret);

-  memset((mem_buf_offset_q15 + 38400), 0, sizeof(Q15_T) * 2400);
-  memset((mem_buf_offset_q15 + 40800), 0, sizeof(Q15_T) * 1);
-  memset((mem_buf_offset_q15 + 40801), 0, sizeof(Q15_T) * 1);
+  memset((mem_buf_offset_q15 + 89344), 0, sizeof(Q15_T) * 2400);
+  memset((mem_buf_offset_q15 + 256), 0, sizeof(Q15_T) * 1);
+  memset((mem_buf_offset_q15 + 384), 0, sizeof(Q15_T) * 1);

   for (ITER_T i = 0; i < 30; i++) {
     for (ITER_T j = 0; j < 40; j++) {
       for (ITER_T k = 0; k < 3; k++) {
-        mem_buf_offset_q15[40805 + k] = mem_buf_offset_q15[76800 + (i * 160 + j * 4 + k)];
+        mem_buf_offset_q15[128 + k] = mem_buf_offset_q15[76800 + (i * 160 + j * 4 + k)];
       }

       ITER_T index;
-      q15_v_argmax(&mem_buf_offset_q15[40805], 3, &index);
+      q15_v_argmax(&mem_buf_offset_q15[128], 3, &index);

-      mem_buf_offset_q15[38400 + (i * 80 + j * 2)] = mem_buf_offset_q15[76800 + (i * 160 + j * 4 + index)];
-      mem_buf_offset_q15[38400 + (i * 80 + j * 2 + 1)] = mem_buf_offset_q15[76800 + (i * 160 + j * 4 + 3)];
+      mem_buf_offset_q15[89344 + (i * 80 + j * 2)] = mem_buf_offset_q15[76800 + (i * 160 + j * 4 + index)];
+      mem_buf_offset_q15[89344 + (i * 80 + j * 2 + 1)] = mem_buf_offset_q15[76800 + (i * 160 + j * 4 + 3)];
     }
   }

   // MBConv Layer 2
-  q15_mbconv_block((Q15_T*)mem_buf, L2_F1, L2_W1, L2_B1, L2_F2,
-    L2_W2, L2_B2, L2_F3, L2_W3, L2_B3, (Q15_T*)(mem_buf + 81600),
-    (Q15_T*)(mem_buf + 172800), (Q15_T*)(mem_buf + 158656), L2_N, L2_H, L2_W,
+  q15_mbconv_block((Q15_T*)(mem_buf + 76800), L2_F1, L2_W1, L2_B1, L2_F2,
+    L2_W2, L2_B2, L2_F3, L2_W3, L2_B3, (Q15_T*)mem_buf,
+    (Q15_T*)(mem_buf + 153600), (Q15_T*)(mem_buf + 183808), L2_N, L2_H, L2_W,
     L2_CIN, L2_CTEMP, L2_HF, L2_WF, L2_COUT, L2_HOUT, L2_WOUT, L2_HPADL,
     L2_HPADR, L2_WPADL, L2_WPADR, L2_HSTRIDE, L2_WSTRIDE, L2_Limit1, L2_Limit2,
     L2_ShRU1, L2_ShRX1, L2_ShRU2, L2_ShRX2, L2_ShRU3, L2_ShRW3, L2_ShLU1,
     L2_ShLX1, L2_ShLU2, L2_ShLX2, L2_ShLU3, L2_ShLW3);

   // MBConv1 + MBConv2
-  q15_t_add((Q15_T*)mem_buf, (Q15_T*)(mem_buf + 81600), L2_N, L2_HOUT,
-    L2_WOUT, L2_COUT, (Q15_T*)mem_buf, L2_Scten1, L2_Scten2, L2_Scret);
+  q15_t_add((Q15_T*)(mem_buf + 76800), (Q15_T*)mem_buf, L2_N, L2_HOUT,
+    L2_WOUT, L2_COUT, (Q15_T*)(mem_buf + 76800), L2_Scten1, L2_Scten2, L2_Scret);

   // Detection Layer 2 Sub-Pipeline
-  q15_t_l2_norm((Q15_T*)mem_buf, L2_N, L2_HOUT, L2_WOUT, L2_COUT,
-    (Q15_T*)(mem_buf + 81600), D2_ScaleIn, D2_ScaleOut);
+  q15_t_l2_norm((Q15_T*)(mem_buf + 76800), L2_N, L2_HOUT, L2_WOUT, L2_COUT,
+    (Q15_T*)mem_buf, D2_ScaleIn, D2_ScaleOut);

-  q15_convolution((Q15_T*)(mem_buf + 81600), D2NW, (Q15_T*)(mem_buf + 81600),
+  q15_convolution((Q15_T*)mem_buf, D2NW, (Q15_T*)mem_buf,
     L2_N, L2_HOUT, L2_WOUT, L2_COUT, D2NW_HF, D2NW_WF, D2NW_CF, D2NW_COUT,
     L2_HOUT, L2_WOUT, D2NW_G, D2NW_HPADL, D2NW_HPADR, D2NW_WPADL, D2NW_WPADR,
     D2NW_HSTRIDE, D2NW_WSTRIDE, D2NW_HDILATION, D2NW_WDILATION, D2NW_Scinput,
     D2NW_Scoutput, D2NW_Demote);

-  q15_convolution((Q15_T*)(mem_buf + 81600), D2CW, (Q15_T*)(mem_buf + 158400), L2_N,
+  q15_convolution((Q15_T*)mem_buf, D2CW, (Q15_T*)(mem_buf + 163328), L2_N,
     L2_HOUT, L2_WOUT, D2NW_COUT * D2NW_G, D2CW_HF, D2CW_WF, D2CW_CF, D2CW_COUT,
     L2_HOUT, L2_WOUT, D2CW_G, D2CW_HPADL, D2CW_HPADR, D2CW_WPADL, D2CW_WPADR,
     D2CW_HSTRIDE, D2CW_WSTRIDE, D2CW_HDILATION, D2CW_WDILATION, D2CW_Scinput,
     D2CW_Scoutput, D2CW_Demote);

-  q15_t_add_vec((Q15_T*)(mem_buf + 158400), D2CB, L2_N, L2_HOUT, L2_WOUT,
-    D2CW_COUT, (Q15_T*)(mem_buf + 158400), D2CB_Scten, D2CB_Scvec, D2CB_Scret);
+  q15_t_add_vec((Q15_T*)(mem_buf + 163328), D2CB, L2_N, L2_HOUT, L2_WOUT,
+    D2CW_COUT, (Q15_T*)(mem_buf + 163328), D2CB_Scten, D2CB_Scvec, D2CB_Scret);

-  q15_convolution((Q15_T*)(mem_buf + 81600), D2LW, (Q15_T*)(mem_buf + 172800),
+  q15_convolution((Q15_T*)mem_buf, D2LW, (Q15_T*)(mem_buf + 153600),
     L2_N, L2_HOUT, L2_WOUT, D2NW_COUT * D2NW_G, D2LW_HF, D2LW_WF, D2LW_CF,
     D2LW_COUT, L2_HOUT, L2_WOUT, D2LW_G, D2LW_HPADL, D2LW_HPADR, D2LW_WPADL,
     D2LW_WPADR, D2LW_HSTRIDE, D2LW_WSTRIDE, D2LW_HDILATION, D2LW_WDILATION,
     D2LW_Scinput, D2LW_Scoutput, D2LW_Demote);

-  q15_t_add_vec((Q15_T*)(mem_buf + 172800), D2LB, L2_N, L2_HOUT, L2_WOUT,
-    D2LW_COUT, (Q15_T*)(mem_buf + 172800), D2LB_Scten, D2LB_Scvec, D2LB_Scret);
+  q15_t_add_vec((Q15_T*)(mem_buf + 153600), D2LB, L2_N, L2_HOUT, L2_WOUT,
+    D2LW_COUT, (Q15_T*)(mem_buf + 153600), D2LB_Scten, D2LB_Scvec, D2LB_Scret);

   // MBConv Layer 3
-  q15_mbconv_block((Q15_T*)mem_buf, L3_F1, L3_W1, L3_B1, L3_F2, L3_W2, L3_B2,
-    L3_F3, L3_W3, L3_B3, (Q15_T*)(mem_buf + 81600), (Q15_T*)(mem_buf + 120000),
-    (Q15_T*)(mem_buf + 135616), L3_N, L3_H, L3_W, L3_CIN, L3_CTEMP, L3_HF,
+  q15_mbconv_block((Q15_T*)(mem_buf + 76800), L3_F1, L3_W1, L3_B1, L3_F2, L3_W2, L3_B2,
+    L3_F3, L3_W3, L3_B3, (Q15_T*)mem_buf, (Q15_T*)(mem_buf + 38400),
+    (Q15_T*)(mem_buf + 54016), L3_N, L3_H, L3_W, L3_CIN, L3_CTEMP, L3_HF,
     L3_WF, L3_COUT, L3_HOUT, L3_WOUT, L3_HPADL, L3_HPADR, L3_WPADL, L3_WPADR,
     L3_HSTRIDE, L3_WSTRIDE, L3_Limit1, L3_Limit2, L3_ShRU1, L3_ShRX1, L3_ShRU2,
     L3_ShRX2, L3_ShRU3, L3_ShRW3, L3_ShLU1, L3_ShLX1, L3_ShLU2, L3_ShLX2,
     L3_ShLU3, L3_ShLW3);

   // Detection Layer 3 Sub-Pipeline
-  q15_t_l2_norm((Q15_T*)(mem_buf + 81600), L3_N, L3_HOUT, L3_WOUT, L3_COUT,
-    (Q15_T*)mem_buf, D3_ScaleIn, D3_ScaleOut);
+  q15_t_l2_norm((Q15_T*)mem_buf, L3_N, L3_HOUT, L3_WOUT, L3_COUT,
+    (Q15_T*)(mem_buf + 38400), D3_ScaleIn, D3_ScaleOut);

-  q15_convolution((Q15_T*)mem_buf, D3NW, (Q15_T*)mem_buf,
+  q15_convolution((Q15_T*)(mem_buf + 38400), D3NW, (Q15_T*)(mem_buf + 38400),
     L3_N, L3_HOUT, L3_WOUT, L3_COUT, D3NW_HF, D3NW_WF, D3NW_CF, D3NW_COUT,
     L3_HOUT, L3_WOUT, D3NW_G, D3NW_HPADL, D3NW_HPADR, D3NW_WPADL, D3NW_WPADR,
     D3NW_HSTRIDE, D3NW_WSTRIDE, D3NW_HDILATION, D3NW_WDILATION, D3NW_Scinput,
     D3NW_Scoutput, D3NW_Demote);

-  q15_convolution((Q15_T*)mem_buf, D3CW, (Q15_T*)(mem_buf + 38400), L3_N,
+  q15_convolution((Q15_T*)(mem_buf + 38400), D3CW, (Q15_T*)(mem_buf + 94720), L3_N,
     L3_HOUT, L3_WOUT, D3NW_COUT * D3NW_G, D3CW_HF, D3CW_WF, D3CW_CF, D3CW_COUT,
     L3_HOUT, L3_WOUT, D3CW_G, D3CW_HPADL, D3CW_HPADR, D3CW_WPADL, D3CW_WPADR,
     D3CW_HSTRIDE, D3CW_WSTRIDE, D3CW_HDILATION, D3CW_WDILATION, D3CW_Scinput,
     D3CW_Scoutput, D3CW_Demote);

-  q15_t_add_vec((Q15_T*)(mem_buf + 38400), D3CB, L3_N, L3_HOUT, L3_WOUT,
-    D3CW_COUT, (Q15_T*)(mem_buf + 38400), D3CB_Scten, D3CB_Scvec, D3CB_Scret);
+  q15_t_add_vec((Q15_T*)(mem_buf + 94720), D3CB, L3_N, L3_HOUT, L3_WOUT,
+    D3CW_COUT, (Q15_T*)(mem_buf + 94720), D3CB_Scten, D3CB_Scvec, D3CB_Scret);

-  q15_convolution((Q15_T*)mem_buf, D3LW, (Q15_T*)(mem_buf + 39600),
+  q15_convolution((Q15_T*)(mem_buf + 38400), D3LW, (Q15_T*)(mem_buf + 76800),
     L3_N, L3_HOUT, L3_WOUT, D3NW_COUT * D3NW_G, D3LW_HF, D3LW_WF, D3LW_CF,
     D3LW_COUT, L3_HOUT, L3_WOUT, D3LW_G, D3LW_HPADL, D3LW_HPADR, D3LW_WPADL,
     D3LW_WPADR, D3LW_HSTRIDE, D3LW_WSTRIDE, D3LW_HDILATION, D3LW_WDILATION,
     D3LW_Scinput, D3LW_Scoutput, D3LW_Demote);

-  q15_t_add_vec((Q15_T*)(mem_buf + 39600), D3LB, L3_N, L3_HOUT, L3_WOUT,
-    D3LW_COUT, (Q15_T*)(mem_buf + 39600), D3LB_Scten, D3LB_Scvec, D3LB_Scret);
+  q15_t_add_vec((Q15_T*)(mem_buf + 76800), D3LB, L3_N, L3_HOUT, L3_WOUT,
+    D3LW_COUT, (Q15_T*)(mem_buf + 76800), D3LB_Scten, D3LB_Scvec, D3LB_Scret);

   // MBConv Layer 4
-  q15_mbconv_block((Q15_T*)(mem_buf + 81600), L4_F1, L4_W1, L4_B1, L4_F2,
-    L4_W2, L4_B2, L4_F3, L4_W3, L4_B3, (Q15_T*)mem_buf,
-    (Q15_T*)(mem_buf + 42000), (Q15_T*)(mem_buf + 57872), L4_N, L4_H, L4_W,
+  q15_mbconv_block((Q15_T*)mem_buf, L4_F1, L4_W1, L4_B1, L4_F2,
+    L4_W2, L4_B2, L4_F3, L4_W3, L4_B3, (Q15_T*)(mem_buf + 38400),
+    (Q15_T*)(mem_buf + 79360), (Q15_T*)(mem_buf + 96512), L4_N, L4_H, L4_W,
     L4_CIN, L4_CTEMP, L4_HF, L4_WF, L4_COUT, L4_HOUT, L4_WOUT, L4_HPADL,
     L4_HPADR, L4_WPADL, L4_WPADR, L4_HSTRIDE, L4_WSTRIDE, L4_Limit1, L4_Limit2,
     L4_ShRU1, L4_ShRX1, L4_ShRU2, L4_ShRX2, L4_ShRU3, L4_ShRW3, L4_ShLU1,
     L4_ShLX1, L4_ShLU2, L4_ShLX2, L4_ShLU3, L4_ShLW3);

   // MBConv3 + MBConv4
-  q15_t_add((Q15_T*)(mem_buf + 81600), (Q15_T*)mem_buf, L4_N, L4_HOUT,
-    L4_WOUT, L4_COUT, (Q15_T*)(mem_buf + 81600), L4_Scten1, L4_Scten2, L4_Scret);
+  q15_t_add((Q15_T*)mem_buf, (Q15_T*)(mem_buf + 38400), L4_N, L4_HOUT,
+    L4_WOUT, L4_COUT, (Q15_T*)mem_buf, L4_Scten1, L4_Scten2, L4_Scret);

   // Detection Layer 4 Sub-Pipeline
-  q15_convolution((Q15_T*)(mem_buf + 81600), D4CW, (Q15_T*)(mem_buf + 36000),
+  q15_convolution((Q15_T*)mem_buf, D4CW, (Q15_T*)(mem_buf + 40960),
     L4_N, L4_HOUT, L4_WOUT, L4_COUT, D4CW_HF, D4CW_WF, D4CW_CF, D4CW_COUT,
     L4_HOUT, L4_WOUT, D4CW_G, D4CW_HPADL, D4CW_HPADR, D4CW_WPADL, D4CW_WPADR,
     D4CW_HSTRIDE, D4CW_WSTRIDE, D4CW_HDILATION, D4CW_WDILATION, D4CW_Scinput,
     D4CW_Scoutput, D4CW_Demote);

-  q15_t_add_vec((Q15_T*)(mem_buf + 36000), D4CB, L4_N, L4_HOUT, L4_WOUT,
-    D4CW_COUT, (Q15_T*)(mem_buf + 36000), D4CB_Scten, D4CB_Scvec, D4CB_Scret);
+  q15_t_add_vec((Q15_T*)(mem_buf + 40960), D4CB, L4_N, L4_HOUT, L4_WOUT,
+    D4CW_COUT, (Q15_T*)(mem_buf + 40960), D4CB_Scten, D4CB_Scvec, D4CB_Scret);

-  q15_convolution((Q15_T*)(mem_buf + 81600), D4LW, (Q15_T*)(mem_buf + 42000),
+  q15_convolution((Q15_T*)mem_buf, D4LW, (Q15_T*)(mem_buf + 45824),
     L4_N, L4_HOUT, L4_WOUT, L4_COUT, D4LW_HF, D4LW_WF, D4LW_CF,
     D4LW_COUT, L4_HOUT, L4_WOUT, D4LW_G, D4LW_HPADL, D4LW_HPADR, D4LW_WPADL,
     D4LW_WPADR, D4LW_HSTRIDE, D4LW_WSTRIDE, D4LW_HDILATION, D4LW_WDILATION,
     D4LW_Scinput, D4LW_Scoutput, D4LW_Demote);

-  q15_t_add_vec((Q15_T*)(mem_buf + 42000), D4LB, L4_N, L4_HOUT, L4_WOUT,
-    D4LW_COUT, (Q15_T*)(mem_buf + 42000), D4LB_Scten, D4LB_Scvec, D4LB_Scret);
+  q15_t_add_vec((Q15_T*)(mem_buf + 45824), D4LB, L4_N, L4_HOUT, L4_WOUT,
+    D4LW_COUT, (Q15_T*)(mem_buf + 45824), D4LB_Scten, D4LB_Scvec, D4LB_Scret);

   // Re-ordering the outputs
   memset(mem_buf, 0, sizeof(Q15_T) * 18000);
-  memcpy(&mem_buf_offset_q15[0], &mem_buf_offset_q15[38400], 2400 * sizeof(Q15_T));
-  memcpy(&mem_buf_offset_q15[2400], &mem_buf_offset_q15[79200], 2400 * sizeof(Q15_T));
+  memcpy(&mem_buf_offset_q15[0], &mem_buf_offset_q15[89344], 2400 * sizeof(Q15_T));
+  memcpy(&mem_buf_offset_q15[2400], &mem_buf_offset_q15[81664], 2400 * sizeof(Q15_T));

   for (ITER_T i = 0; i < 600; i++) {
-    mem_buf_offset_q15[4800 + i] = (mem_buf_offset_q15[19200 + i] / 2);
+    mem_buf_offset_q15[4800 + i] = (mem_buf_offset_q15[47360 + i] / 2);
   }

-  memcpy(&mem_buf_offset_q15[5400], &mem_buf_offset_q15[18000], 600 * sizeof(Q15_T));
-  memcpy(&mem_buf_offset_q15[6000], &mem_buf_offset_q15[81600], 4800 * sizeof(Q15_T));
-  memcpy(&mem_buf_offset_q15[10800], &mem_buf_offset_q15[86400], 4800 * sizeof(Q15_T));
-  memcpy(&mem_buf_offset_q15[15600], &mem_buf_offset_q15[19800], 1200 * sizeof(Q15_T));
+  memcpy(&mem_buf_offset_q15[5400], &mem_buf_offset_q15[20480], 600 * sizeof(Q15_T));
+  memcpy(&mem_buf_offset_q15[6000], &mem_buf_offset_q15[84480], 4800 * sizeof(Q15_T));
+  memcpy(&mem_buf_offset_q15[10800], &mem_buf_offset_q15[76800], 4800 * sizeof(Q15_T));
+  memcpy(&mem_buf_offset_q15[15600], &mem_buf_offset_q15[38400], 1200 * sizeof(Q15_T));

   for (ITER_T i = 0; i < 1200; i++) {
-    mem_buf_offset_q15[16800 + i] = (mem_buf_offset_q15[21000 + i] / 2);
+    mem_buf_offset_q15[16800 + i] = (mem_buf_offset_q15[22912 + i] / 2);
   }
 }
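Throughout q_face_detection_sparse above, this commit re-bases every intermediate tensor onto offsets inside the single caller-provided arena `mem_buf`, with regions such as offset 0 and offset 76800 alternating as input and output halves so no per-layer allocations are needed. A toy illustration of that ping-pong pattern follows; the function names and the negation stage are invented for the example.

```c
#include <stdint.h>
#include <string.h>
#include <assert.h>

typedef int8_t Q7_T;

/* Stand-in for a pipeline stage: reads n values from in, writes n to out. */
static void stage_negate(const Q7_T* in, Q7_T* out, size_t n) {
  for (size_t i = 0; i < n; i++) {
    out[i] = (Q7_T)(-in[i]);
  }
}

/* Two stages sharing one arena: stage 1 reads the low half and writes the
 * high half; stage 2 reads the high half and writes back into the low half. */
void run_pipeline(char* mem_buf, size_t half) {
  Q7_T* lo = (Q7_T*)mem_buf;           /* offset 0 */
  Q7_T* hi = (Q7_T*)(mem_buf + half);  /* e.g. offset 76800 in the code above */
  stage_negate(lo, hi, half);
  stage_negate(hi, lo, half);
}
```

The design choice the diff makes is exactly this: because each kernel's output lands in the half its successor reads from, peak memory stays at one fixed arena regardless of pipeline depth.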
@@ -4,7 +4,7 @@
 include ../config.mk

 INCLUDE_DIR=../include
-IFLAGS = -I $(INCLUDE_DIR)
+IFLAGS=-I $(INCLUDE_DIR)

 all: utils.o fastgrnn.o classifier.o rnnpool.o quantized_utils.o quantized_fastgrnn.o quantized_rnnpool.o quantized_mbconv.o
@ -1,11 +1,12 @@
|
|||
// Copyright (c) Microsoft Corporation. All rights reserved.
|
||||
// Licensed under the MIT license.
|
||||
|
||||
#include <string.h>
|
||||
#include "quantized_fastgrnn.h"
|
||||
|
||||
int q15_fastgrnn_lr(Q15_T* const hiddenState, ITER_T hiddenDims,
|
||||
const Q15_T* const input, ITER_T inputDims, ITER_T steps, const void* params,
|
||||
void* buffers, const void *scales, int backward, int normalize) {
|
||||
void* buffers, const void* scales, int backward, int normalize) {
|
||||
|
||||
const Q15_FastGRNN_LR_Params* tparams = (const Q15_FastGRNN_LR_Params*)params;
|
||||
Q15_FastGRNN_LR_Buffers* tbuffers = (Q15_FastGRNN_LR_Buffers*)buffers;
|
||||
|
@@ -41,31 +42,28 @@ int q15_fastgrnn_lr(Q15_T* const hiddenState, ITER_T hiddenDims,

   // Process the new input and previous hidden state
   q15_m_mulvec(tparams->W1, tbuffers->normFeatures, tparams->wRank,
-    inputDims, tbuffers->tempLRW, tscales->W1, tscales->normFeaturesMVW1,
-    tscales->H1W1, tscales->H2W1);
+    inputDims, tbuffers->tempLRW, tscales->w1, tscales->normFeaturesMVW1,
+    tscales->mVW1Out);
   q15_m_mulvec(tparams->W2, tbuffers->tempLRW, hiddenDims, tparams->wRank,
-    tbuffers->preComp1, tscales->W2, tscales->tempLRW, tscales->H1W2,
-    tscales->H2W2);
+    tbuffers->preComp1, tscales->w2, tscales->tempLRW, tscales->mVW2Out);
   q15_m_mulvec(tparams->U1, hiddenState, tparams->uRank, hiddenDims,
-    tbuffers->tempLRU, tscales->U1, tscales->hiddenStateMVU1,tscales->H1U1,
-    tscales->H2U1);
+    tbuffers->tempLRU, tscales->u1, tscales->hiddenStateMVU1,tscales->mVU1Out);
   q15_m_mulvec(tparams->U2, tbuffers->tempLRU, hiddenDims, tparams->uRank,
-    tbuffers->preComp2, tscales->U2, tscales->tempLRU, tscales->H1U2,
-    tscales->H2U2);
+    tbuffers->preComp2, tscales->u2, tscales->tempLRU, tscales->mVU2Out);
   q15_v_add(tbuffers->preComp1, tbuffers->preComp2, hiddenDims,
     tbuffers->preComp1, tscales->mV2AddMV4, tscales->mV4AddMV2,
     tscales->mV2AddMV4Out, tscales->mV2AddMV4Demote);

   // Apply the gate to generate the new hidden state
   q15_v_add(tbuffers->preComp1, tparams->Bg, hiddenDims, tbuffers->preComp2,
-    tscales->pC1AddBg, tscales->Bg, tscales->pC1AddBgOut,
+    tscales->pC1AddBg, tscales->bg, tscales->pC1AddBgOut,
     tscales->pC1AddBgDemote);
   q15_v_sigmoid(tbuffers->preComp2, hiddenDims, tbuffers->preComp2,
     tscales->div, tscales->add, tscales->sigmoidLimit,
     tscales->sigmoidScaleIn, tscales->sigmoidScaleOut,
     tscales->useTableSigmoid);
   q15_v_add(tbuffers->preComp1, tparams->Bh, hiddenDims, tbuffers->preComp1,
-    tscales->pC1AddBh, tscales->Bh, tscales->pC1AddBhOut,
+    tscales->pC1AddBh, tscales->bh, tscales->pC1AddBhOut,
     tscales->pC1AddBhDemote);
   q15_v_tanh(tbuffers->preComp1, hiddenDims, tbuffers->preComp1,
     tscales->tanhScaleIn, tscales->tanhScaleOut, tscales->useTableTanH);
@@ -126,19 +124,20 @@ int q7xq15_q15_fastgrnn(Q15_T* const hiddenState, ITER_T hiddenDims,

   // Process the new input and previous hidden state
 #ifdef SPARSE
   memset(tbuffers->preComp1, 0, hiddenDims * sizeof(Q15_T));
   memset(tbuffers->preComp2, 0, hiddenDims * sizeof(Q15_T));
   q15xq7_q15_m_sparse_mulvec(tparams->Wids, tparams->Wvals,
-    tbuffers->normFeatures, hiddenDims, inputDims, tbuffers->preComp1,
-    tscales->W, tscales->normFeaturesMVW, tscales->H1W, tscales->H2W);
+    tbuffers->normFeatures, inputDims, tbuffers->preComp1,
+    tscales->w, tscales->normFeaturesMVW, tscales->mVWOut);
   q15_m_sparse_mulvec(tparams->Uids, tparams->Uvals, hiddenState,
-    hiddenDims, hiddenDims, tbuffers->preComp2, tscales->U,
-    tscales->hiddenStateMVU, tscales->H1U, tscales->H2U);
+    hiddenDims, tbuffers->preComp2, tscales->u, tscales->hiddenStateMVU,
+    tscales->mVUOut);
 #else
   q15xq7_q15_m_mulvec(tparams->W, tbuffers->normFeatures, hiddenDims,
-    inputDims, tbuffers->preComp1, tscales->W, tscales->normFeaturesMVW,
-    tscales->H1W, tscales->H2W);
+    inputDims, tbuffers->preComp1, tscales->w, tscales->normFeaturesMVW,
+    tscales->mVWOut);
   q15_m_mulvec(tparams->U, hiddenState, hiddenDims, hiddenDims,
-    tbuffers->preComp2, tscales->U, tscales->hiddenStateMVU, tscales->H1U,
-    tscales->H2U);
+    tbuffers->preComp2, tscales->u, tscales->hiddenStateMVU, tscales->mVUOut);
 #endif
   q15_v_add(tbuffers->preComp1, tbuffers->preComp2, hiddenDims,
     tbuffers->preComp1, tscales->mV1AddMV2, tscales->mV2AddMV1,
@@ -146,14 +145,14 @@ int q7xq15_q15_fastgrnn(Q15_T* const hiddenState, ITER_T hiddenDims,

   // Apply the gate to generate the new hidden state
   q15_v_add(tbuffers->preComp1, tparams->Bg, hiddenDims, tbuffers->preComp2,
-            tscales->pC1AddBg, tscales->Bg, tscales->pC1AddBgOut,
+            tscales->pC1AddBg, tscales->bg, tscales->pC1AddBgOut,
             tscales->pC1AddBgDemote);
   q15_v_sigmoid(tbuffers->preComp2, hiddenDims, tbuffers->preComp2,
                 tscales->div, tscales->add, tscales->sigmoidLimit,
                 tscales->sigmoidScaleIn, tscales->sigmoidScaleOut,
                 tscales->useTableSigmoid);
   q15_v_add(tbuffers->preComp1, tparams->Bh, hiddenDims, tbuffers->preComp1,
-            tscales->pC1AddBh, tscales->Bh, tscales->pC1AddBhOut,
+            tscales->pC1AddBh, tscales->bh, tscales->pC1AddBhOut,
             tscales->pC1AddBhDemote);
   q15_v_tanh(tbuffers->preComp1, hiddenDims, tbuffers->preComp1,
              tscales->tanhScaleIn, tscales->tanhScaleOut, tscales->useTableTanH);
@@ -214,19 +213,19 @@ int q15_fastgrnn(Q15_T* const hiddenState, ITER_T hiddenDims,

   // Process the new input and previous hidden state
 #ifdef SPARSE
+  memset(tbuffers->preComp1, 0, hiddenDims * sizeof(Q15_T));
+  memset(tbuffers->preComp2, 0, hiddenDims * sizeof(Q15_T));
   q15_m_sparse_mulvec(tparams->Wids, tparams->Wvals, tbuffers->normFeatures,
-    hiddenDims, inputDims, tbuffers->preComp1, tscales->W,
-    tscales->normFeaturesMVW, tscales->H1W, tscales->H2W);
+    inputDims, tbuffers->preComp1, tscales->w, tscales->normFeaturesMVW,
+    tscales->mVWOut);
   q15_m_sparse_mulvec(tparams->Uids, tparams->Uvals, hiddenState,
-    hiddenDims, hiddenDims, tbuffers->preComp2, tscales->U,
-    tscales->hiddenStateMVU, tscales->H1U, tscales->H2U);
+    hiddenDims, tbuffers->preComp2, tscales->u, tscales->hiddenStateMVU,
+    tscales->mVUOut);
 #else
   q15_m_mulvec(tparams->W, tbuffers->normFeatures, hiddenDims, inputDims,
-    tbuffers->preComp1, tscales->W, tscales->normFeaturesMVW, tscales->H1W,
-    tscales->H2W);
+    tbuffers->preComp1, tscales->w, tscales->normFeaturesMVW, tscales->mVWOut);
   q15_m_mulvec(tparams->U, hiddenState, hiddenDims, hiddenDims,
-    tbuffers->preComp2, tscales->U, tscales->hiddenStateMVU, tscales->H1U,
-    tscales->H2U);
+    tbuffers->preComp2, tscales->u, tscales->hiddenStateMVU, tscales->mVUOut);
 #endif
   q15_v_add(tbuffers->preComp1, tbuffers->preComp2, hiddenDims,
             tbuffers->preComp1, tscales->mV1AddMV2, tscales->mV2AddMV1,
@@ -234,14 +233,14 @@ int q15_fastgrnn(Q15_T* const hiddenState, ITER_T hiddenDims,

   // Apply the gate to generate the new hidden state
   q15_v_add(tbuffers->preComp1, tparams->Bg, hiddenDims, tbuffers->preComp2,
-            tscales->pC1AddBg, tscales->Bg, tscales->pC1AddBgOut,
+            tscales->pC1AddBg, tscales->bg, tscales->pC1AddBgOut,
             tscales->pC1AddBgDemote);
   q15_v_sigmoid(tbuffers->preComp2, hiddenDims, tbuffers->preComp2,
                 tscales->div, tscales->add, tscales->sigmoidLimit,
                 tscales->sigmoidScaleIn, tscales->sigmoidScaleOut,
                 tscales->useTableSigmoid);
   q15_v_add(tbuffers->preComp1, tparams->Bh, hiddenDims, tbuffers->preComp1,
-            tscales->pC1AddBh, tscales->Bh, tscales->pC1AddBhOut,
+            tscales->pC1AddBh, tscales->bh, tscales->pC1AddBhOut,
             tscales->pC1AddBhDemote);
   q15_v_tanh(tbuffers->preComp1, hiddenDims, tbuffers->preComp1,
              tscales->tanhScaleIn, tscales->tanhScaleOut, tscales->useTableTanH);
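The fixed-point calls in the hunks above follow the standard FastGRNN cell update. As orientation, here is a hedged float sketch of one step in Python; the equations are the published FastGRNN cell (gate `z`, candidate `h_tilde`, trained scalars `zeta` and `nu`), but the flat-list representation and the function name are illustrative, not the library's API:

```python
import math

def fastgrnn_step(x, h, W, U, bg, bh, zeta, nu):
    """One float FastGRNN step, mirroring the call order in the diff:
    pre = W @ x + U @ h is computed once (preComp1/preComp2 merged by
    q15_v_add), then reused for both the sigmoid gate and the tanh
    candidate before the final combination."""
    n = len(h)
    pre = [sum(W[i][j] * x[j] for j in range(len(x))) +
           sum(U[i][j] * h[j] for j in range(n)) for i in range(n)]
    # Gate: z = sigmoid(pre + bg)   (q15_v_add with Bg, then q15_v_sigmoid)
    z = [1.0 / (1.0 + math.exp(-(pre[i] + bg[i]))) for i in range(n)]
    # Candidate: h_tilde = tanh(pre + bh)  (q15_v_add with Bh, then q15_v_tanh)
    h_tilde = [math.tanh(pre[i] + bh[i]) for i in range(n)]
    # New state: (zeta * (1 - z) + nu) * h_tilde + z * h
    return [(zeta * (1.0 - z[i]) + nu) * h_tilde[i] + z[i] * h[i]
            for i in range(n)]
```

The quantized kernels compute the same expressions over Q15 integers, with the renamed `tscales` fields supplying the per-operation scale factors.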
@@ -2,7 +2,6 @@
 // Licensed under the MIT license.

-#include <stddef.h>
 #include <string.h>
 #include "quantized_utils.h"

 void q15_v_add(const Q15_T* vec1, const Q15_T* vec2, ITER_T len, Q15_T* ret,
@@ -497,14 +496,46 @@ void q15_v_scale_down(const Q15_T* vec, ITER_T len, Q15_T* ret, SCALE_T scvec) {
   }
 }

+void q15_m_reverse(const Q15_T* const mat, ITER_T nrows, ITER_T ncols,
+                   ITER_T axis, Q15_T* const ret) {
+  ITER_T len = nrows * ncols;
+
+  if (axis == 0) {
+    ITER_T col_counter = 0, row_index = len - ncols;
+
+    for (ITER_T i = 0; i < len; i++) {
+      if (col_counter >= ncols) {
+        col_counter = 0;
+        row_index -= ncols;
+      }
+
+      ret[i] = mat[row_index + col_counter];
+      col_counter++;
+    }
+  } else {
+    S_ITER_T row_counter = ncols - 1;
+    ITER_T col_index = 0;
+
+    for (ITER_T i = 0; i < len; i++) {
+      if (row_counter < 0) {
+        row_counter = ncols - 1;
+        col_index += ncols;
+      }
+
+      ret[i] = mat[col_index + (ITER_T)row_counter];
+      row_counter--;
+    }
+  }
+}
+
 void q15xq7_q15_m_mulvec(const Q15_T* mat, const Q7_T* const vec, ITER_T nrows,
                          ITER_T ncols, Q15_T* ret, SCALE_T scmat,
-                         SCALE_T scvec, SCALE_T H1, SCALE_T H2) {
+                         SCALE_T scvec, SCALE_T scret) {
   Q31_T sum;
 #ifdef SHIFT
-  SCALE_T scale = scmat + scvec + H1;
+  SCALE_T scale = scmat + scvec + scret;
 #else
-  SCALE_T scale = scmat * scvec * H1;
+  SCALE_T scale = scmat * scvec * scret;
 #endif

   while (nrows--) {
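The newly added `q15_m_reverse` walks a flattened row-major matrix with hand-rolled index arithmetic. A small pure-Python model of the same semantics (list representation and function name are illustrative) makes the intent easy to check: axis 0 reverses the order of the rows, axis 1 reverses the elements within each row.

```python
def m_reverse(mat, nrows, ncols, axis):
    """Model of q15_m_reverse over a flat row-major list `mat`."""
    rows = [mat[r * ncols:(r + 1) * ncols] for r in range(nrows)]
    if axis == 0:
        rows = rows[::-1]                    # flip the row order
    else:
        rows = [row[::-1] for row in rows]   # flip within each row
    return [v for row in rows for v in row]
```

The C version avoids building intermediate rows and instead decrements `row_index` (axis 0) or `row_counter` (axis 1) as it emits `ret[i]`, which matters on the memory-constrained M3 targets this release adds.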
@@ -537,15 +568,15 @@ void q15xq7_q15_m_mulvec(const Q15_T* mat, const Q7_T* const vec, ITER_T nrows,

 void q15_m_mulvec(const Q15_T* mat, const Q15_T* const vec, ITER_T nrows,
                   ITER_T ncols, Q15_T* ret, SCALE_T scmat, SCALE_T scvec,
-                  SCALE_T H1, SCALE_T H2) {
+                  SCALE_T scret) {
   Q63_T sum;
 #ifdef SHIFT
-  SCALE_T scale = scmat + scvec + H1;
+  SCALE_T scale = scmat + scvec + scret;
 #else
   // Be careful, the below implementation would not work if the denominator
   // exceeds the range of Q31_T range. In such a case, cast the denominator
   // to int64_t.
-  SCALE_T scale = scmat * scvec * H1;
+  SCALE_T scale = scmat * scvec * scret;
 #endif

   while (nrows--) {
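These hunks replace the `H1`/`H2` pair with a single `scret`, and the combination rule differs per build: in SHIFT builds the scales are bit counts, so they add; otherwise they are plain divisors, so they multiply. A sketch of the equivalence, assuming (as the `//128`-style annotations in the scale tables elsewhere in this diff suggest) that the non-shift scales are powers of two:

```python
def scale_down_shift(acc, scmat, scvec, scret):
    # SHIFT build: scales are exponents (bit counts), so they add.
    return acc >> (scmat + scvec + scret)

def scale_down_div(acc, scmat, scvec, scret):
    # Non-SHIFT build: scales are divisors, so they multiply.
    # Python's // floors like >>, so the two agree for powers of two.
    return acc // (scmat * scvec * scret)
```

Note the comment retained in the diff: in the non-shift build the product `scmat * scvec * scret` is formed in `SCALE_T`, so it must stay within Q31 range or be cast up to `int64_t`.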
@@ -578,22 +609,20 @@ void q15_m_mulvec(const Q15_T* mat, const Q15_T* const vec, ITER_T nrows,

 void q15xq7_q15_m_sparse_mulvec(const ITER_T* row_indices,
                                 const Q15_T* mat_values, const Q7_T* vec,
-                                ITER_T nrows, ITER_T ncols, Q15_T* ret,
-                                SCALE_T scmat, SCALE_T scvec, SCALE_T H1,
-                                SCALE_T H2) {
+                                ITER_T nelem, Q15_T* ret, SCALE_T scmat,
+                                SCALE_T scvec, SCALE_T scret) {
   ITER_T index;
   Q31_T vec_offset;
-  memset(ret, 0, nrows * sizeof(Q15_T));
 #ifdef SHIFT
-  SCALE_T scale = scmat + scvec + H1;
+  SCALE_T scale = scmat + scvec + scret;
 #else
   // Be careful, the below implementation would not work if the denominator
   // exceeds the range of Q31_T range. In such a case, cast the denominator
   // to int64_t.
-  SCALE_T scale = scmat * scvec * H1;
+  SCALE_T scale = scmat * scvec * scret;
 #endif

-  while (ncols--) {
+  while (nelem--) {
     index = *row_indices++;
     vec_offset = *vec++;

@@ -609,22 +638,20 @@ void q15xq7_q15_m_sparse_mulvec(const ITER_T* row_indices,
 }

 void q15_m_sparse_mulvec(const ITER_T* row_indices, const Q15_T* mat_values,
-                         const Q15_T* vec, ITER_T nrows, ITER_T ncols,
-                         Q15_T* ret, SCALE_T scmat, SCALE_T scvec, SCALE_T H1,
-                         SCALE_T H2) {
+                         const Q15_T* vec, ITER_T nelem, Q15_T* ret,
+                         SCALE_T scmat, SCALE_T scvec, SCALE_T scret) {
   ITER_T index;
   Q31_T vec_offset;
-  memset(ret, 0, nrows * sizeof(Q15_T));
 #ifdef SHIFT
-  SCALE_T scale = scmat + scvec + H1;
+  SCALE_T scale = scmat + scvec + scret;
 #else
   // Be careful, the below implementation would not work if the denominator
   // exceeds the range of Q31_T range. In such a case, cast the denominator
   // to int64_t.
-  SCALE_T scale = scmat * scvec * H1;
+  SCALE_T scale = scmat * scvec * scret;
 #endif

-  while (ncols--) {
+  while (nelem--) {
     index = *row_indices++;
     vec_offset = *vec++;

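The sparse kernels now take `nelem` (the number of vector entries) and no longer zero `ret` themselves; callers such as the FastGRNN routines `memset` the output buffers instead. Judging from the loop fragment shown (its inner body is truncated here), the storage format appears to be runs of 1-based row indices terminated by 0, one run per vector entry. A hedged pure-Python model of that reading; the index stream in the test below is a guess constructed to reproduce the unit test's `expected[3] = {87, 3, 48}`, not the repository's actual `qrow_indices` data:

```python
def sparse_mulvec(row_indices, mat_values, vec, out_len):
    """Sketch of the assumed zero-terminated sparse format: for each entry
    of `vec`, consume 1-based row indices until a 0 terminator, adding
    mat_value * vec_entry into ret[index - 1]."""
    ret = [0] * out_len
    pos = 0
    vals = iter(mat_values)
    for v in vec:
        while row_indices[pos] != 0:
            ret[row_indices[pos] - 1] += next(vals) * v
            pos += 1
        pos += 1  # skip the 0 terminator for this vector entry
    return ret
```

With unit scales this matches the non-shift test call `q15_m_sparse_mulvec(..., 3, &pred[0], 1, 1, 1)` in spirit; the real kernel additionally divides each accumulation by the combined scale.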
@@ -6,7 +6,7 @@ include ../config.mk
 INCLUDE_DIR=../include
 MODEL_DIR=../models
 SRC_DIR=../src
-IFLAGS = -I $(INCLUDE_DIR) -I $(MODEL_DIR)
+IFLAGS=-I $(INCLUDE_DIR) -I $(MODEL_DIR)

 all: test_fastgrnn_lr test_rnnpool test_quantized_utils test_quantized_fastgrnn test_quantized_rnnpool test_quantized_mbconv test_quantized_face_detection test_quantized_face_detection_fast test_quantized_face_detection_sparse

@@ -23,9 +23,9 @@
 #endif

 // Comparator function for sorting doubles.
-int compare_doubles(const void *a, const void *b) {
-  const double *da = (const double *) a;
-  const double *db = (const double *) b;
+int compare_doubles(const void* a, const void* b) {
+  const double* da = (const double*) a;
+  const double* db = (const double*) b;

   return (*da > *db) - (*da < *db);
 }

@@ -57,7 +57,7 @@ double aggregate_error(double* errors, unsigned len) {
 /**
  * By default, all tests run without using bit-shifting operations.
  */
-int main(int argc, char **argv) {
+int main(int argc, char** argv) {
   unsigned patches;
   SCALE_T XScale = -1, YScale = 12;
   FILE *xFile, *yFile, *floatResFile, *outputLog;

@@ -152,11 +152,11 @@ int main(int argc, char **argv) {
   char* mem_buf = malloc(MEM_BUF_SIZE * sizeof(char));
   double* xLine = malloc(INPUT_IMG_HEIGHT * INPUT_IMG_WIDTH * sizeof(double));
   double* yLine = malloc(OUTPUT_SIZE * sizeof(double));
-  double* allErrors = malloc(patches * OUTPUT_SIZE * (sizeof(double)));
+  double* allErrors = malloc(patches * OUTPUT_SIZE * sizeof(double));

   float time_spent = 0.0;
-  Q7_T* mem_buf_input_offset = (Q7_T *)mem_buf;
-  Q15_T* mem_buf_output_offset = (Q15_T *)(mem_buf + INPUT_IMG_HEIGHT * INPUT_IMG_WIDTH);
+  Q7_T* mem_buf_input_offset = (Q7_T*)mem_buf;
+  Q15_T* mem_buf_output_offset = (Q15_T*)(mem_buf + INPUT_IMG_HEIGHT * INPUT_IMG_WIDTH);
   for (unsigned i = 0; i < patches; i++) {
     fread(&yLine[0], sizeof(double), OUTPUT_SIZE, floatResFile);
     fread(&xLine[0], sizeof(double), INPUT_IMG_HEIGHT * INPUT_IMG_WIDTH, xFile);
@@ -23,9 +23,9 @@
 #endif

 // Comparator function for sorting floats.
-int compare_floats(const void *a, const void *b) {
-  const float *da = (const float *) a;
-  const float *db = (const float *) b;
+int compare_floats(const void* a, const void* b) {
+  const float* da = (const float*) a;
+  const float* db = (const float*) b;

   return (*da > *db) - (*da < *db);
 }

@@ -57,7 +57,7 @@ float aggregate_error(float* errors, unsigned len) {
 /**
  * By default, all tests run without using bit-shifting operations.
  */
-int main(int argc, char **argv) {
+int main(int argc, char** argv) {
   unsigned patches;
   SCALE_T XScale = -1, YScale = 12;
   FILE *xFile, *yFile, *floatResFile, *outputLog;

@@ -152,11 +152,11 @@ int main(int argc, char **argv) {
   char* mem_buf = malloc(MEM_BUF_SIZE * sizeof(char));
   float* xLine = malloc(INPUT_IMG_HEIGHT * INPUT_IMG_WIDTH * sizeof(float));
   float* yLine = malloc(OUTPUT_SIZE * sizeof(float));
-  float* allErrors = malloc(patches * OUTPUT_SIZE * (sizeof(float)));
+  float* allErrors = malloc(patches * OUTPUT_SIZE * sizeof(float));

   float time_spent = 0.0;
-  Q7_T* mem_buf_input_offset = (Q7_T *)mem_buf;
-  Q15_T* mem_buf_output_offset = (Q15_T *)(mem_buf + 2 * OUTPUT_SIZE);
+  Q7_T* mem_buf_input_offset = (Q7_T*)mem_buf;
+  Q15_T* mem_buf_output_offset = (Q15_T*)(mem_buf + 2 * OUTPUT_SIZE);
   for (unsigned i = 0; i < patches; i++) {
     fread(&yLine[0], sizeof(float), OUTPUT_SIZE, floatResFile);
     fread(&xLine[0], sizeof(float), INPUT_IMG_HEIGHT * INPUT_IMG_WIDTH, xFile);
@@ -10,7 +10,7 @@
 #include "quantized_datatypes.h"
 #include "quantized_face_detection_sparse.h"

-#define MEM_BUF_SIZE 188160
+#define MEM_BUF_SIZE 184576
 #define INPUT_IMG_HEIGHT 240
 #define INPUT_IMG_WIDTH 320
 #define OUTPUT_SIZE 18000

@@ -23,9 +23,9 @@
 #endif

 // Comparator function for sorting doubles.
-int compare_doubles(const void *a, const void *b) {
-  const double *da = (const double *) a;
-  const double *db = (const double *) b;
+int compare_doubles(const void* a, const void* b) {
+  const double* da = (const double*) a;
+  const double* db = (const double*) b;

   return (*da > *db) - (*da < *db);
 }

@@ -57,7 +57,7 @@ double aggregate_error(double* errors, unsigned len) {
 /**
  * By default, all tests run without using bit-shifting operations.
  */
-int main(int argc, char **argv) {
+int main(int argc, char** argv) {
   unsigned patches;
   SCALE_T XScale = -1, YScale = 12;
   FILE *xFile, *yFile, *floatResFile, *outputLog;

@@ -152,11 +152,11 @@ int main(int argc, char **argv) {
   char* mem_buf = malloc(MEM_BUF_SIZE * sizeof(char));
   double* xLine = malloc(INPUT_IMG_HEIGHT * INPUT_IMG_WIDTH * sizeof(double));
   double* yLine = malloc(OUTPUT_SIZE * sizeof(double));
-  double* allErrors = malloc(patches * OUTPUT_SIZE * (sizeof(double)));
+  double* allErrors = malloc(patches * OUTPUT_SIZE * sizeof(double));

   float time_spent = 0.0;
-  Q7_T* mem_buf_input_offset = (Q7_T *)mem_buf;
-  Q15_T* mem_buf_output_offset = (Q15_T *)mem_buf;
+  Q7_T* mem_buf_input_offset = (Q7_T*)(mem_buf + INPUT_IMG_HEIGHT * INPUT_IMG_WIDTH);
+  Q15_T* mem_buf_output_offset = (Q15_T*)mem_buf;
   for (unsigned i = 0; i < patches; i++) {
     fread(&yLine[0], sizeof(double), OUTPUT_SIZE, floatResFile);
     fread(&xLine[0], sizeof(double), INPUT_IMG_HEIGHT * INPUT_IMG_WIDTH, xFile);
@@ -44,27 +44,25 @@ static Q15_FastGRNN_Buffers rnn1_buffers = {
   .meanSub = 0,
   .stdDev = 0,
   .normFeaturesHDStdDev = 0,
-  .W = 7, //128
+  .w = 7, //128
   .normFeaturesMVW = 6, //64
-  .H1W = 2,
-  .H2W = 0,
-  .U = 7, //128
+  .mVWOut = 2,
+  .u = 7, //128
   .hiddenStateMVU = 6, //64
-  .H1U = 3,
-  .H2U = 0,
+  .mVUOut = 3,
   .mV1AddMV2 = 0, //1
   .mV2AddMV1 = 2, //4
   .mV1AddMV2Out = 0, //1
   .mV1AddMV2Demote = 0, //1
   .pC1AddBg = 0, //1
-  .Bg = 3, //8
+  .bg = 3, //8
   .pC1AddBgOut = 0, //1
   .pC1AddBgDemote = 0, //1
   .sigmoidLimit = 2048,
   .sigmoidScaleIn = 11, //2048
   .sigmoidScaleOut = 14, //16384
   .pC1AddBh = 0, //1
-  .Bh = 4, //16
+  .bh = 4, //16
   .pC1AddBhOut = 0, //1
   .pC1AddBhDemote = 0, //1
   .tanhScaleIn = 11, //2048

@@ -100,27 +98,25 @@ static Q15_FastGRNN_Buffers rnn1_buffers = {
   .meanSub = 0,
   .stdDev = 0,
   .normFeaturesHDStdDev = 0,
-  .W = 128,
+  .w = 128,
   .normFeaturesMVW = 64,
-  .H1W = 4,
-  .H2W = 0,
-  .U = 128,
+  .mVWOut = 4,
+  .u = 128,
   .hiddenStateMVU = 64,
-  .H1U = 8,
-  .H2U = 0,
+  .mVUOut = 8,
   .mV1AddMV2 = 1,
   .mV2AddMV1 = 4,
   .mV1AddMV2Out = 1,
   .mV1AddMV2Demote = 1,
   .pC1AddBg = 1,
-  .Bg = 8,
+  .bg = 8,
   .pC1AddBgOut = 1,
   .pC1AddBgDemote = 1,
   .sigmoidLimit = 2048,
   .sigmoidScaleIn = 11, //2048
   .sigmoidScaleOut = 14, //16384
   .pC1AddBh = 1,
-  .Bh = 16,
+  .bh = 16,
   .pC1AddBhOut = 1,
   .pC1AddBhDemote = 1,
   .tanhScaleIn = 11, //2048
@@ -42,27 +42,25 @@ static Q15_FastGRNN_Buffers rnn2_buffers = {
   .meanSub = 0,
   .stdDev = 0,
   .normFeaturesHDStdDev = 0,
-  .W = 6, //64
+  .w = 6, //64
   .normFeaturesMVW = 5, //32
-  .H1W = 3,
-  .H2W = 0,
-  .U = 6, //64
+  .mVWOut = 3,
+  .u = 6, //64
   .hiddenStateMVU = 6, //64
-  .H1U = 3,
-  .H2U = 0,
+  .mVUOut = 3,
   .mV1AddMV2 = 1, //2
   .mV2AddMV1 = 0, //1
   .mV1AddMV2Out = 0, //1
   .mV1AddMV2Demote = 0, //1
   .pC1AddBg = 0, //1
-  .Bg = 1, //2
+  .bg = 1, //2
   .pC1AddBgOut = 0, //1
   .pC1AddBgDemote = 0, //1
   .sigmoidLimit = 8192,
   .sigmoidScaleIn = 13, //8192
   .sigmoidScaleOut = 14, //16384
   .pC1AddBh = 0, //1
-  .Bh = 1, //2
+  .bh = 1, //2
   .pC1AddBhOut = 0, //1
   .pC1AddBhDemote = 0, //1
   .tanhScaleIn = 13, //8192

@@ -98,27 +96,25 @@ static Q15_FastGRNN_Buffers rnn2_buffers = {
   .meanSub = 0,
   .stdDev = 0,
   .normFeaturesHDStdDev = 0,
-  .W = 64,
+  .w = 64,
   .normFeaturesMVW = 32,
-  .H1W = 8,
-  .H2W = 0,
-  .U = 64,
+  .mVWOut = 8,
+  .u = 64,
   .hiddenStateMVU = 64,
-  .H1U = 8,
-  .H2U = 0,
+  .mVUOut = 8,
   .mV1AddMV2 = 2,
   .mV2AddMV1 = 1,
   .mV1AddMV2Out = 1,
   .mV1AddMV2Demote = 1,
   .pC1AddBg = 1,
-  .Bg = 2,
+  .bg = 2,
   .pC1AddBgOut = 1,
   .pC1AddBgDemote = 1,
   .sigmoidLimit = 8192,
   .sigmoidScaleIn = 13, //8192
   .sigmoidScaleOut = 14, //16384
   .pC1AddBh = 1,
-  .Bh = 2,
+  .bh = 2,
   .pC1AddBhOut = 1,
   .pC1AddBhDemote = 1,
   .tanhScaleIn = 13, //8192
@@ -151,7 +151,7 @@ int main(int argc, char **argv) {
     fread(&yLine[0], sizeof(float), 4 * HIDDEN_DIM2, floatResFile);
     Q15_T reshapedXLine[INPUT_CHANNELS * PATCH_DIM * PATCH_DIM];

-    for (unsigned a = 0; a < INPUT_CHANNELS; a ++) {
+    for (unsigned a = 0; a < INPUT_CHANNELS; a++) {
       for (unsigned b = 0; b < PATCH_DIM; b++) {
         for (unsigned c = 0; c < PATCH_DIM; c++) {
           reshapedXLine[b * PATCH_DIM * INPUT_CHANNELS + c * INPUT_CHANNELS + a] =
@@ -235,10 +235,10 @@ int test_q15xq7_q15_m_mulvec() {

 #ifdef SHIFT
   const Q15_T expected[8] = {-15, 8, -68, 18, -38, -30, 17, 12};
-  q15xq7_q15_m_mulvec(&qmat_A[0], &qvec_B[0], 8, 4, &pred[0], 7, 6, 2, 0);
+  q15xq7_q15_m_mulvec(&qmat_A[0], &qvec_B[0], 8, 4, &pred[0], 7, 6, 2);
 #else
   const Q15_T expected[8] = {-14, 8, -67, 18, -37, -29, 17, 12};
-  q15xq7_q15_m_mulvec(&qmat_A[0], &qvec_B[0], 8, 4, &pred[0], 128, 64, 4, 0);
+  q15xq7_q15_m_mulvec(&qmat_A[0], &qvec_B[0], 8, 4, &pred[0], 128, 64, 4);
 #endif

   return check_output_q15(pred, expected, 8);

@@ -252,10 +252,10 @@ int test_q15_m_mulvec() {

 #ifdef SHIFT
   const Q15_T expected[8] = {-426, -170, -3535, 524, -2740, 87, 52, 292};
-  q15_m_mulvec(&qmat_A[0], &qvec_B[0], 8, 4, &pred[0], 7, 6, 2, 0);
+  q15_m_mulvec(&qmat_A[0], &qvec_B[0], 8, 4, &pred[0], 7, 6, 2);
 #else
   const Q15_T expected[8] = {-425, -169, -3534, 524, -2739, 87, 52, 292};
-  q15_m_mulvec(&qmat_A[0], &qvec_B[0], 8, 4, &pred[0], 128, 64, 4, 0);
+  q15_m_mulvec(&qmat_A[0], &qvec_B[0], 8, 4, &pred[0], 128, 64, 4);
 #endif

   return check_output_q15(pred, expected, 8);

@@ -267,12 +267,12 @@ int test_q15xq7_q15_m_sparse_mulvec() {
   const Q15_T qmat_values[4] = {23, 48, 32, 1};
   const Q7_T qvec_A[3] = {1, 2, 3};
   const Q15_T expected[3] = {87, 3, 48};
-  Q15_T pred[3];
+  Q15_T pred[3] = {};

 #ifdef SHIFT
-  q15xq7_q15_m_sparse_mulvec(&qrow_indices[0], &qmat_values[0], &qvec_A[0], 3, 3, &pred[0], 0, 0, 0, 0);
+  q15xq7_q15_m_sparse_mulvec(&qrow_indices[0], &qmat_values[0], &qvec_A[0], 3, &pred[0], 0, 0, 0);
 #else
-  q15xq7_q15_m_sparse_mulvec(&qrow_indices[0], &qmat_values[0], &qvec_A[0], 3, 3, &pred[0], 1, 1, 1, 0);
+  q15xq7_q15_m_sparse_mulvec(&qrow_indices[0], &qmat_values[0], &qvec_A[0], 3, &pred[0], 1, 1, 1);
 #endif

   return check_output_q15(pred, expected, 3);

@@ -284,12 +284,12 @@ int test_q15_m_sparse_mulvec() {
   const Q15_T qmat_values[4] = {23, 48, 32, 1};
   const Q15_T qvec_A[3] = {1, 2, 3};
   const Q15_T expected[3] = {87, 3, 48};
-  Q15_T pred[3];
+  Q15_T pred[3] = {};

 #ifdef SHIFT
-  q15_m_sparse_mulvec(&qrow_indices[0], &qmat_values[0], &qvec_A[0], 3, 3, &pred[0], 0, 0, 0, 0);
+  q15_m_sparse_mulvec(&qrow_indices[0], &qmat_values[0], &qvec_A[0], 3, &pred[0], 0, 0, 0);
 #else
-  q15_m_sparse_mulvec(&qrow_indices[0], &qmat_values[0], &qvec_A[0], 3, 3, &pred[0], 1, 1, 1, 0);
+  q15_m_sparse_mulvec(&qrow_indices[0], &qmat_values[0], &qvec_A[0], 3, &pred[0], 1, 1, 1);
 #endif

   return check_output_q15(pred, expected, 3);
@@ -0,0 +1,172 @@
+# Copyright (c) Microsoft Corporation. All rights reserved.
+# Licensed under the MIT license.
+
+import os
+import torch
+import numpy as np
+import argparse
+
+parser = argparse.ArgumentParser(description='Converting RPool_Face_* model to SeeDot style')
+parser.add_argument('--model', type=str,
+                    help='path to trained model')
+parser.add_argument('--model_arch', type=str,
+                    choices=['RPool_Face_QVGA_monochrome', 'RPool_Face_M4'],
+                    help='choose architecture among RPool variants')
+args = parser.parse_args()
+
+if args.model_arch == 'RPool_Face_QVGA_monochrome':
+    from models import RPool_Face_QVGA_monochrome as module
+elif args.model_arch == 'RPool_Face_M4':
+    from models import RPool_Face_M4 as module
+
+if args.model_arch == 'RPool_Face_QVGA_monochrome':
+    save_dir_model = '../../../../tools/SeeDot/model/rnnpool/face-2/'
+    save_dir_datasets = '../../../../tools/SeeDot/datasets/rnnpool/face-2/'
+elif args.model_arch == 'RPool_Face_M4':
+    save_dir_model = '../../../../tools/SeeDot/model/rnnpool/face-4/'
+    save_dir_datasets = '../../../../tools/SeeDot/datasets/rnnpool/face-4/'
+
+if not os.path.exists(save_dir_model):
+    os.makedirs(save_dir_model)
+if not os.path.exists(save_dir_datasets):
+    os.makedirs(save_dir_datasets)
+
+net = module.build_s3fd('test', num_classes = 2)
+
+checkpoint_dict = torch.load(args.model)
+
+model_dict = {}
+net = torch.nn.DataParallel(net)
+model_dict = net.state_dict()
+
+model_dict.update(checkpoint_dict)
+net.load_state_dict(model_dict, strict = False)
+net.eval()
+
+a = np.load('trace_inputs.npy')
+a = np.squeeze(a, axis=1)
+# a = np.squeeze(a, axis=1)
+a = a[0].flatten()
+
+b = np.load('trace_outputs.npy')
+b = b[0].flatten()
+b = np.concatenate((b, a), axis=0)
+
+np.save(save_dir_datasets + 'train.npy', b.reshape(1, b.shape[0]))
+np.save(save_dir_datasets + 'test.npy', b.reshape(1, b.shape[0]))
+
+C1 = net.state_dict()['module.conv.0.weight']
+C1m = C1.permute(2, 3, 1, 0).detach().cpu().numpy().flatten()
+w = net.state_dict()['module.conv.1.weight']
+b = net.state_dict()['module.conv.1.bias']
+m = net.state_dict()['module.conv.1.running_mean']
+v = net.state_dict()['module.conv.1.running_var']
+BNW = torch.mul(torch.rsqrt(torch.add(v, 0.00001)), w)
+BNB = torch.sub(torch.mul(b, torch.reciprocal(BNW)), m)
+
+np.save(save_dir_model + 'CBR1F.npy', C1m.reshape(1, C1m.shape[0]))
+np.save(save_dir_model + 'CBR1W.npy', BNW.view(-1).unsqueeze(axis=0).detach().cpu().numpy())
+np.save(save_dir_model + 'CBR1B.npy', BNB.view(-1).unsqueeze(axis=0).detach().cpu().numpy())
+
+W1 = net.state_dict()['module.rnn_model.cell_rnn.cell.W']
+W1m = W1.permute(1, 0)
+U1 = net.state_dict()['module.rnn_model.cell_rnn.cell.U']
+U1m = U1.permute(1, 0)
+Bg1 = net.state_dict()['module.rnn_model.cell_rnn.cell.bias_gate']
+Bg1m = Bg1.permute(1, 0)
+Bh1 = net.state_dict()['module.rnn_model.cell_rnn.cell.bias_update']
+Bh1m = Bh1.permute(1, 0)
+zeta1 = net.state_dict()['module.rnn_model.cell_rnn.cell.zeta']
+nu1 = net.state_dict()['module.rnn_model.cell_rnn.cell.nu']
+
+np.save(save_dir_model + 'W1.npy', W1m.detach().cpu().numpy())
+np.save(save_dir_model + 'U1.npy', U1m.detach().cpu().numpy())
+np.save(save_dir_model + 'Bg1.npy', Bg1m.detach().cpu().numpy())
+np.save(save_dir_model + 'Bh1.npy', Bh1m.detach().cpu().numpy())
+np.save(save_dir_model + 'zeta1.npy', zeta1.detach().cpu().numpy().item())
+np.save(save_dir_model + 'nu1.npy', nu1.detach().cpu().numpy().item())
+
+W2 = net.state_dict()['module.rnn_model.cell_bidirrnn.cell.W']
+W2m = W2.permute(1, 0)
+U2 = net.state_dict()['module.rnn_model.cell_bidirrnn.cell.U']
+U2m = U2.permute(1, 0)
+Bg2 = net.state_dict()['module.rnn_model.cell_bidirrnn.cell.bias_gate']
+Bg2m = Bg2.permute(1, 0)
+Bh2 = net.state_dict()['module.rnn_model.cell_bidirrnn.cell.bias_update']
+Bh2m = Bh2.permute(1, 0)
+zeta2 = net.state_dict()['module.rnn_model.cell_bidirrnn.cell.zeta']
+nu2 = net.state_dict()['module.rnn_model.cell_bidirrnn.cell.nu']
+
+np.save(save_dir_model + 'W2.npy', W2m.detach().cpu().numpy())
+np.save(save_dir_model + 'U2.npy', U2m.detach().cpu().numpy())
+np.save(save_dir_model + 'Bg2.npy', Bg2m.detach().cpu().numpy())
+np.save(save_dir_model + 'Bh2.npy', Bh2m.detach().cpu().numpy())
+np.save(save_dir_model + 'zeta2.npy', zeta2.detach().cpu().numpy().item())
+np.save(save_dir_model + 'nu2.npy', nu2.detach().cpu().numpy().item())
+
+if args.model_arch == 'RPool_Face_QVGA_monochrome':
+    weight_idx = 14
+elif args.model_arch == 'RPool_Face_M4':
+    weight_idx = 4
+
+for j in range(weight_idx):
+    F1 = net.state_dict()['module.mob.%d.conv.0.0.weight' % j]
+    shaper = F1.shape
+    F1m = F1.reshape(1, shaper[0], shaper[1], 1, 1).permute(0, 3, 4, 2, 1)
+    w = net.state_dict()['module.mob.%d.conv.0.1.weight' % j]
+    b = net.state_dict()['module.mob.%d.conv.0.1.bias' % j]
+    m = net.state_dict()['module.mob.%d.conv.0.1.running_mean' % j]
+    v = net.state_dict()['module.mob.%d.conv.0.1.running_var' % j]
+    BN1W = torch.mul(torch.rsqrt(torch.add(v, 0.00001)), w)
+    BN1B = torch.sub(torch.mul(b, torch.reciprocal(BN1W)), m)
+
+    F2 = net.state_dict()['module.mob.%d.conv.1.0.weight' % j]
+    shaper = F2.shape
+    F2m = F2.reshape(shaper[0], 1, 1, shaper[2], shaper[3]).permute(0, 3, 4, 2, 1)
+    w = net.state_dict()['module.mob.%d.conv.1.1.weight' % j]
+    b = net.state_dict()['module.mob.%d.conv.1.1.bias' % j]
+    m = net.state_dict()['module.mob.%d.conv.1.1.running_mean' % j]
+    v = net.state_dict()['module.mob.%d.conv.1.1.running_var' % j]
+    BN2W = torch.mul(torch.rsqrt(torch.add(v, 0.00001)), w)
+    BN2B = torch.sub(torch.mul(b, torch.reciprocal(BN2W)), m)
+
+    F3 = net.state_dict()['module.mob.%d.conv.2.weight' % j]
+    shaper = F3.shape
+    F3m = F3.reshape(1, shaper[0], shaper[1], 1, 1).permute(0, 3, 4, 2, 1)
+    w = net.state_dict()['module.mob.%d.conv.3.weight' % j]
+    b = net.state_dict()['module.mob.%d.conv.3.bias' % j]
+    m = net.state_dict()['module.mob.%d.conv.3.running_mean' % j]
+    v = net.state_dict()['module.mob.%d.conv.3.running_var' % j]
+    BN3W = torch.mul(torch.rsqrt(torch.add(v, 0.00001)), w)
+    BN3B = torch.sub(torch.mul(b, torch.reciprocal(BN3W)), m)
+
+    np.save(save_dir_model + 'L%dF1.npy' % j, F1m.detach().cpu().numpy().flatten())
+    np.save(save_dir_model + 'L%dF2.npy' % j, F2m.detach().cpu().numpy().flatten())
+    np.save(save_dir_model + 'L%dF3.npy' % j, F3m.detach().cpu().numpy().flatten())
+
+    np.save(save_dir_model + 'L%dW1.npy' % j, BN1W.view(-1).unsqueeze(axis=0).detach().cpu().numpy())
+    np.save(save_dir_model + 'L%dW2.npy' % j, BN2W.view(-1).unsqueeze(axis=0).detach().cpu().numpy())
+    np.save(save_dir_model + 'L%dW3.npy' % j, BN3W.view(-1).unsqueeze(axis=0).detach().cpu().numpy())
+
+    np.save(save_dir_model + 'L%dB1.npy' % j, BN1B.view(-1).unsqueeze(axis=0).detach().cpu().numpy())
+    np.save(save_dir_model + 'L%dB2.npy' % j, BN2B.view(-1).unsqueeze(axis=0).detach().cpu().numpy())
+    np.save(save_dir_model + 'L%dB3.npy' % j, BN3B.view(-1).unsqueeze(axis=0).detach().cpu().numpy())
+
+k = 0
+for j in range(3, 6):
+    k += 1
+    N = net.state_dict()['module.L2Norm%d_3.weight' % j]
+    np.save(save_dir_model + 'normW%d.npy' % k, N.view(-1).unsqueeze(axis=0).detach().cpu().numpy())
+
+for j in range(4):
+    locw = net.state_dict()['module.loc.%d.weight' % j]
+    locwm = locw.permute(2, 3, 1, 0).detach().cpu().numpy().flatten()
+    locb = net.state_dict()['module.loc.%d.bias' % j]
+    confw = net.state_dict()['module.conf.%d.weight' % j]
+    confwm = confw.permute(2, 3, 1, 0).detach().cpu().numpy().flatten()
+    confb = net.state_dict()['module.conf.%d.bias' % j]
+
+    np.save(save_dir_model + 'loc%dw.npy' % j, locwm.reshape(1, locwm.shape[0]))
+    np.save(save_dir_model + 'loc%db.npy' % j, locb.view(-1).unsqueeze(axis=0).detach().cpu().numpy())
+    np.save(save_dir_model + 'conf%dw.npy' % j, confwm.reshape(1, confwm.shape[0]))
+    np.save(save_dir_model + 'conf%db.npy' % j, confb.view(-1).unsqueeze(axis=0).detach().cpu().numpy())
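The converter above folds each BatchNorm layer into a single multiplier `BNW` and additive offset `BNB` via `BNW = rsqrt(v + eps) * w` and `BNB = b / BNW - m`, chosen so that `BNW * (x + BNB)` equals the usual `w * (x - m) / sqrt(v + eps) + b`. A scalar sketch of that identity (assumes `w != 0` so the division by `BNW` is defined; the `eps` value mirrors the `0.00001` used above):

```python
import math

def fold_bn(w, b, m, v, eps=1e-5):
    # Mirrors the converter: BNW = rsqrt(v + eps) * w, BNB = b / BNW - m
    bnw = w / math.sqrt(v + eps)
    bnb = b / bnw - m
    return bnw, bnb

def batchnorm(x, w, b, m, v, eps=1e-5):
    # Reference inference-time BatchNorm with affine parameters w, b
    return w * (x - m) / math.sqrt(v + eps) + b
```

Expanding `bnw * (x + bnb) = bnw * x + b - bnw * m = bnw * (x - m) + b` confirms the two forms agree, which is why SeeDot can consume BatchNorm as one scale array and one bias array per layer.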
@@ -6,3 +6,6 @@ scikit-learn==0.21.2
 scipy==1.3.0
 tensorflow==1.15.4
 requests
+bokeh==2.1.1
+onnx==1.8.0
+tqdm==4.56.0
@@ -6,3 +6,6 @@ scikit-learn==0.19.2
 scipy==1.1.0
 tensorflow-gpu==1.15.4
 requests
+bokeh==2.1.1
+onnx==1.8.0
+tqdm==4.56.0
@@ -0,0 +1,6 @@
+model/
+datasets/
+arduinodump/
+m3dump/
+temp/
+seedot/Predictor/*.o
@@ -0,0 +1,114 @@
+# The default ``config.py``
+# flake8: noqa
+
+
+def set_prefs(prefs):
+    """This function is called before opening the project"""
+
+    # Specify which files and folders to ignore in the project.
+    # Changes to ignored resources are not added to the history and
+    # VCSs. Also they are not returned in `Project.get_files()`.
+    # Note that ``?`` and ``*`` match all characters but slashes.
+    # '*.pyc': matches 'test.pyc' and 'pkg/test.pyc'
+    # 'mod*.pyc': matches 'test/mod1.pyc' but not 'mod/1.pyc'
+    # '.svn': matches 'pkg/.svn' and all of its children
+    # 'build/*.o': matches 'build/lib.o' but not 'build/sub/lib.o'
+    # 'build//*.o': matches 'build/lib.o' and 'build/sub/lib.o'
+    prefs['ignored_resources'] = ['*.pyc', '*~', '.ropeproject',
+                                  '.hg', '.svn', '_svn', '.git', '.tox']
+
+    # Specifies which files should be considered python files. It is
+    # useful when you have scripts inside your project. Only files
+    # ending with ``.py`` are considered to be python files by
+    # default.
+    # prefs['python_files'] = ['*.py']
+
+    # Custom source folders: By default rope searches the project
+    # for finding source folders (folders that should be searched
+    # for finding modules). You can add paths to that list. Note
+    # that rope guesses project source folders correctly most of the
+    # time; use this if you have any problems.
+    # The folders should be relative to project root and use '/' for
+    # separating folders regardless of the platform rope is running on.
+    # 'src/my_source_folder' for instance.
+    # prefs.add('source_folders', 'src')
+
+    # You can extend python path for looking up modules
+    # prefs.add('python_path', '~/python/')
+
+    # Should rope save object information or not.
+    prefs['save_objectdb'] = True
+    prefs['compress_objectdb'] = False
+
+    # If `True`, rope analyzes each module when it is being saved.
+    prefs['automatic_soa'] = True
+    # The depth of calls to follow in static object analysis
+    prefs['soa_followed_calls'] = 0
+
+    # If `False` when running modules or unit tests "dynamic object
+    # analysis" is turned off. This makes them much faster.
+    prefs['perform_doa'] = True
+
+    # Rope can check the validity of its object DB when running.
+    prefs['validate_objectdb'] = True
+
+    # How many undos to hold?
+    prefs['max_history_items'] = 32
+
+    # Shows whether to save history across sessions.
+    prefs['save_history'] = True
+    prefs['compress_history'] = False
+
+    # Set the number spaces used for indenting. According to
+    # :PEP:`8`, it is best to use 4 spaces. Since most of rope's
+    # unit-tests use 4 spaces it is more reliable, too.
+    prefs['indent_size'] = 4
+
+    # Builtin and c-extension modules that are allowed to be imported
+    # and inspected by rope.
+    prefs['extension_modules'] = []
+
+    # Add all standard c-extensions to extension_modules list.
+    prefs['import_dynload_stdmods'] = True
+
+    # If `True` modules with syntax errors are considered to be empty.
+    # The default value is `False`; When `False` syntax errors raise
+    # `rope.base.exceptions.ModuleSyntaxError` exception.
+    prefs['ignore_syntax_errors'] = False
|
||||
|
||||
# If `True`, rope ignores unresolvable imports. Otherwise, they
|
||||
# appear in the importing namespace.
|
||||
prefs['ignore_bad_imports'] = False
|
||||
|
||||
# If `True`, rope will insert new module imports as
|
||||
# `from <package> import <module>` by default.
|
||||
prefs['prefer_module_from_imports'] = False
|
||||
|
||||
# If `True`, rope will transform a comma list of imports into
|
||||
# multiple separate import statements when organizing
|
||||
# imports.
|
||||
prefs['split_imports'] = False
|
||||
|
||||
# If `True`, rope will remove all top-level import statements and
|
||||
# reinsert them at the top of the module when making changes.
|
||||
prefs['pull_imports_to_top'] = True
|
||||
|
||||
# If `True`, rope will sort imports alphabetically by module name instead
|
||||
# of alphabetically by import statement, with from imports after normal
|
||||
# imports.
|
||||
prefs['sort_imports_alphabetically'] = False
|
||||
|
||||
# Location of implementation of
|
||||
# rope.base.oi.type_hinting.interfaces.ITypeHintingFactory In general
|
||||
# case, you don't have to change this value, unless you're an rope expert.
|
||||
# Change this value to inject you own implementations of interfaces
|
||||
# listed in module rope.base.oi.type_hinting.providers.interfaces
|
||||
# For example, you can add you own providers for Django Models, or disable
|
||||
# the search type-hinting in a class hierarchy, etc.
|
||||
prefs['type_hinting_factory'] = (
|
||||
'rope.base.oi.type_hinting.factory.default_type_hinting_factory')
|
||||
|
||||
|
||||
def project_opened(project):
|
||||
"""This function is called after opening the project"""
|
||||
# Do whatever you like here!
|
Binary file not shown.
|
@ -0,0 +1,31 @@
|
|||
{
|
||||
// Use IntelliSense to learn about possible attributes.
|
||||
// Hover to view descriptions of existing attributes.
|
||||
// For more information, visit: https://go.microsoft.com/fwlink/?linkid=830387
|
||||
"version": "0.2.0",
|
||||
"configurations": [
|
||||
|
||||
|
||||
{
|
||||
"name": "Python: SeeDot-dev.py",
|
||||
"type": "python",
|
||||
"request": "launch",
|
||||
"program": "${workspaceFolder}/SeeDot-dev.py",
|
||||
"console": "integratedTerminal",
|
||||
"args": [
|
||||
"-a",
|
||||
"fastgrnn",
|
||||
"-e",
|
||||
"fixed",
|
||||
"-d",
|
||||
"usps10",
|
||||
"-m",
|
||||
"red_disagree",
|
||||
"-t",
|
||||
"x86",
|
||||
"-n",
|
||||
"1"
|
||||
]
|
||||
}
|
||||
]
|
||||
}
|
|
@ -0,0 +1,9 @@
|
|||
{
|
||||
"files.associations": {
|
||||
"iosfwd": "cpp",
|
||||
"vector": "cpp",
|
||||
"cmath": "cpp",
|
||||
"limits": "cpp"
|
||||
},
|
||||
"python.linting.enabled": false
|
||||
}
|
|
@ -4,63 +4,108 @@ SeeDot is an automatic quantization tool that generates efficient machine learni
|
|||
|
||||
### **Overview**
|
||||
|
||||
ML models are usually expressed in floating-point, and IoT devices typically lack hardware support for floating-point arithmetic. Hence, running such ML models on IoT devices involves simulating floating-point arithmetic in software, which is very inefficient. SeeDot addresses this issue by generating fixed-point code with only integer operations. To enable this, SeeDot takes as input trained floating-point models (like [Bonsai](https://github.com/microsoft/EdgeML/blob/master/docs/publications/Bonsai.pdf) or [ProtoNN](https://github.com/microsoft/EdgeML/blob/master/docs/publications/ProtoNN.pdf)) and generates efficient fixed-point code that can run on microcontrollers. The SeeDot compiler uses novel compilation techniques like automatically inferring certain parameters used in the fixed-point code, optimized exponentiation computation, etc. With these techniques, the generated fixed-point code has comparable classification accuracy and performs significantly faster than the floating-point code.
|
||||
ML models are usually expressed in floating-point, and IoT devices typically lack hardware support for floating-point arithmetic. Hence, running such ML models on IoT devices involves simulating floating-point arithmetic in software, which is very inefficient. SeeDot addresses this issue by generating fixed-point code with only integer operations. To enable this, SeeDot takes as input trained floating-point models (like [Bonsai](https://github.com/microsoft/EdgeML/blob/master/docs/publications/Bonsai.pdf), [ProtoNN](https://github.com/microsoft/EdgeML/blob/master/docs/publications/ProtoNN.pdf), or [FastGRNN](https://github.com/microsoft/EdgeML/blob/master/docs/publications/FastGRNN.pdf)) and generates efficient fixed-point code that can run on microcontrollers. The SeeDot compiler uses novel compilation techniques such as automatically inferring certain parameters used in the fixed-point code and an optimized exponentiation computation. With these techniques, the generated fixed-point code has comparable classification accuracy and runs significantly faster than the floating-point code.
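The fixed-point idea can be illustrated with a small NumPy sketch (a simplification, not SeeDot's actual code generation): values are stored as 16-bit integers with an implicit power-of-two scale, so a dot product needs only integer arithmetic, with a wider 32-bit accumulator for intermediates.

```python
import numpy as np

def quantize(x, scale):
    # Represent x as int16 values v such that x ~= v * 2**scale.
    return np.clip(np.round(x / 2.0 ** scale), -32768, 32767).astype(np.int16)

w = np.array([0.50, -0.25, 0.125])   # hypothetical model weights
x = np.array([0.40,  0.80, -0.60])   # hypothetical input

scale = -10                          # value = integer * 2**scale
wq, xq = quantize(w, scale), quantize(x, scale)

# Integer-only dot product; the int32 accumulator mirrors the wider
# intermediates used for multiplications in the generated code.
acc = np.dot(wq.astype(np.int32), xq.astype(np.int32))
result = acc * (2.0 ** scale) ** 2   # dequantize: the two scales multiply

print(result, np.dot(w, x))          # close, up to quantization error
```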
|
||||
|
||||
To know more about SeeDot, please refer to our publication [here](https://www.microsoft.com/en-us/research/publication/compiling-kb-sized-machine-learning-models-to-constrained-hardware/).
|
||||
To know more about SeeDot, please refer to our publications [here](https://www.microsoft.com/en-us/research/publication/compiling-kb-sized-machine-learning-models-to-constrained-hardware/) and [here](https://www.microsoft.com/en-us/research/publication/shiftry-rnn-inference-in-2kb-of-ram/).
|
||||
|
||||
This document describes the tool usage with an example.
|
||||
This document describes the tool's usage with an example.
|
||||
|
||||
### **Software requirements**
|
||||
|
||||
1. [**Python 3**](https://www.python.org/) with following packages:
|
||||
- **[Antrl4](http://www.antlr.org/)** (antlr4-python3-runtime; tested with version 4.7.2)
|
||||
- **[Numpy](http://www.numpy.org/)** (tested with version 1.16.2)
|
||||
- **[Scikit-learn](https://scikit-learn.org/)** (tested with version 0.20.3)
|
||||
- **[ANTLR4](https://www.antlr.org/)** (antlr4-python3-runtime; tested with version 4.7.2)
|
||||
- **[Numpy](https://www.numpy.org/)** (tested with version 1.16.4)
|
||||
- **[Scikit-learn](https://scikit-learn.org/)** (tested with version 0.21.2)
|
||||
- **[SciPy](https://www.scipy.org/)** (tested with version 1.1.0)
|
||||
- **[Bokeh](https://bokeh.org/)** (tested with version 2.1.1)
|
||||
- **[ONNX](https://onnx.ai/)** (tested with version 1.8.0)
|
||||
- **[TQDM](https://tqdm.github.io/)** (tested with version 4.56.0)
|
||||
|
||||
All of the above packages can be installed using the following command in this directory:
|
||||
`pip3 install -r requirements.txt`
|
||||
|
||||
2. Linux packages:
|
||||
- **[gcc](https://www.gnu.org/software/gcc/)** (tested with version 7.3.0)
|
||||
- **[make](https://www.gnu.org/software/make/)** (tested with version 4.1)
|
||||
- **[gcc](https://www.gnu.org/software/gcc/)** (tested with version 9.3.0)
|
||||
- **[make](https://www.gnu.org/software/make/)** (tested with version 4.2.1)
|
||||
- **[cmake](https://cmake.org/)** (tested with version 3.16.3)
|
||||
- **[protobuf](https://developers.google.com/protocol-buffers)** (tested with version 3.6.1)
|
||||
|
||||
All of the above packages can be installed using the following command:
|
||||
`sudo apt install gcc g++ build-essential cmake protobuf-compiler libprotobuf-dev`
|
||||
|
||||
### **Usage**
|
||||
|
||||
SeeDot can be invoked using **`SeeDot.py`** file. The arguments for the script are supplied as follows:
|
||||
SeeDot can be invoked using the **`SeeDot-dev.py`** file. The arguments for the script are supplied as follows:
|
||||
|
||||
```
|
||||
usage: SeeDot.py [-h] [-a] --train --test --model [--tempdir] [-o]
|
||||
usage: SeeDot-dev.py [-h] [-a] [-e] [-d] [-m] [-n] [-dt] [-t] [-s] [-sf] [-l] [-lsf] [-tdr] [-o]
|
||||
|
||||
optional arguments:
|
||||
-h, --help show this help message and exit
|
||||
-a , --algo Algorithm to run ('bonsai' or 'protonn')
|
||||
--train Training set file
|
||||
--test Testing set file
|
||||
--model Directory containing trained model (output from
|
||||
Bonsai/ProtoNN trainer)
|
||||
--tempdir Scratch directory for intermediate files
|
||||
-o , --outdir Directory to output the generated Arduino sketch
|
||||
-h, --help Show this help message and exit
|
||||
-a, --algo Algorithm to run ['bonsai' or 'protonn' or 'fastgrnn' or 'rnnpool']
|
||||
(Default: 'fastgrnn')
|
||||
|
||||
-e, --encoding Floating-point ['float'] or Fixed-point ['fixed']
|
||||
(Default: 'fixed')
|
||||
|
||||
-d, --dataset Dataset to use
|
||||
(Default: 'usps10')
|
||||
|
||||
-m, --metric Select the metric used to measure the correctness of an inference while searching for the
|
||||
best quantization of variables.
|
||||
1) Accuracy ('acc'): The accuracy of prediction will be used as a metric for
|
||||
correctness. (A maximising metric).
|
||||
|
||||
2) Disagreement Count ('disagree'): The correctness will be measured against the
|
||||
floating-point code's output. (A minimising metric).
|
||||
3) Reduced Disagreement Count
|
||||
('red_disagree'): The correctness will be measured against the
|
||||
floating-point code's output only when the output matches the correct label. (A minimising metric).
|
||||
(Default: 'red_disagree')
|
||||
|
||||
-n, --numOutputs Number of outputs (e.g., classification problems have only 1 output, i.e., the class label)
|
||||
(Default: 1)
|
||||
|
||||
-dt, --datasetType Dataset type being used ['training', 'testing']
|
||||
(Default: 'testing')
|
||||
|
||||
-t, --target Target device ['x86', 'arduino', 'm3']
|
||||
(Default: 'x86')
|
||||
|
||||
-s, --source Model source type ['seedot', 'onnx', 'tf']
|
||||
(Default: 'seedot')
|
||||
|
||||
-sf, --max-scale-factor Max scaling factor for code generation (if not specified, it is inferred from the data)
|
||||
|
||||
-l, --log Logging level (in increasing order) ['error', 'critical', 'warning', 'info', 'debug']
|
||||
(Default: 'error')
|
||||
|
||||
-tdr, --tempdir Scratch directory for intermediate files
|
||||
(Default: 'temp/')
|
||||
|
||||
-o, --outdir Directory to output the generated target device sketch
|
||||
(Default: 'arduinodump/' for Arduino, 'temp/' for x86, and 'm3dump/' for M3)
|
||||
```
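The three `-m` metric options above can be sketched as follows (a simplified illustration with made-up predictions; the real computation lives inside the compiler driver):

```python
import numpy as np

labels      = np.array([0, 1, 2, 1, 0])   # ground-truth class labels
float_preds = np.array([0, 1, 2, 0, 1])   # floating-point code's output
fixed_preds = np.array([0, 1, 1, 0, 0])   # candidate fixed-point code's output

# 'acc': fraction of fixed-point predictions matching the label (maximising).
acc = np.mean(fixed_preds == labels)

# 'disagree': points where fixed-point disagrees with floating-point (minimising).
disagree = np.sum(fixed_preds != float_preds)

# 'red_disagree': disagreements counted only where the floating-point
# output matched the correct label (minimising).
red_disagree = np.sum((fixed_preds != float_preds) & (float_preds == labels))

print(acc, disagree, red_disagree)
```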
|
||||
|
||||
An example invocation is as follows:
|
||||
```
|
||||
python SeeDot.py -a bonsai --train path/to/train.npy --test path/to/test.npy --model path/to/Bonsai/model
|
||||
python SeeDot-dev.py -a fastgrnn -e fixed -d usps10 -n 1 -t arduino -m red_disagree -l info
|
||||
```
|
||||
|
||||
SeeDot expects the `train` and the `test` data files in a specific format. Each data file should be of the shape `[numberOfDataPoints, numberOfFeatures + 1]`, where the class label is in the first column. The tool currently support the following file formats for the data files: numpy arrays (.npy), tab-separated values (.tsv), comma-separated values (.csv), and libsvm (.txt).
|
||||
SeeDot expects the `train` and the `test` data files in a specific format. Each data file should be of the shape `[numberOfDataPoints, n + numberOfFeatures]`, where the ground truth/output is in the first `n` columns. The tool currently supports numpy arrays (.npy) for the data files.
|
||||
The data files must be present in the directory `datasets/<algo>/<dataset>`.
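For instance, a compliant data file for a single-output (`n = 1`) classification task could be produced like this (the sizes and file name are illustrative):

```python
import numpy as np

num_points, num_features, n = 100, 16, 1

rng = np.random.default_rng(0)
labels = rng.integers(0, 10, size=(num_points, n))          # first n columns: ground truth
features = rng.standard_normal((num_points, num_features))  # remaining columns: features

data = np.hstack([labels.astype(np.float64), features])
assert data.shape == (num_points, n + num_features)

# Place such files under datasets/<algo>/<dataset>/ as train.npy and test.npy.
np.save("train.npy", data)
```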
|
||||
|
||||
The path to the trained Bonsai/ProtoNN model is specified in the `--model` argument. After training, the learned parameters are stored in this directory in a specific format. For Bonsai, the learned parameters are `Z`, `W`, `V`, `T`, `Sigma`, `Mean`, and `Std`. For ProtoNN, the learned parameters are `W`, `B`, and `Z`. These parameters can be either numpy arrays (.npy) or plaintext files.
|
||||
After training, the learned parameters are stored in this directory in a specific format. For FastGRNN, the learned parameters are `W`, `U`, `Bg`, `Bh`, `FC`, `FCBias`, `zeta` and `nu`. These parameters are numpy arrays (.npy). The model files must be present in the directory `model/<algo>/<dataset>`.
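As a rough sketch of how the listed FastGRNN parameters enter inference, following the update equations of the FastGRNN paper (all shapes are illustrative random stand-ins, not a trained model; in the EdgeML trainer, `zeta` and `nu` are learned scalars whose stored form may differ):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Illustrative dimensions: inputDim=16, hiddenDim=32, 10 classes.
inputDim, hiddenDim, numClasses = 16, 32, 10
rng = np.random.default_rng(0)
W = rng.standard_normal((hiddenDim, inputDim)) * 0.1   # input-to-hidden
U = rng.standard_normal((hiddenDim, hiddenDim)) * 0.1  # hidden-to-hidden
Bg = np.zeros(hiddenDim)                               # gate bias
Bh = np.zeros(hiddenDim)                               # update bias
zeta, nu = 1.0, 1e-4                                   # gating scalars
FC = rng.standard_normal((numClasses, hiddenDim)) * 0.1
FCBias = np.zeros(numClasses)

h = np.zeros(hiddenDim)
for x_t in rng.standard_normal((8, inputDim)):  # a sequence of 8 timesteps
    pre = W @ x_t + U @ h
    z = sigmoid(pre + Bg)                           # update gate
    h_tilde = np.tanh(pre + Bh)                     # candidate state
    h = (zeta * (1.0 - z) + nu) * h_tilde + z * h   # FastGRNN update rule

logits = FC @ h + FCBias
predicted_class = int(np.argmax(logits))
```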
|
||||
|
||||
The `tempdir` directory is used to store the intermediate files generated by the compiler. The device-specific fixed-point code is stored in the `outdir` directory.
|
||||
The compiler output is present in the `temp` directory for x86, the `arduinodump` directory for arduino, and the `m3dump` directory for m3.
|
||||
|
||||
## Getting started: Quantizing FastGRNN on usps10
|
||||
|
||||
## Getting started: Quantizing ProtoNN on usps10
|
||||
|
||||
To help get started with SeeDot, we provide 1) a pre-loaded fixed-point model, and 2) instructions to generate fixed-point code for the ProtoNN predictor on the **[usps10](https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/)** dataset. The process for generating fixed-point code for the Bonsai predictor is similar.
|
||||
|
||||
### Pre-loaded model
|
||||
|
||||
To make it easy to test the SeeDot-generated code, a ready-to-upload Arduino sketch is provided that can be run on an Arduino device without any changes. The sketch is located at `Tools/SeeDot/seedot/arduino/arduino.ino` and contains pre-loaded ProtoNN model on the usps10 dataset. To upload the sketch to the device, skip steps 1-3 in the below guide and follow the [step 4: Prediction on the device](https://github.com/microsoft/EdgeML/tree/Feature/SeeDot/Tools/SeeDot#step-4-prediction-on-the-device).
|
||||
To help get started with SeeDot, we provide 1) a pre-loaded fixed-point model, and 2) instructions to generate fixed-point code for the FastGRNN predictor on the **[usps10](https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/)** dataset. The process for generating fixed-point code for the Bonsai or ProtoNN predictor is similar.
|
||||
|
||||
### Generating fixed-point code
|
||||
|
||||
This process consists of four steps: 1) installing EdgeML TensorFlow library, 2) training ProtoNN on usps10, 3) quantizing the trained model with SeeDot, and 4) performing prediction on the device.
|
||||
This process consists of four steps: 1) installing the EdgeML TensorFlow library, 2) training FastGRNN on usps10, 3) quantizing the trained model with SeeDot, and 4) performing prediction on the device.
|
||||
|
||||
#### **Step 1: Installing EdgeML TensorFlow library**
|
||||
|
||||
|
@ -76,57 +121,64 @@ This process consists of four steps: 1) installing EdgeML TensorFlow library, 2)
|
|||
pip install -e .
|
||||
```
|
||||
|
||||
#### **Step 2: Training ProtoNN on usps10**
|
||||
#### **Step 2: Training FastGRNN on usps10**
|
||||
|
||||
1. Navigate to the ProtoNN examples directory.
|
||||
1. Navigate to the FastGRNN examples directory.
|
||||
```
|
||||
cd examples/ProtoNN
|
||||
cd ../examples/tf/FastCells
|
||||
```
|
||||
|
||||
2. Fetch usps10 data and create output directory.
|
||||
2. Fetch and process usps10 data and create an output directory.
|
||||
```
|
||||
python fetch_usps.py
|
||||
python process_usps.py
|
||||
mkdir usps10/output
|
||||
```
|
||||
```
|
||||
|
||||
3. Invoke ProtoNN trainer using the following command.
|
||||
3. Invoke FastGRNN trainer using the following command.
|
||||
```
|
||||
python protoNN_example.py --data-dir ./usps10 --projection-dim 25 --num-prototypes 55 --epochs 100 -sW 0.3 -o usps10/output
|
||||
python fastcell_example.py --data-dir ./usps10 --input-dim 16 --hidden-dim 32
|
||||
```
|
||||
This would give around 90.035% classification accuracy. The trained model is stored in the `output` directory.
|
||||
This would give around 93-95% classification accuracy. The trained model is stored in the `usps10/FastGRNNResults/<timestamp>` directory.
|
||||
|
||||
More information on using the ProtoNN trainer can be found [here](https://github.com/Microsoft/EdgeML/tree/master/tf/examples/ProtoNN).
|
||||
More information on using the FastGRNN trainer can be found [here](https://github.com/microsoft/EdgeML/tree/master/examples/tf/FastCells).
|
||||
|
||||
#### **Step 3: Quantizing with SeeDot**
|
||||
|
||||
1. Navigate to the SeeDot directory and create the output directory.
|
||||
1. Copy the dataset and model files into the correct directory.
|
||||
```
|
||||
cd ../../../tools/SeeDot/
|
||||
mkdir -p datasets/fastgrnn/usps10
|
||||
mkdir -p model/fastgrnn/usps10
|
||||
cp ../../examples/tf/FastCells/usps10/*.npy ./datasets/fastgrnn/usps10/
|
||||
cp ../../examples/tf/FastCells/usps10/FastGRNNResults/<timestamp>/* ./model/fastgrnn/usps10/
|
||||
```
|
||||
2. Copy the example code for FastGRNN in the SeeDot language:
|
||||
```
|
||||
cp seedot/compiler/input/fastgrnn.sd model/fastgrnn/usps10/input.sd
|
||||
```
|
||||
|
||||
3. Invoke SeeDot compiler using the following command.
|
||||
```
|
||||
cd ../../../Tools/SeeDot
|
||||
mkdir arduino
|
||||
python SeeDot-dev.py -a fastgrnn -e fixed -t arduino -m red_disagree -n 1 -d usps10
|
||||
```
|
||||
|
||||
2. Invoke SeeDot using the following command.
|
||||
```
|
||||
python SeeDot.py -a protonn --train ../../tf/examples/ProtoNN/usps10/train.npy --test ../../tf/examples/ProtoNN/usps10/test.npy --model ../../tf/examples/ProtoNN/usps10/output -o arduino
|
||||
```
|
||||
|
||||
The SeeDot-generated code would give around 89.985% classification accuracy. The difference in classification accuracy is 0.05% compared to the floating-point code. The generated code is stored in the `arduino` folder which contains the sketch along with two files: model.h and predict.cpp. `model.h` contains the quantized model and `predict.cpp` contains the inference code.
|
||||
The SeeDot-generated code would give around 92-95% classification accuracy. The difference in classification accuracy is 0-3% compared to the floating-point code. The generated code is stored in the `arduinodump` folder, which contains the sketch along with two files: `model.h` and `predict.cpp`. `model.h` contains the quantized model and `predict.cpp` contains the inference code.
|
||||
|
||||
#### **Step 4: Prediction on the device**
|
||||
|
||||
Follow the steps below to perform prediction on the device, where the SeeDot-generated code is run on a single data point stored in the device's flash memory.
|
||||
|
||||
1. Open the Arduino sketch file located at `arduino/arduino.ino` in the [Arduino IDE](https://www.arduino.cc/en/main/software).
|
||||
2. Connect the Arduino microcontroller to the computer and choose the correct board configuration.
|
||||
3. Upload the sketch to the device.
|
||||
4. Open the Serial Monitor and select baud rate specified in the sketch (default is 115200) to monitor the output.
|
||||
5. The average prediction time is computed every 100 iterations. On an Arduino Uno, the average prediction time is 35991 micro seconds.
|
||||
1. The model files are generated within `arduinodump/arduino/16/fastgrnn/usps10`. Copy all the files to `arduinodump/arduino`.
|
||||
2. Open the Arduino sketch file located at `arduinodump/arduino/arduino.ino` in the [Arduino IDE](https://www.arduino.cc/en/main/software).
|
||||
3. Connect the Arduino microcontroller to the computer and choose the correct board configuration.
|
||||
4. Upload the sketch to the device.
|
||||
5. Open the Serial Monitor and select the baud rate specified in the sketch (default is 115200) to monitor the output.
|
||||
6. The average prediction time is computed for every iteration. On an Arduino Uno, the average prediction time is around 280000 microseconds.
|
||||
|
||||
More device-specific details on extending the Arduino sketch for other use cases can be found in [`arduino/README.md`](https://github.com/microsoft/EdgeML/blob/Feature/SeeDot/Tools/SeeDot/seedot/arduino/README.md).
|
||||
|
||||
|
||||
The above workflow has been tested on Arduino Uno and Arduino MKR1000. It is expected to work on other Arduino devices as well.
|
||||
The above workflow has been tested on Arduino Uno. It is expected to work on other Arduino devices as well.
|
||||
|
||||
|
||||
Copyright (c) Microsoft Corporation. All rights reserved. Licensed under the MIT license.
|
||||
|
|
|
@ -0,0 +1,269 @@
|
|||
# Copyright (c) Microsoft Corporation. All rights reserved.
|
||||
# Licensed under the MIT license.
|
||||
|
||||
import argparse
|
||||
import csv
|
||||
import datetime
|
||||
from itertools import product
|
||||
import json
|
||||
import numpy as np
|
||||
import os
|
||||
import shutil
|
||||
import tempfile
|
||||
|
||||
import seedot.config as config
|
||||
import seedot.main as main
|
||||
import seedot.predictor as predictor
|
||||
import seedot.util as util
|
||||
import logging
|
||||
import seedot.compiler.converter.converter as converter
|
||||
|
||||
'''
|
||||
This is the file which is invoked to run the compiler (Refer to README.md).
|
||||
|
||||
Sanity checks are carried out and the main compiler arguments are taken from the user
|
||||
which are then used to invoke the main compiler code, 'main.py'.
|
||||
|
||||
Note that there are 3 different ways to change compiler arguments:
|
||||
1) the arguments used by the user to invoke the compiler
|
||||
2) seedot/config.py
|
||||
3) seedot/util.py
|
||||
Different parameters are controlled in different files; refer to each of them to
|
||||
find out how to change one parameter.
|
||||
'''
|
||||
|
||||
|
||||
class Dataset:
|
||||
common = ["cifar-binary", "cr-binary", "cr-multiclass", "curet-multiclass",
|
||||
"letter-multiclass", "mnist-binary", "mnist-multiclass",
|
||||
"usps-binary", "usps-multiclass", "ward-binary"]
|
||||
extra = ["cifar-multiclass", "dsa", "eye-binary", "farm-beats",
|
||||
"interactive-cane", "spectakoms", "usps10", "whale-binary",
|
||||
"HAR-2", "HAR-6", "MNIST-10", "Google-12", "Google-30", "Wakeword-2",
|
||||
"wider-regression", "wider-mbconv", "face-1", "face-2", "face-2-rewrite",
|
||||
"face-3", "face-4", "test"]
|
||||
# Datasets for ProtoNN and Bonsai.
|
||||
default = ["usps10"]
|
||||
# Datasets for FastGRNN.
|
||||
# default = ["spectakoms", "usps10", "HAR-2", "HAR-6", "dsa", "MNIST-10", "Google-12", "Google-30", "Wakeword-2"]
|
||||
all = common + extra
|
||||
|
||||
datasetDir = os.path.join("..", "datasets", "datasets")
|
||||
modelDir = os.path.join("..", "model")
|
||||
|
||||
datasetProcessedDir = os.path.join("datasets")
|
||||
modelProcessedDir = os.path.join("model")
|
||||
|
||||
|
||||
class MainDriver:
|
||||
|
||||
def parseArgs(self):
|
||||
parser = argparse.ArgumentParser()
|
||||
|
||||
parser.add_argument("-a", "--algo", choices=config.Algo.all,
|
||||
default=config.Algo.default, metavar='', help="Algorithm to run ['bonsai' or 'protonn' or 'fastgrnn' or 'rnnpool'] \
|
||||
(Default: 'fastgrnn')")
|
||||
parser.add_argument("-e", "--encoding", choices=config.Encoding.all,
|
||||
default=config.Encoding.default, metavar='', help="Floating-point ['float'] or Fixed-point ['fixed'] \
|
||||
(Default: 'fixed')")
|
||||
parser.add_argument("-d", "--dataset", choices=Dataset.all,
|
||||
default=Dataset.default, metavar='', help="Dataset to use\
|
||||
(Default: 'usps10')")
|
||||
parser.add_argument("-m", "--metric", choices=config.Metric.all, metavar='',
|
||||
help="Select the metric that will be used to measure the correctness of an inference, to obtain the \
|
||||
best quantization of variables. \
|
||||
['acc', 'disagree', 'red_disagree'] (Default: 'red_disagree')", default=config.Metric.default)
|
||||
parser.add_argument("-n", "--numOutputs", type=int, metavar='',
|
||||
help="Number of outputs (e.g., classification problems have only 1 output, i.e., the class label)\
|
||||
(Default: 1)",default=1)
|
||||
parser.add_argument("-dt", "--datasetType", choices=config.DatasetType.all,
|
||||
default=config.DatasetType.default, metavar='', help="Dataset type being used ['training', 'testing']\
|
||||
(Default: 'testing')")
|
||||
parser.add_argument("-t", "--target", choices=config.Target.all,
|
||||
default=config.Target.default, metavar='', help="Target device ['x86', 'arduino', 'm3'] \
|
||||
(Default: 'x86')")
|
||||
parser.add_argument("-s", "--source", metavar='', choices=config.Source.all,
|
||||
default=config.Source.default, help="Model source type ['seedot', 'onnx', 'tf']\
|
||||
(Default: 'seedot')")
|
||||
parser.add_argument("-sf", "--max-scale-factor", type=int,
|
||||
metavar='', help="Use the old max-scale mechanism of SeeDot's PLDI’19 paper to determine the scales (If not specified then it will be inferred from data)")
|
||||
parser.add_argument("-l", "--log", choices=config.Log.all,
|
||||
default=config.Log.default, metavar='', help="Logging level (in increasing order)\
|
||||
['error', 'critical', 'warning', 'info', 'debug'] (Default: 'error')")
|
||||
parser.add_argument("-lsf", "--load-sf", action="store_true",
|
||||
help=argparse.SUPPRESS)
|
||||
parser.add_argument("-tdr", "--tempdir", metavar='',
|
||||
help="Scratch directory for intermediate files\
|
||||
(Default: 'temp/')")
|
||||
parser.add_argument("-o", "--outdir", metavar='',
|
||||
help="Directory to output the generated target device sketch\
|
||||
(Default: 'arduinodump/' for Arduino, 'temp/' for x86, and 'm3dump/' for M3)")
|
||||
|
||||
self.args = parser.parse_args()
|
||||
|
||||
if not isinstance(self.args.algo, list):
|
||||
self.args.algo = [self.args.algo]
|
||||
if not isinstance(self.args.encoding, list):
|
||||
self.args.encoding = [self.args.encoding]
|
||||
if not isinstance(self.args.dataset, list):
|
||||
self.args.dataset = [self.args.dataset]
|
||||
if not isinstance(self.args.datasetType, list):
|
||||
self.args.datasetType = [self.args.datasetType]
|
||||
if not isinstance(self.args.target, list):
|
||||
self.args.target = [self.args.target]
|
||||
if not isinstance(self.args.metric, list):
|
||||
self.args.metric = [self.args.metric]
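The list-wrapping above exists so that the driver can later iterate over every combination of arguments uniformly via `itertools.product` (a minimal sketch of that pattern, with hypothetical argument values):

```python
from itertools import product

# Hypothetical argument lists, mirroring self.args after the wrapping above.
algos = ["fastgrnn"]
encodings = ["fixed", "float"]
datasets = ["usps10"]

combos = list(product(algos, encodings, datasets, ["x86"], ["red_disagree"], [16]))
for algo, encoding, dataset, target, metric, wordLength in combos:
    print(algo, encoding, dataset, target, metric, wordLength)
```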
|
||||
|
||||
if self.args.tempdir is not None:
|
||||
assert os.path.isdir(
|
||||
self.args.tempdir), "Scratch directory doesn't exist"
|
||||
config.tempdir = self.args.tempdir
|
||||
else:
|
||||
config.tempdir = "temp"
|
||||
if os.path.exists(config.tempdir):
|
||||
shutil.rmtree(config.tempdir)
|
||||
os.makedirs(config.tempdir)
|
||||
|
||||
if self.args.outdir is not None:
|
||||
assert os.path.isdir(
|
||||
self.args.outdir), "Output directory doesn't exist"
|
||||
config.outdir = self.args.outdir
|
||||
else:
|
||||
if self.args.target == [config.Target.arduino]:
|
||||
config.outdir = os.path.join("arduinodump", "arduino")
|
||||
elif self.args.target == [config.Target.m3]:
|
||||
config.outdir = os.path.join("m3dump")
|
||||
else:
|
||||
config.outdir = os.path.join(config.tempdir, "arduino")
|
||||
os.makedirs(config.outdir, exist_ok=True)
|
||||
|
||||
def checkMSBuildPath(self):
|
||||
found = False
|
||||
for path in config.msbuildPathOptions:
|
||||
if os.path.isfile(path):
|
||||
found = True
|
||||
config.msbuildPath = path
|
||||
|
||||
if not found:
|
||||
raise Exception("Msbuild.exe not found at the following locations:\n%s\nPlease change the path and run again" % (
|
||||
config.msbuildPathOptions))
|
||||
|
||||
def setGlobalFlags(self):
|
||||
np.seterr(all='warn')
|
||||
|
||||
def setLogLevel(self):
|
||||
logging.basicConfig(level=os.environ.get("LOGLEVEL", self.args.log.upper()))
|
||||
|
||||
def run(self):
|
||||
self.setLogLevel()
|
||||
|
||||
if util.windows():
|
||||
self.checkMSBuildPath()
|
||||
|
||||
self.setGlobalFlags()
|
||||
self.runMainDriver()
|
||||
|
||||
def runMainDriver(self):
|
||||
legacy_scales = self.loadScalesFile()
|
||||
|
||||
for combination in product(self.args.algo, self.args.encoding, self.args.dataset, self.args.target, self.args.metric, [16]):
|
||||
algo, encoding, dataset, target, metric, wordLength = combination
|
||||
|
||||
print("\n========================================")
|
||||
print("Executing on %s %s %s %s" %
|
||||
(algo, encoding, dataset, target))
|
||||
print("========================================\n")
|
||||
|
||||
datasetDir = os.path.join(
|
||||
Dataset.datasetProcessedDir, algo, dataset)
|
||||
modelDir = os.path.join(
|
||||
Dataset.modelProcessedDir, algo, dataset)
|
||||
|
||||
source_update = ""
|
||||
if self.args.source == config.Source.onnx:
|
||||
source_update = "_onnx"
|
||||
|
||||
trainingInput = os.path.join(datasetDir, "train" + source_update + ".npy")
|
||||
testingInput = os.path.join(datasetDir, "test" + source_update + ".npy")
|
||||
|
||||
try:
|
||||
# The following is particularly for old SeeDot (PLDI '19).
|
||||
# In the new version of SeeDot (named Shiftry, OOPSLA '20), config.wordLength is ALWAYS expected to be 16, which is the base bit-width.
|
||||
# Some variables are demoted to 8 bits, and intermediate variables for multiplication may use 32 bits.
|
||||
if encoding == config.Encoding.floatt:
|
||||
bitwidth = 'float'
|
||||
elif config.wordLength == 8:
|
||||
bitwidth = 'int8'
|
||||
elif config.wordLength == 16:
|
||||
bitwidth = 'int16'
|
||||
elif config.wordLength == 32:
|
||||
bitwidth = 'int32'
|
||||
else:
|
||||
assert False
|
||||
|
||||
curr = legacy_scales[algo][bitwidth][dataset]
|
||||
|
||||
expectedAcc = curr['accuracy']
|
||||
if encoding == config.Encoding.fixed:
|
||||
bestScale = curr['scaleFactor']
|
||||
else:
|
||||
bestScale = legacy_scales[algo]['int16'][dataset]['scaleFactor']
|
||||
|
||||
except Exception as _:
|
||||
assert not self.args.load_sf
|
||||
expectedAcc = 0
|
||||
|
||||
if self.args.load_sf:
|
||||
sf = bestScale
|
||||
else:
|
||||
sf = self.args.max_scale_factor
|
||||
|
||||
numOutputs = self.args.numOutputs
|
||||
|
||||
obj = main.Main(algo, encoding, target, trainingInput,
|
||||
testingInput, modelDir, sf, metric, dataset, numOutputs, self.args.source)
|
||||
obj.run()
|
||||
|
||||
acc = obj.testingAccuracy
|
||||
|
||||
if self.args.load_sf:
|
||||
if acc != expectedAcc:
|
||||
print("FAIL: Expected accuracy %f%%" % (expectedAcc))
|
||||
elif encoding == config.Encoding.fixed and obj.sf != bestScale:
|
||||
print("FAIL: Expected best scale %d" % (bestScale))
|
||||
else:
|
||||
print("PASS")
|
||||
|
||||
    def loadScalesFile(self):
        scales = {}
        # legacy_scales.csv contains some benchmark results for old SeeDot (PLDI '19).
        # The CSV file can be updated with better accuracy numbers if a better model is obtained.
        with open(os.path.join("legacy_scales.csv")) as csvFile:
            reader = csv.reader(csvFile)
            for row in reader:
                algo, bitwidth, dataset = row[0], row[1], row[2]

                if algo not in scales:
                    scales[algo] = {}
                if bitwidth not in scales[algo]:
                    scales[algo][bitwidth] = {}
                if dataset not in scales[algo][bitwidth]:
                    scales[algo][bitwidth][dataset] = {}

                accuracy, scaleFactor = row[3], row[4]

                if not accuracy:
                    accuracy = 100
                if not scaleFactor:
                    scaleFactor = 9999

                scales[algo][bitwidth][dataset] = {"accuracy": float(accuracy), "scaleFactor": int(scaleFactor)}

        return scales
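`loadScalesFile` above expects `legacy_scales.csv` rows of the form `algo,bitwidth,dataset,accuracy,scaleFactor`, with the last two fields optional. A minimal sketch of the same parsing logic, using illustrative (not real) benchmark values:

```python
import csv
import io

# Illustrative rows (NOT real benchmark numbers): algo, bitwidth, dataset, accuracy, scaleFactor.
sample = "bonsai,int16,usps10,94.2,-6\nprotonn,int16,usps10,,\n"

scales = {}
for row in csv.reader(io.StringIO(sample)):
    algo, bitwidth, dataset, accuracy, scaleFactor = row
    scales.setdefault(algo, {}).setdefault(bitwidth, {})[dataset] = {
        # Empty fields fall back to the same sentinel values loadScalesFile uses.
        "accuracy": float(accuracy) if accuracy else 100.0,
        "scaleFactor": int(scaleFactor) if scaleFactor else 9999,
    }

print(scales["protonn"]["int16"]["usps10"])  # {'accuracy': 100.0, 'scaleFactor': 9999}
```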


if __name__ == "__main__":
    obj = MainDriver()
    obj.parseArgs()
    obj.run()

@ -1,94 +0,0 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT license.

import argparse
import datetime
from distutils.dir_util import copy_tree
import os
import shutil
import operator
import tempfile
import traceback

import seedot.common as Common
from seedot.main import Main
import seedot.util as Util


class MainDriver:

    def parseArgs(self):
        parser = argparse.ArgumentParser()

        parser.add_argument("-a", "--algo", choices=Common.Algo.All,
                            metavar='', help="Algorithm to run ('bonsai' or 'protonn')")
        parser.add_argument("--train", required=True,
                            metavar='', help="Training set file")
        parser.add_argument("--test", required=True,
                            metavar='', help="Testing set file")
        parser.add_argument("--model", required=True, metavar='',
                            help="Directory containing trained model (output from Bonsai/ProtoNN trainer)")
        #parser.add_argument("-v", "--version", default=Common.Version.Fixed, choices=Common.Version.All, metavar='',
        #                    help="Datatype of the generated code (fixed-point or floating-point)")
        parser.add_argument("--tempdir", metavar='',
                            help="Scratch directory for intermediate files")
        parser.add_argument("-o", "--outdir", metavar='',
                            help="Directory to output the generated Arduino sketch")

        self.args = parser.parse_args()

        # Verify that the input files and directory exist.
        assert os.path.isfile(self.args.train), "Training set doesn't exist"
        assert os.path.isfile(self.args.test), "Testing set doesn't exist"
        assert os.path.isdir(self.args.model), "Model directory doesn't exist"

        if self.args.tempdir is not None:
            assert os.path.isdir(
                self.args.tempdir), "Scratch directory doesn't exist"
            Common.tempdir = self.args.tempdir
        else:
            Common.tempdir = os.path.join(tempfile.gettempdir(
            ), "SeeDot", datetime.datetime.now().strftime('%Y-%m-%d_%H-%M-%S'))
            os.makedirs(Common.tempdir, exist_ok=True)

        if self.args.outdir is not None:
            assert os.path.isdir(
                self.args.outdir), "Output directory doesn't exist"
            Common.outdir = self.args.outdir
        else:
            Common.outdir = os.path.join(Common.tempdir, "arduino")
            os.makedirs(Common.outdir, exist_ok=True)

    def checkMSBuildPath(self):
        found = False
        for path in Common.msbuildPathOptions:
            if os.path.isfile(path):
                found = True
                Common.msbuildPath = path

        if not found:
            raise Exception("Msbuild.exe not found at the following locations:\n%s\nPlease change the path and run again" % (
                Common.msbuildPathOptions))

    def run(self):
        if Util.windows():
            self.checkMSBuildPath()

        algo, version, trainingInput, testingInput, modelDir = self.args.algo, Common.Version.Fixed, self.args.train, self.args.test, self.args.model

        print("\n================================")
        print("Executing on %s for Arduino" % (algo))
        print("--------------------------------")
        print("Train file: %s" % (trainingInput))
        print("Test file: %s" % (testingInput))
        print("Model directory: %s" % (modelDir))
        print("================================\n")

        obj = Main(algo, version, Common.Target.Arduino,
                   trainingInput, testingInput, modelDir, None)
        obj.run()


if __name__ == "__main__":
    obj = MainDriver()
    obj.parseArgs()
    obj.run()

@ -0,0 +1,127 @@
# SeeDot Architecture

This document describes the overall architecture of the SeeDot quantization tool.

SeeDot is run by executing the `SeeDot-dev.py` Python script.

Running SeeDot with the default arguments, i.e., the call
```
python SeeDot-dev.py
```
is equivalent to running the [FastGRNN](https://github.com/microsoft/EdgeML/blob/master/docs/publications/FastGRNN.pdf) algorithm with `fixed-point` encoding on the `usps10` dataset for the `x86` target device; i.e., the call
```
python SeeDot-dev.py -a fastgrnn -e fixed -d usps10 -n 1 -t x86 -m red_disagree -l error
```

## Walkthrough

In this text, we discuss the execution of SeeDot for the `FastGRNN` algorithm on the `usps10` dataset.
We discuss the execution for both floating-point and fixed-point encoding, with `reduced disagreements` as the
correctness metric.

A diagrammatic representation of SeeDot's architecture:
![SeeDot's Architecture](https://github.com/microsoft/EdgeML/blob/master/tools/SeeDot/architecture.svg)

### Model and Datasets

For the run to take place, the model should be placed in the `model/fastgrnn/usps10/` directory, and the datasets
should be placed in the `datasets/fastgrnn/usps10/` directory.
The steps in the walkthrough assume that both of these have been done.
See [README.md](https://github.com/microsoft/EdgeML/blob/master/tools/SeeDot/README.md) for instructions on obtaining these.
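Concretely, the paths above imply a layout like the following (the `tools/SeeDot/` root and the `train.npy`/`test.npy` names follow the driver code elsewhere in this repository; `input.sd` is the model's SeeDot source read by the compiler):

```
tools/SeeDot/
├── model/fastgrnn/usps10/        # trained model weights, including input.sd
└── datasets/fastgrnn/usps10/
    ├── train.npy
    └── test.npy
```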

### Floating-point Encoding

The floating-point SeeDot for FastGRNN is run using the following command:
```
python SeeDot-dev.py -a fastgrnn -e float -d usps10 -n 1 -t x86 -m red_disagree
```

On running this command,

1. The `main` function in `SeeDot-dev.py` initiates the `MainDriver` class in `SeeDot-dev.py`, which in turn initiates the `Main` class in `seedot/main.py`.

2. In `seedot/main.py`, depending on the encoding, the `runForFloat` or `runForFixed` function is called. (In this case, `runForFloat` will be called.)

3. In `runForFloat`, first the `Converter` class in `seedot/compiler/converter/converter.py` is instantiated with the arguments `(encoding=float, datasetType=Testing, target=x86)`.

    (The arguments here are for representation purposes and to explain the differences between the multiple `Converter` instances.)

4. In the `Converter` class, the SeeDot input files (all the model weights, and the train and test datasets) are processed into the format required by the SeeDot compiler.

5. In `seedot/compiler/converter/converter.py`, the quantization code is called, which produces files used by the SeeDot compiler.
    Depending on the encoding, the `QuantizerFloat` or `QuantizerFixed` class is instantiated for `float` or `fixed` encoding respectively.
    These classes are defined in `seedot/compiler/converter/quantizer.py`. (In our case, `QuantizerFloat`.)

6. The `QuantizerFloat` class uses the `ParamsBuilder` class from `seedot/compiler/converter/paramsBuilder.py` to quantize the parameters.

7. After the `Converter` class finishes execution, control returns to `seedot/main.py:runForFloat`.

8. Now, the `Compiler` class in `seedot/compiler/compiler.py` is instantiated.

9. The `Compiler` class reads `model/fastgrnn/usps10/input.sd` and creates an **Abstract Syntax Tree (AST)**.
    This step uses the tokens in `seedot/compiler/antlr/seedotTokens.py` and the grammar in `seedot/compiler/antlr/seedotParser.py`, which are in turn generated by the __ANTLR 4.7__ software from `seedot/compiler/antlr/seedot.tokens` and `seedot/compiler/antlr/seedot.g4` respectively.

10. Then, type inference is performed on the AST, followed by conversion to SeeDot's **Intermediate Representation (IR)**.
    This is done using the `IRBuilder` class in `seedot/compiler/ir/irBuilder.py`. The IR is a series of function calls (and the corresponding arguments) whose end result is the required output. This set of function calls is also referred to as the main body of SeeDot's inference.

11. Once the IR is generated, control is passed back to the `Compiler` class.
    `Compiler` calls the target code-generator (for x86, the `x86` class in `seedot/compiler/codegen/x86.py`; for Arduino, the `Arduino` class in `seedot/compiler/codegen/arduino.py`; for M3, the `M3` class in `seedot/compiler/codegen/m3.py`), which generates the C++ code for the specified target device. (At this step, the `x86` code-generator is instantiated.)

12. After code generation, control is passed back to the `Compiler`, which in turn passes control back to `seedot/main.py:runForFloat`.

13. `seedot/main.py:runForFloat` instantiates the `Predictor` class in `seedot/predictor.py`.

14. `Predictor` builds the x86 temp files generated by the compiler. This build is supported on `Windows` and `Linux`.

15. Once the build is completed, the code is run and the result is printed to the console.

16. If a non-x86 target is specified, the target codes are generated using the code-generators specified in point 11. The target codes are stored in `arduinodump/` for Arduino and `m3dump/` for M3 by default.

### Fixed-point Encoding

The fixed-point SeeDot for FastGRNN is run using the following command:
```
python SeeDot-dev.py -a fastgrnn -e fixed -d usps10 -n 1 -t x86 -m red_disagree
```

On running this command,

1. The `main` function in `SeeDot-dev.py` initiates the `MainDriver` class in `SeeDot-dev.py`, which in turn initiates the `Main` class in `seedot/main.py`.
2. In `seedot/main.py`, depending on the encoding, the `runForFloat` or `runForFixed` function is called. (In this case, `runForFixed` will be called.)
3. In `runForFixed`, first the `collectProfileData` function is called. This collects the data that helps determine the most efficient scale and bitwidth for each variable.

4. In `collectProfileData`, the `Converter` class in `seedot/compiler/converter/converter.py` is instantiated with the arguments `(encoding=float, datasetType=Training, target=x86)`. (The arguments here are for representation purposes and to explain the differences between the multiple `Converter` instances.)
5. In `Converter`, the quantization code is called, which produces files used by the SeeDot compiler. Depending on the encoding, the `QuantizerFloat` or `QuantizerFixed` class is instantiated for `float` or `fixed` encoding respectively. These classes are defined in `seedot/compiler/converter/quantizer.py`. (In our case, `QuantizerFloat`.)
6. The `QuantizerFloat` class uses the `ParamsBuilder` class from `seedot/compiler/converter/paramsBuilder.py` to quantize the parameters.
7. After the `Converter` class finishes execution, control returns to `seedot/main.py:collectProfileData`.
8. Now, the `Compiler` class in `seedot/compiler/compiler.py` is instantiated.
9. The `Compiler` class reads `model/fastgrnn/usps10/input.sd` and creates an **Abstract Syntax Tree (AST)**.
    This step uses the tokens in `seedot/compiler/antlr/seedotTokens.py` and the grammar in `seedot/compiler/antlr/seedotParser.py`, which are in turn generated by the `ANTLR 4.7` software from `seedot/compiler/antlr/seedot.tokens` and `seedot/compiler/antlr/seedot.g4` respectively.
10. Then, type inference is performed on the AST, followed by conversion to SeeDot's **Intermediate Representation (IR)**.
    This is done using the `IRBuilder` class in `seedot/compiler/ir/irBuilder.py`.
    The IR is a series of function calls (and the corresponding arguments) whose end result is the required output. This set of function calls is also referred to as the main body of SeeDot's inference.
11. Once the IR is generated, control is passed back to the `Compiler` class.
    `Compiler` calls the target code-generator (for x86, the `x86` class in `seedot/compiler/codegen/x86.py`; for Arduino, the `Arduino` class in `seedot/compiler/codegen/arduino.py`; for M3, the `M3` class in `seedot/compiler/codegen/m3.py`), which generates the C++ code for the specified target device. (In this case, `x86`.)
12. After code generation, control is passed back to the `Compiler`, which in turn passes control back to `seedot/main.py:collectProfileData`.
13. The x86 target code is built and run by the `Predictor` class in `seedot/predictor.py`, called from `seedot/main.py:collectProfileData`.
14. This run of the x86 target code generates profiling data that is used by the next stages of SeeDot.
15. The `Predictor` returns control to `seedot/main.py:collectProfileData`, which in turn returns control to `seedot/main.py:runForFixed`.

16. `seedot/main.py:runForFixed` starts the exploration by first instantiating and running a `Converter` in `seedot/compiler/converter/converter.py` with the arguments `(encoding=fixed, datasetType=Training, target=x86)`.
    This in turn uses the `QuantizerFixed` class in `seedot/compiler/converter/quantizer.py`.
17. After the `Converter` returns, control is passed to `seedot/main.py:performSearch`.
18. `seedot/main.py:performSearch` performs the 4 stages of exploration to find the optimal bitwidth and scale allocation for each variable:
    1. Stage I: Determine the best scale for the input variable 'X'. 'X' is the variable that represents the input given to the inference model.
    2. Stage II: Determine the best scale for the non-'X' variables.
    3. Stage III: Demote variables one at a time to measure the impact on the correctness of inference.
    4. Stage IV: Demote variables cumulatively to find the assignment that best reduces model size and latency.

    Each of these stages calls the `Compiler` and `Predictor` multiple times to run the inference model.
19. Once the exploration is complete, `seedot/main.py:performSearch` stores the scale and bitwidth assignments as class variables and returns control to `seedot/main.py:runForFixed`.
20. `seedot/main.py:runForFixed` calls the `Converter` again with the arguments `(encoding=fixed, datasetType=Testing, target=x86)`. This generates the files required for SeeDot's compilation for inference over the testing dataset.
21. Once the `Converter` returns, `seedot/main.py:runForFixed` calls the `Compiler` and then the `Predictor` to obtain the final inference accuracy (correctness).
22. Finally, if the target is non-x86, `seedot/main.py:runForFixed` calls the `Compiler` class, which generates the code for the target device using the code-generators mentioned in point 11 above.
23. The target codes are stored in `arduinodump/` for Arduino and `m3dump/` for M3 by default.
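The four-stage search in steps 18–19 can be sketched as a toy greedy loop. All helper names, the scale range, and the accuracy tolerance below are hypothetical; the real logic in `seedot/main.py:performSearch` scores each candidate by compiling and running the model:

```python
def explore(variables, accuracy_of):
    """Toy sketch of the 4-stage exploration; accuracy_of(scales, demoted) -> float."""
    scales = {}
    # Stage I: best scale for the input variable 'X'.
    scales['X'] = max(range(-8, 9), key=lambda s: accuracy_of({'X': s}, []))
    # Stage II: best scale for every non-'X' variable, keeping earlier choices.
    for v in variables:
        if v != 'X':
            scales[v] = max(range(-8, 9), key=lambda s: accuracy_of({**scales, v: s}, []))
    base = accuracy_of(scales, [])
    # Stage III: demote each variable alone to 8 bits and record the accuracy drop.
    drop = {v: base - accuracy_of(scales, [v]) for v in variables}
    # Stage IV: demote cumulatively, least harmful first, while accuracy stays acceptable.
    demoted = []
    for v in sorted(variables, key=drop.get):
        if accuracy_of(scales, demoted + [v]) >= base - 1.0:  # tolerance is illustrative
            demoted.append(v)
    return scales, demoted

# Toy accuracy model: best scales are X=2 and W=-3; demoting 'W' costs 5% accuracy.
def accuracy_of(scales, demoted):
    acc = 100.0 - sum(abs(scales[v] - t) for v, t in {'X': 2, 'W': -3}.items() if v in scales)
    return acc - (5.0 if 'W' in demoted else 0.0)

print(explore(['X', 'W'], accuracy_of))  # ({'X': 2, 'W': -3}, ['X'])
```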


Copyright (c) Microsoft Corporation. All rights reserved. Licensed under the MIT license.

(One file's diff is not shown because of its size. Added image: Width | Height | Size: 102 KiB.)

@ -0,0 +1,433 @@
# Extending The Compiler

SeeDot translates the given input code into a sequence of function calls. The functions are already implemented in a library. What if the model being implemented needs a function that is not already implemented in the library? We show how to add the convolution operator.

**Note**: We only show how to add a new node for the x86 architecture. Adding the node for Arduino is omitted for brevity, but it follows the x86 guidelines very closely.

### STEP 1: Operator description

We present an implementation of the convolution operation that supports padding, strides, dilations, and groups (for depthwise-separable convolutions), similar to what is provided by PyTorch [here](https://pytorch.org/docs/stable/generated/torch.nn.Conv2d.html?highlight=conv2d#torch.nn.Conv2d).

Suppose our syntax looks like the following:
``` ocaml
let A = <some N*H*W*C tensor>
let B = <some G*FH*FW*C*Cout tensor>
let C = conv2d(A, B, {s 1 1}, {p 0 0 1 1}, {d 1 1}, {g 2})
```
In this syntax, `s` represents the 2 stride parameters (vertical, horizontal), `p` represents the 4 padding parameters (up, down, left, right), `d` represents the 2 dilation parameters (vertical, horizontal), and `g` represents the number of groups. For details, please refer to the PyTorch documentation of `conv2d` linked above.
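As a sanity check on what these parameters mean, the output spatial size follows the standard convolution output-size formula (the same computation reappears in the type checker in STEP 4). `conv2d_out_size` below is a hypothetical helper for illustration, not part of SeeDot:

```python
def conv2d_out_size(h, w, hf, wf, stride, padding, dilation):
    """Output (height, width) of conv2d for an HxW image and an HFxWF filter."""
    sh, sw = stride               # {s sh sw}: vertical, horizontal stride
    pu, pd, pl, pr = padding      # {p pu pd pl pr}: up, down, left, right padding
    dh, dw = dilation             # {d dh dw}: vertical, horizontal dilation
    hout = (h + pu + pd - dh * (hf - 1) - 1) // sh + 1
    wout = (w + pl + pr - dw * (wf - 1) - 1) // sw + 1
    return hout, wout

# A 32x32 image with a 3x3 filter, unit stride/dilation, no padding -> 30x30.
print(conv2d_out_size(32, 32, 3, 3, (1, 1), (0, 0, 0, 0), (1, 1)))  # (30, 30)
```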

### STEP 2: Adding the grammar in the file

1. The grammar is in the file `SeeDot\seedot\compiler\antlr\seedot.g4`. Add the following (feel free to change symbols if preferred):
    ``` bash
    | Conv2d '(' expr ',' expr ','
        '{s' IntConst IntConst '}' ','
        '{p' IntConst IntConst IntConst IntConst '}' ','
        '{d' IntConst IntConst '}' ','
        '{g' IntConst '}' ')' # convolution
    ```
    to the rules for `expr`. The name after the `#` (`convolution`) is the label that will be used for this operator when we generate the parser. (Watch out for methods named `visitConvolution` below.)

2. After updating the grammar file, we need to generate a new parser. Navigate to `SeeDot\seedot\lib\` in a terminal and execute the following command:
    ``` bash
    java -jar .\antlr-4.7-complete.jar ..\compiler\antlr\seedot.g4 -visitor -Dlanguage=Python3 -o kl
    ```
    This will generate the new parser and associated files in the folder `SeeDot\seedot\lib\compiler\antlr\`.

3. Within the folder `SeeDot\seedot\lib\compiler\antlr\` there will be 6 files. Delete `seedotListener.py` (it is not needed) and copy the rest of the files to the directory `SeeDot\seedot\compiler\antlr\` (there will already be 5 files of the same name, which you must overwrite). After this, SeeDot's parser will be updated.

### STEP 3: Adding new nodes in the AST

Since we are adding a new rule for `expr`, we need to add a new node for our convolution operation to the abstract syntax tree (AST).
1. Go to the file `SeeDot\seedot\compiler\ast\ast.py`.
    Here, we add a node which captures all the relevant information for the convolution operation. As discussed above, the node has an input image, an input filter, and properties called stride, padding, dilation, and groups. So we add the following class to the end of the file:
    ``` python
    class Convolution(ASTNode):

        def __init__(self, expr1, expr2, stride, padding, dilation, groups):
            super().__init__()
            self.expr1 = expr1
            self.expr2 = expr2
            self.stride = stride
            self.padding = padding
            self.dilation = dilation
            self.groups = groups
    ```
2. We also need a mechanism to construct an object of the class `Convolution`. Navigate to `SeeDot\seedot\compiler\ast\astBuilder.py`, and add the following method as a member of the class `ASTBuilder`:
    ``` python
    def visitConvolution(self, ctx: SeeDotParser.ConvolutionContext):
        expr1 = self.visit(ctx.expr(0))
        expr2 = self.visit(ctx.expr(1))
        stride = [int(ctx.IntConst(i).getText()) for i in range(0, 2)]
        padding = [int(ctx.IntConst(i).getText()) for i in range(2, 6)]
        dilation = [int(ctx.IntConst(i).getText()) for i in range(6, 8)]
        groups = int(ctx.IntConst(8).getText())
        return AST.Convolution(expr1, expr2, stride, padding, dilation, groups)
    ```
    We explain the above builder method. Take a look at the addition to the grammar file in **STEP 2**. In a convolution node, the first two sub-expressions signify the two inputs. Since they are sub-expressions within our convolution expression, we must recurse down both of them, one after the other. `ctx.expr(0)` refers to the first sub-expression, and `ctx.expr(1)` to the second.
    After the first two sub-expressions, we have 2 integers (0, 1) for the stride parameters, 4 integers (2, 3, 4, 5) for padding, 2 integers (6, 7) for dilation, and 1 integer (8) for groups. All of these numbers are read by the parser in the same order, and we extract them accordingly. Thus `ctx.expr()`, `ctx.IntConst(i).getText()` etc. are pre-generated by the parser from **STEP 2**. After extracting all parameters, we simply construct the `Convolution` object and return it.

3. Navigate to `SeeDot\seedot\compiler\ast\astVisitor.py`. The `ASTVisitor` class is inherited by the type checker, the IR generator, etc. In the `visit()` method, before the else branch, add the following:
    ``` python
    elif isinstance(node, AST.Convolution):
        return self.visitConvolution(node)
    ```
4. Edit some boilerplate code. In `SeeDot\seedot\compiler\ast\mtdAST.py`, within the `MtdAST` class, add the following method:
    ``` python
    def visitConvolution(self, node: AST.Convolution, mtd: dict):
        node.metadata.update(mtd)
        self.visit(node.expr1, mtd)
        self.visit(node.expr2, mtd)
    ```
    Include all the sub-expressions of the node being added, but there is no need to visit the `int` constants (they are not recursively explored).

5. In `SeeDot\seedot\compiler\ast\printAST.py`, within the `PrintAST` class, add the following method:
    ``` python
    def visitConvolution(self, node: AST.Convolution):
        node.expr1.printLevel = node.expr2.printLevel = node.printLevel + 1
        print(indent * node.printLevel, "conv(", )
        self.visit(node.expr1)
        self.visit(node.expr2)
        print(",", node.stride, ',', node.padding, ',', node.dilation, ',', node.groups, ')')
    ```
    This is for pretty printing. Feel free to edit it if it is not pretty enough.

### STEP 4: Type Checking

Navigate to `SeeDot\seedot\compiler\type.py` and add the following code as a member method of the `Type` class:
``` python
def visitConvolution(self, node: ast.Convolution):
    node.expr1.gamma = dict(node.gamma)
    eType = self.visit(node.expr1)

    assert eType.dim == 4
    [n, h, w, cin] = eType.shape

    node.expr2.gamma = dict(node.gamma)
    fType = self.visit(node.expr2)

    assert fType.dim == 5
    [g, hf, wf, cin_, cout] = fType.shape

    assert cin_ * g == cin
    assert g == node.groups

    assert hf % 2 == wf % 2 == 1, "Only odd filter sizes supported"

    for i in range(0, 4):
        assert node.padding[i] >= 0, "Padding cannot be negative"
    assert node.stride[0] > 0 and node.stride[1] > 0, "Stride must be positive"
    assert node.dilation[0] > 0 and node.dilation[1] > 0, "Dilation must be positive"

    hout = (h + node.padding[0] + node.padding[1] - node.dilation[0] * (hf - 1) - 1) // node.stride[0] + 1
    wout = (w + node.padding[2] + node.padding[3] - node.dilation[1] * (wf - 1) - 1) // node.stride[1] + 1
    shape = [n, hout, wout, g * cout]

    node.type = Tensor(shape)
    return node.type
```
This function checks whether the types of the inputs are compatible with the output, whether there are illegal values, and so on.
For example, the following checks are made for convolution:
1. The input image should have dimension 4, corresponding to (batch, number of rows, number of columns, channels), and the filter should match the number of channels.
2. Our implementation does not support even filter sizes, so we add a check here to allow only odd filter sizes, throwing an exception otherwise.
3. The values for padding must not be negative, and strides and dilations must be positive.

The method also computes the type of the output node given the inputs (look at the assignment `node.type = Tensor(shape)`, which is returned).
For other operators, different checks may apply. For example, for addition, the dimensions of both inputs should match, and the output has the same dimensions as the inputs.

### STEP 5: Implementing the operator

**Note**: *This part can be tricky, as it involves a good understanding of fixed-point arithmetic and of how to best tune the hyperparameters introduced by fixed point.
It is recommended to go through the MatAdd, MatMul, and exponentiation function codes to get an idea of how to write fixed-point operations.*

1. We first add an implementation of the operator in floating point. We navigate to `SeeDot\seedot\Predictor\library_float.h` and add the following:
    ``` cpp
    void Convolution(float *A, const float *B, float *C, float *tmp, MYINT N, MYINT H, MYINT W, MYINT CIN, MYINT HF, MYINT WF, MYINT CINF, MYINT COUTF, MYINT HOUT, MYINT WOUT, MYINT HPADL, MYINT HPADR, MYINT WPADL, MYINT WPADR, MYINT HSTR, MYINT WSTR, MYINT HDL, MYINT WDL, MYINT G, MYINT shrA, MYINT shrB, MYINT H1, MYINT H2);
    ```
2. We navigate to `SeeDot\seedot\Predictor\library_float.cpp` and add the following:
    ``` cpp
    void Convolution(float *A, const float *B, float *C, float *tmp, MYINT N, MYINT H, MYINT W, MYINT CIN, MYINT HF, MYINT WF, MYINT CINF, MYINT COUTF, MYINT HOUT, MYINT WOUT, MYINT HPADL, MYINT HPADR, MYINT WPADL, MYINT WPADR, MYINT HSTR, MYINT WSTR, MYINT HDL, MYINT WDL, MYINT G, MYINT shrA, MYINT shrB, MYINT H1, MYINT H2) {
        MYITE HOffsetL = HDL * (HF / 2) - HPADL;
        MYITE WOffsetL = WDL * (WF / 2) - WPADL;
        MYITE HOffsetR = HDL * (HF / 2) - HPADR;
        MYITE WOffsetR = WDL * (WF / 2) - WPADR;

        for (MYITE n = 0; n < N; n++) {
            for (MYITE h = HOffsetL, hout = 0; h < H - HOffsetR; h += HSTR, hout++) {
                for (MYITE w = WOffsetL, wout = 0; w < W - WOffsetR; w += WSTR, wout++) {
                    for (MYITE g = 0; g < G; g++) {
                        for (MYITE co = 0; co < COUTF; co++) {
                            MYITE counter = 0;

                            for (MYITE hf = -(HF / 2); hf <= HF / 2; hf++) {
                                for (MYITE wf = -(WF / 2); wf <= WF / 2; wf++) {
                                    for (MYITE ci = 0; ci < CINF; ci++) {
                                        float a = (((h + HDL * hf) < 0) || ((h + HDL * hf) >= H) || ((w + WDL * wf) < 0) || ((w + WDL * wf) >= W)) ? 0 : A[n * H * W * CIN + (h + HDL * hf) * W * CIN + (w + WDL * wf) * CIN + (ci + g * CINF)];
                                        float b = B[g * HF * WF * CINF * COUTF + (hf + HF / 2) * WF * CINF * COUTF + (wf + WF / 2) * CINF * COUTF + ci * COUTF + co];
                                        tmp[counter] = a * b;
                                        counter++;
                                    }
                                }
                            }

                            MYITE totalEle = HF * WF * CINF;
                            MYITE count = HF * WF * CINF, depth = 0;

                            // Tree-sum reduction. In floating point no rescaling is
                            // needed, so there is no halving step here.
                            while (depth < (H1 + H2)) {
                                for (MYITE p = 0; p < (totalEle / 2 + 1); p++) {
                                    float sum;
                                    if (p < (count >> 1)) {
                                        sum = tmp[2 * p] + tmp[(2 * p) + 1];
                                    } else if ((p == (count >> 1)) && ((count & 1) == 1)) {
                                        sum = tmp[2 * p];
                                    } else {
                                        sum = 0;
                                    }
                                    tmp[p] = sum;
                                }

                                count = (count + 1) >> 1;
                                depth++;
                            }
                            C[n * HOUT * WOUT * (COUTF * G) + hout * WOUT * (COUTF * G) + wout * (COUTF * G) + (co + g * COUTF)] = tmp[0];
                        }
                    }
                }
            }
        }
    }
    ```
    In most cases, this implementation itself can act as a reference for adding other operators. The code from `MYITE counter = 0;` up to `MYITE totalEle = HF * WF * CINF;` is the standard convolution operation, and the while loop in the code is a tree-sum addition (from SeeDot, Gopinath et al., PLDI 2019).
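The tree-sum loop is easier to digest in isolation. A Python sketch of the same idea under toy assumptions (Python's floor division stands in for the C integer division, which rounds differently for negative values):

```python
def tree_sum(tmp, count, height_shr, height_noshr):
    """Pairwise tree reduction over tmp[0:count], as in the Convolution loop.

    For the first height_shr levels each operand is halved before adding
    (dropping one bit of scale to avoid fixed-point overflow); for the
    remaining height_noshr levels values are added as-is. The two heights
    correspond to the H1 and H2 parameters in the library code.
    """
    total = len(tmp)
    for depth in range(height_shr + height_noshr):
        shr = depth < height_shr
        for p in range(total // 2 + 1):
            if p < count // 2:
                s = (tmp[2 * p] // 2 + tmp[2 * p + 1] // 2) if shr else (tmp[2 * p] + tmp[2 * p + 1])
            elif p == count // 2 and count % 2 == 1:
                s = tmp[2 * p] // 2 if shr else tmp[2 * p]
            else:
                s = 0
            tmp[p] = s
        count = (count + 1) // 2
    return tmp[0]

print(tree_sum([1, 2, 3, 4], 4, 0, 2))  # 10: plain pairwise sum
print(tree_sum([4, 4, 4, 4], 4, 2, 0))  # 4: every level halves, i.e. 16 / 2**2
```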
|
||||
|
||||
3. Navigate to `SeeDot\seedot\Predictor\library_fixed.h` and append the following:
|
||||
``` cpp
|
||||
template<class TypeA, class TypeB, class TypeTemp, class TypeC>
|
||||
void Convolution(TypeA *A, const TypeB *B, TypeC *C, TypeTemp *tmp, MYINT N, MYINT H, MYINT W, MYINT CIN, MYINT HF, MYINT WF, MYINT CINF, MYINT COUTF, MYINT HOUT, MYINT WOUT, MYINT HPADL, MYINT HPADR, MYINT WPADL, MYINT WPADR, MYINT HSTR, MYINT WSTR, MYINT HDL, MYINT WDL, MYINT G, MYINT shrA, MYINT shrB, MYINT H1, MYINT H2, MYINT demote) {
|
||||
// Most parameters are self explanatory ones (parameters like inputs, outputs, sizes of inputs and outputs and filters). TypeA, TypeB, TypeTemp, TypeC can be int8_t, int16_t or int32_t, and represent the bitwidths for input image, input filter, temp buffer, output image.
|
||||
// shrA, shrB, H1, H2, demote are parameters pertaining to fixed point code:
|
||||
// shrA and shrB and demote are used for controlling scale during fixed point multiplication (Refer to Section 2.2 and Section 3 of the paper)
|
||||
// H1 and H2 are parameters of tree sum (the while loop in the function), which is used to sum up long vectors without losing significant bits
|
||||
MYITE HOffsetL = HDL*(HF/2) - HPADL;
|
||||
MYITE WOffsetL = WDL*(WF/2) - WPADL;
|
||||
MYITE HOffsetR = HDL*(HF/2) - HPADR;
|
||||
MYITE WOffsetR = WDL*(WF/2) - WPADR;
|
||||
|
||||
for (MYITE n = 0; n < N; n++) {
|
||||
for (MYITE h = HOffsetL, hout = 0; h < H - HOffsetR; h += HSTR, hout++) {
|
||||
for (MYITE w = WOffsetL, wout = 0; w < W - WOffsetR; w += WSTR, wout++) {
|
||||
for (MYITE g = 0; g < G; g++) {
|
||||
for (MYITE co = 0; co < COUTF; co++) {
|
||||
|
||||
MYITE counter = 0;
|
||||
for (MYITE hf = -(HF/2); hf <= HF/2; hf++) {
|
||||
for (MYITE wf = -(WF/2); wf <= WF/2; wf++) {
|
||||
for (MYITE ci = 0; ci < CINF; ci++) {
|
||||
|
||||
TypeTemp a = (TypeTemp) (((h + HDL * hf) < 0) || ((h + HDL * hf) >= H) || ((w + WDL * wf) < 0) || ((w + WDL * wf) >= W)) ? 0 : A[n * H * W * CIN + (h + HDL * hf) * W * CIN + (w + WDL * wf) * CIN + (ci + g * CINF)];
|
||||
TypeTemp b = (TypeTemp) B[g * HF * WF * CINF * COUTF + (hf + HF/2) * WF * CINF * COUTF + (wf + WF/2) * CINF * COUTF + ci * COUTF + co];
|
||||
tmp[counter] = a * b;
|
||||
counter++;
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
MYITE totalEle = HF * WF * CINF;
|
||||
MYITE count = HF * WF * CINF, depth = 0;
|
||||
|
||||
bool shr = true;
|
||||
while (depth < (H1 + H2)) {
|
||||
if (depth >= H1) {
|
||||
shr = false;
|
||||
}
|
||||
for (MYITE p = 0; p < (totalEle / 2 + 1); p++) {
|
||||
TypeTemp sum;
|
||||
if (p < (count >> 1)) {
|
||||
if (shr) {
|
||||
sum = tmp[2 * p] / 2 + tmp[(2 * p) + 1] / 2;
|
||||
} else {
|
||||
sum = tmp[2 * p] + tmp[(2 * p) + 1];
|
||||
}
|
||||
} else if ((p == (count >> 1)) && ((count & 1) == 1)) {
|
||||
if (shr) {
|
||||
sum = tmp[2 * p] / 2;
|
||||
} else {
|
||||
sum = tmp[2 * p];
|
||||
}
|
||||
} else {
|
||||
sum = 0;
|
||||
}
|
||||
|
||||
tmp[p] = sum;
|
||||
}
|
||||
|
||||
count = (count + 1) >> 1;
|
||||
depth++;
|
||||
}
|
||||
|
||||
C[n * HOUT * WOUT * (COUTF * G) + hout * WOUT * (COUTF * G) + wout * (COUTF * G) + (co + g * COUTF)] = Saturate<TypeC>(((tmp[0] / shrA) / shrB) / demote);
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
```

### STEP 6: Handling the operator in IR

**Note**: *This part can be tricky, as the user needs to know how to compute the bit-widths and scales of the outputs given the input parameters. It is recommended to study the MatAdd, MatMul, Exponentiation, and TanH functions for a good understanding.*

Navigate to `SeeDot/seedot/compiler/ir/irBuilder.py`, and add the following as a member method of the `IRBuilder` class:
``` Python
def visitConvolution(self, node: AST.Convolution):
    (prog_in_A, expr_in_A) = self.visit(node.expr1)
    (prog_in_B, expr_in_B) = self.visit(node.expr2)

    # The two visit calls above explore the ASTs of the input subexpressions.
    # A node with n subexpressions would have one such call per subexpression.

    [expr_treeSum, expr_out] = self.getTempVars(2)

    # Fresh temporary variables are assigned to both the output and the
    # temporary buffer used by the function.

    [N, H, W, Cin] = node.expr1.type.shape
    [G, Hf, Wf, CinF, CoutF] = node.expr2.type.shape

    # The type-checked inputs' dimensions are extracted from the node, as our
    # implementation expects these values. It may or may not be necessary to
    # extract all such information.

    type_treeSum = Type.Tensor([Hf * Wf * CinF])
    type_out = node.type

    # The convolution operator uses a temporary buffer whose type can be
    # inferred from the inputs' types but which never appears in the source
    # code. Hence it is declared here.

    bitwidth_in_A, scale_in_A = self.getBitwidthAndScale(expr_in_A.idf)
    bitwidth_in_B, scale_in_B = self.getBitwidthAndScale(expr_in_B.idf)
    if self.ddsEnabled:
        bitwidth_out, scale_out = self.getBitwidthAndScale(expr_out.idf)
        bitwidth_temp, scale_temp = self.getBitwidthAndScale(expr_out.idf, native=True)
    else:
        bitwidth_out = config.wordLength // 2 if expr_out.idf in self.demotedVarsList else config.wordLength
        scale_out, scale_temp = None, None
        bitwidth_temp = bitwidth_out

    # If the ddsEnabled flag is true, the output's scale is computed through
    # profiling; if it is false, the scale has to be computed manually. The
    # getBitwidthAndScale method always returns the scale and the bit-width
    # assignment (which variables use 8 bits is known before the IRBuilder
    # class is invoked).

    intv_in_A, intv_in_B = (0, 0), (0, 0)
    intv_out = (0, 0)

    shr_A, shr_B, H1, H2, demote, scale_out = self.getShrTreeSumAndDemoteParamsForMul(bitwidth_in_A, scale_in_A, bitwidth_in_B, scale_in_B, bitwidth_temp, scale_temp, bitwidth_out, scale_out, Hf * Wf * CinF)

    # The helper method getShrTreeSumAndDemoteParamsForMul can be used directly
    # to infer the scales of multiplication outputs and to compute the
    # parameters which the fixed-point implementation expects for a
    # multiplication. There are many such helper methods; see the MatAdd node
    # for the addition equivalent.

    shr_A = self.formatShr(shr_A)
    shr_B = self.formatShr(shr_B)

    expr_in_A.inputVar = False
    expr_in_B.inputVar = False
    expr_out.inputVar = False
    expr_treeSum.inputVar = False

    if forFixed():
        self.varsForBitwidth[expr_treeSum.idf] = bitwidth_temp

    comment = IR.Comment('conv(%s, %s)' % (expr_in_A.idf, expr_in_B.idf), self.counter_inst+1)
    self.allDepths[self.counter_inst+1] = self.curDepth

    bitwidth_mul = self.getTempBitwidth(bitwidth_in_A, bitwidth_in_B, "mul")
    if self.vbwEnabled:
        self.varsForBitwidth[expr_treeSum.idf] = bitwidth_mul

    # The temporary buffer variables are introduced by the compiler and are not
    # profiled by the input code, so their bit-widths are set from the
    # bit-widths of the inputs.

    argMap = {
        expr_in_A: "A",
        expr_in_B: "B",
        expr_out: "C",
        expr_treeSum: "tmp",
        IR.Int(N): "N",
        IR.Int(H): "H",
        IR.Int(W): "W",
        IR.Int(Cin): "CIN",
        IR.Int(Hf): "HF",
        IR.Int(Wf): "WF",
        IR.Int(CinF): "CINF",
        IR.Int(CoutF): "COUTF",
        IR.Int(type_out.shape[1]): "HOUT",
        IR.Int(type_out.shape[2]): "WOUT",
        IR.Int(node.padding[0]): "HPADL",
        IR.Int(node.padding[1]): "HPADR",
        IR.Int(node.padding[2]): "WPADL",
        IR.Int(node.padding[3]): "WPADR",
        IR.Int(node.stride[0]): "HSTR",
        IR.Int(node.stride[1]): "WSTR",
        IR.Int(node.dilation[0]): "HDL",
        IR.Int(node.dilation[1]): "WDL",
        IR.Int(G): "G",
        shr_A: "shrA",
        shr_B: "shrB",
        IR.Int(H1): "H1",
        IR.Int(H2): "H2"
    }

    if self.vbwEnabled:
        argMap[IR.Int(demote)] = "demote"

    if not self.vbwEnabled:
        funcCall = IR.FuncCall("Convolution", argMap)
    else:
        funcCall = IR.FuncCall("Convolution" + ("<int%d_t, int%d_t, int%d_t, int%d_t>" % (bitwidth_in_A, bitwidth_in_B, bitwidth_mul, bitwidth_out)), argMap)

    # The argMap variable holds the arguments of the function call. Each key
    # corresponds to one argument; the value is not required but is added for
    # reference.

    self.counter_inst += 1
    self.updateLiveRange([expr_in_A, expr_in_B, expr_out, expr_treeSum])

    # The updateLiveRange call above updates the live ranges of the variables
    # used by this operation. Refer to Section 6 of the paper on memory
    # management.

    profile = IR.FuncCall("Profile4", {
        expr_out: "Var",
        IR.Int(N): "I",
        IR.Int(type_out.shape[1]): "J",
        IR.Int(type_out.shape[2]): "K",
        IR.Int(CoutF * G): "L",
        IR.String(expr_out): "VarName"
    })

    if forFloat():
        self.independentVars.append(expr_out.idf)

    # For convolution, the output variable is profiled from the data to infer
    # its scale. For any variable which needs to be profiled, the above
    # profiling call is added (it is emitted only into the floating-point
    # code). If a variable's scale is known in advance (for example, sigmoid's
    # output is always bounded by +-1), it need not be profiled and the output
    # scale can be set directly.

    prog_conv = IR.Prog([comment, funcCall, profile] if forFloat() and self.ddsEnabled else [comment, funcCall])

    prog_out = IRUtil.concatPrograms(prog_in_A, prog_in_B, prog_conv)

    # Update the context for the output variable.
    self.varDeclarations[expr_out.idf] = type_out
    self.varScales[expr_out.idf] = scale_out
    self.varIntervals[expr_out.idf] = intv_out

    self.varDeclarations[expr_treeSum.idf] = type_treeSum
    self.varScales[expr_treeSum.idf] = scale_temp
    self.varIntervals[expr_treeSum.idf] = (0, 0)

    # varDeclarations lists the variables which must be instantiated in the
    # generated code.

    self.log.print(comment.msg)
    self.log.print("\tInput1: scale = %d, interval = [%d, %d]" % ((self.varScales[expr_in_A.idf],) + self.varIntervals[expr_in_A.idf]))
    self.log.print("\tInput2: scale = %d, interval = [%d, %d]" % ((self.varScales[expr_in_B.idf],) + self.varIntervals[expr_in_B.idf]))
    self.log.print("\tOutput: scale = %d, interval = [%d, %d]" % ((self.varScales[expr_out.idf],) + self.varIntervals[expr_out.idf]))

    return (prog_out, expr_out)
```
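To see why a helper like `getShrTreeSumAndDemoteParamsForMul` is needed, it helps to recall how scales combine under fixed-point multiplication. The sketch below illustrates the arithmetic only; `to_fixed` and `fixed_mul` are hypothetical names rather than SeeDot APIs, and the shift amounts are assumed to be powers of two:

```python
def to_fixed(x, scale, bits=16):
    """Represent a real x as an integer q with x ≈ q * 2**scale."""
    q = round(x / 2 ** scale)
    assert -(1 << (bits - 1)) <= q < (1 << (bits - 1)), "overflow at this scale"
    return q

def fixed_mul(qa, scale_a, qb, scale_b, shr_a, shr_b):
    """Multiply two fixed-point values. Each operand is pre-shifted down
    (the role of shrA/shrB above) so the product fits the accumulator;
    the result's scale grows by log2(shr_a * shr_b) accordingly."""
    prod = (qa // shr_a) * (qb // shr_b)
    scale_out = scale_a + scale_b + (shr_a * shr_b).bit_length() - 1
    return prod, scale_out

qa = to_fixed(0.75, -8)            # 192
qb = to_fixed(0.5, -8)             # 128
q, s = fixed_mul(qa, -8, qb, -8, 16, 16)
print(q * 2 ** s)                  # 0.375
```

The real helper additionally accounts for the tree-sum depths H1 and H2 and the demotion factor, but the core idea is the same: the product of two values at scales `scale_a` and `scale_b` naturally lives at scale `scale_a + scale_b`, and shifts move it to the scale chosen for the output.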

### STEP 7: Use the operator

After all these steps, the operator can simply be used in the input Shiftry code, and the compiler will handle the rest.

@ -0,0 +1,366 @@
# Quantized Face Detection using SeeDot

Face detection using SeeDot can be performed on the `face-2` and `face-4` datasets, which are subsets of the [SCUT Head Part B](https://github.com/HCIILAB/SCUT-HEAD-Dataset-Release) dataset.

Face detection is a regression problem: the input is an image with (or without) faces, and the output is a set of bounding boxes superimposed around the faces.
Face detection is supported for x86 and ARM Cortex-M3 target devices.

Note:
1. This README has been tested with **Python 3.7.9**. Quantization with SeeDot requires **GCC** *version 8 or higher*.
2. The dataset `face-2` corresponds to the model **RPool_Face_QVGA_monochrome**, and `face-4` corresponds to the model **RPool_Face_M4**.

## Training Face Detection

Run the following commands to download the training data for face detection and train the model. The trained model will then be used for quantized face detection.

### Setup and Downloading the Dataset

Assuming that the current directory is `EdgeML/tools/seedot`, run the following commands.

Create directories for SeeDot's input (which will be generated at the end of this section) and install the python libraries used by SeeDot:
```
cd ../../tf/

# For systems with CPU only
pip install -r requirements-cpu.txt

# For systems with CPU+GPU
pip install -r requirements-gpu.txt

# For both
pip install -e .
cd ../tools/SeeDot/

mkdir -p model/rnnpool/face-2/
mkdir -p model/rnnpool/face-4/

mkdir -p datasets/rnnpool/face-2/
mkdir -p datasets/rnnpool/face-4/
```

Install the python libraries:
```
cd ../../pytorch

# For a CPU-only system
pip install -r requirements-cpu.txt

# For a CPU+GPU system
pip install -r requirements-gpu.txt

# For both
pip install -e .
```

Next, visit the [PyTorch website](https://pytorch.org/get-started/locally/) and install PyTorch version 1.7.1 as per your system requirements.
Here, we use PyTorch version 1.7.1 with CUDA 11:

```
pip install torch==1.7.1+cu110 torchvision==0.8.2+cu110 torchaudio===0.7.2 -f https://download.pytorch.org/whl/torch_stable.html
```
Then, continue running the following commands:
```
# For a CPU+GPU system
pip install -e edgeml_pytorch/cuda/

# For both
cd ../examples/pytorch/vision/Face_Detection/
pip install -r requirements.txt
```

Steps to download the dataset:

Note: The datasets can also be stored in any location outside the repository.
Here, we assume that the dataset will be downloaded to `/mnt/` and that the current directory is stored in the environment variable *CUR_DIR*.

```
export CUR_DIR=$(pwd)
cd /mnt/
mkdir -p WIDER_FACE/
```

Download the WIDER face dataset images and annotations from [http://shuoyang1213.me/WIDERFACE/](http://shuoyang1213.me/WIDERFACE/) and place them in the `WIDER_FACE` folder created above. The `WIDER_FACE` folder must now contain `WIDER_train.zip`, `WIDER_test.zip`, `WIDER_val.zip`, and `wider_face_split.zip`.

Download the `SCUT Head Part B` dataset from the [drive link](https://drive.google.com/file/d/1LZ_KlTPStDEcqycfqUkDkqQ-aNMMC3cl/view) and place it in the `/mnt` folder (the dataset directory), which should then contain `SCUT_HEAD_Part_B.zip`.

Now, decompress the datasets and set the `DATA_HOME` environment variable (so the training script can find the dataset) using the following commands:
```
cd WIDER_FACE/
unzip WIDER_train.zip
unzip WIDER_val.zip
unzip WIDER_test.zip
unzip wider_face_split.zip
cd ../

unzip SCUT_HEAD_Part_B.zip
export DATA_HOME=$(pwd) # For the scripts to find the datasets

cd $CUR_DIR # To go back to the Face_Detection directory

# Data pre-processing
IS_QVGA_MONO=1 python prepare_wider_data.py
```

### Training

From here, we have two model options: for `face-2`, we use the model **RPool_Face_QVGA_monochrome**, and for `face-4`, we use the model **RPool_Face_M4**.

Note: The training script has the arguments `--cuda` and `--multigpu`, which have to be set to `True` or `False` based on the system configuration. (In this README, both are set to `True`; `--cuda` and `--multigpu` must always be set to `True` or `False` together.)

To start training:
1. For `face-2`:
```
IS_QVGA_MONO=1 python train.py --batch_size 64 --model_arch RPool_Face_QVGA_monochrome --cuda True --multigpu True --save_folder weights/ --epochs 300 --save_frequency 5000

# If training has to be stopped prematurely, it can be resumed using the following command
# IS_QVGA_MONO=1 python train.py --batch_size 64 --model_arch RPool_Face_QVGA_monochrome --cuda True --multigpu True --save_folder weights/ --epochs 300 --save_frequency 5000 --resume weights/RPool_Face_QVGA_monochrome_best_state.pth
```
2. For `face-4`:
```
IS_QVGA_MONO=1 python train.py --batch_size 64 --model_arch RPool_Face_M4 --cuda True --multigpu True --save_folder weights/ --epochs 300 --save_frequency 5000

# If training has to be stopped prematurely, it can be resumed using the following command
# IS_QVGA_MONO=1 python train.py --batch_size 64 --model_arch RPool_Face_M4 --cuda True --multigpu True --save_folder weights/ --epochs 300 --save_frequency 5000 --resume weights/RPool_Face_M4_best_state.pth
```

This will train the model on the **WIDER face** dataset. Now, to fine-tune the model on the **SCUT Head Part B** dataset, run the following commands.

1. For `face-2`:
```
IS_QVGA_MONO=1 python train.py --batch_size 64 --model_arch RPool_Face_QVGA_monochrome --cuda True --multigpu True --save_folder weights/ --epochs 300 --save_frequency 5000 --resume weights/RPool_Face_QVGA_monochrome_best_state.pth --finetune True
```
2. For `face-4`:
```
IS_QVGA_MONO=1 python train.py --batch_size 64 --model_arch RPool_Face_M4 --cuda True --multigpu True --save_folder weights/ --epochs 300 --save_frequency 5000 --resume weights/RPool_Face_M4_best_state.pth --finetune True
```

### Generating SeeDot's Input

Now that the model is trained, we need to generate the model traces and convert them to SeeDot's format. For that, we first create a subset of the SCUT Head Part B dataset which will be used for quantization.

Run the following commands:
1. Creating a subset of `SCUT_HEAD_Part_B`:
```
mkdir images/
cp ${DATA_HOME}/SCUT_HEAD_Part_B/JPEGImages/PartB_00009.jpg images/ && \
cp ${DATA_HOME}/SCUT_HEAD_Part_B/JPEGImages/PartB_00052.jpg images/ && \
cp ${DATA_HOME}/SCUT_HEAD_Part_B/JPEGImages/PartB_00082.jpg images/ && \
cp ${DATA_HOME}/SCUT_HEAD_Part_B/JPEGImages/PartB_00101.jpg images/ && \
cp ${DATA_HOME}/SCUT_HEAD_Part_B/JPEGImages/PartB_00112.jpg images/ && \
cp ${DATA_HOME}/SCUT_HEAD_Part_B/JPEGImages/PartB_00170.jpg images/ && \
cp ${DATA_HOME}/SCUT_HEAD_Part_B/JPEGImages/PartB_00195.jpg images/ && \
cp ${DATA_HOME}/SCUT_HEAD_Part_B/JPEGImages/PartB_00376.jpg images/ && \
cp ${DATA_HOME}/SCUT_HEAD_Part_B/JPEGImages/PartB_00398.jpg images/ && \
cp ${DATA_HOME}/SCUT_HEAD_Part_B/JPEGImages/PartB_00601.jpg images/ && \
cp ${DATA_HOME}/SCUT_HEAD_Part_B/JPEGImages/PartB_00675.jpg images/ && \
cp ${DATA_HOME}/SCUT_HEAD_Part_B/JPEGImages/PartB_00735.jpg images/ && \
cp ${DATA_HOME}/SCUT_HEAD_Part_B/JPEGImages/PartB_00973.jpg images/ && \
cp ${DATA_HOME}/SCUT_HEAD_Part_B/JPEGImages/PartB_02378.jpg images/ && \
cp ${DATA_HOME}/SCUT_HEAD_Part_B/JPEGImages/PartB_02396.jpg images/
```
2. Obtaining traces:
    1. For `face-2`:
    ```
    IS_QVGA_MONO=1 python eval.py --model_arch RPool_Face_QVGA_monochrome --model ./weights/RPool_Face_QVGA_monochrome_best_state.pth --image_folder images/ --save_dir results/ --thresh 0.5 --multigpu True --save_traces True
    ```
    2. For `face-4`:
    ```
    IS_QVGA_MONO=1 python eval.py --model_arch RPool_Face_M4 --model ./weights/RPool_Face_M4_best_state.pth --image_folder images/ --save_dir results/ --thresh 0.5 --multigpu True --save_traces True
    ```
3. Converting the traces to SeeDot format:
    1. For `face-2`:
    ```
    IS_QVGA_MONO=1 python convert_RPool_Face_to_SeeDot.py --model_arch RPool_Face_QVGA_monochrome --model ./weights/RPool_Face_QVGA_monochrome_best_state.pth
    ```
    2. For `face-4`:
    ```
    IS_QVGA_MONO=1 python convert_RPool_Face_to_SeeDot.py --model_arch RPool_Face_M4 --model ./weights/RPool_Face_M4_best_state.pth
    ```
This will store SeeDot's input in `EdgeML/tools/SeeDot/model/rnnpool/face-2/` and `EdgeML/tools/SeeDot/datasets/rnnpool/face-2/` for `face-2`, or in `EdgeML/tools/SeeDot/model/rnnpool/face-4/` and `EdgeML/tools/SeeDot/datasets/rnnpool/face-4/` for `face-4`.

### Setting up SeeDot

To finish setting up SeeDot, run the following commands:

```
cd ../../../../tools/SeeDot

# For face-2
python fixSeeDotInput.py --seedot_file seedot/compiler/input/rnnpool-face-2.sd --model_dir model/rnnpool/face-2/ --dataset_dir datasets/rnnpool/face-2/ -n 18000

# For face-4
python fixSeeDotInput.py --seedot_file seedot/compiler/input/rnnpool-face-4.sd --model_dir model/rnnpool/face-4/ --dataset_dir datasets/rnnpool/face-4/ -n 18000
```


## Running SeeDot

### Run Face Detection for x86
To run face detection using the SeeDot quantizer on x86 devices, run the following command:

1. For `face-2`:
```
python SeeDot-dev.py -a rnnpool -e fixed -m disagree -d face-2 -dt testing -t x86 -n 18000
```

2. For `face-4`:
```
python SeeDot-dev.py -a rnnpool -e fixed -m disagree -d face-4 -dt testing -t x86 -n 18000
```

The non-optional arguments used in the above commands are:
```
Argument             Value      Description

-a, --algo           rnnpool    Face detection uses the 'rnnpool' machine
                                learning algorithm.

-e, --encoding       fixed/     The encoding to use for face detection.
                     float      The options are 'float' and 'fixed'.

-m, --metric         disagree   The metric used to measure the correctness of a
                                prediction. This is used for quantization.

-d, --dataset        face-2/    The dataset to use for face detection.
                     face-4     The options are 'face-2' and 'face-4'.

-dt, --datasetType   training/  The type of the dataset to use for quantization.
                     testing    The options are 'training' and 'testing'.
                                Default is 'testing'.

-t, --target         x86        The target device for which the code has to be
                                generated; 'x86' in this case.

-n, --numOutputs     18000      The size of the output vector of the rnnpool
                                face detection algorithm.
```
The output will be stored in the `temp/` directory by default. Use the `-o <destination folder>` flag to store the output in an existing directory of your choice.

### Run Face Detection for M3
To run face detection using the SeeDot quantizer for the M3 device, run:

```
python SeeDot-dev.py -a rnnpool -e fixed -m disagree -d face-2 -dt testing -t m3 -n 18000
```
for the `face-2` dataset, and:
```
python SeeDot-dev.py -a rnnpool -e fixed -m disagree -d face-4 -dt testing -t m3 -n 18000
```
for the `face-4` dataset.

Note: The target device specified using the `-t` argument has the value `m3` in this case. See above for a detailed argument description.

The output will be stored in the `m3dump/` directory by default. Use the `-o <destination folder>` flag to store the output in an existing directory of your choice.

## Run SeeDot's Output - Quantized Prediction Code

This section discusses running the quantized fixed-point model generated by SeeDot on random images from the `SCUT Head Part B` dataset.

### Library Installation
This section requires the `pytorch`, `cv2`, `easydict`, and `six` python libraries. Run the following commands to install them:
```
pip install opencv-python
pip install easydict
pip install six
```

### Running Face Detection on Image Input

By default, the quantized `x86` code is stored in the `temp/` directory.

To run the model, we use the `Predictor` executable in the `temp/Predictor` directory.

First, we download an image to test the quantized `Predictor`:
```
cd faceDetection/
mkdir -p images/ && cd images/
cp ${DATA_HOME}/SCUT_HEAD_Part_B/JPEGImages/PartB_00007.jpg ./
cd ..
cp -r ../../../examples/pytorch/vision/Face_Detection/layers/ .
cp -r ../../../examples/pytorch/vision/Face_Detection/utils/ .
mkdir -p data/
cp -r ../../../examples/pytorch/vision/Face_Detection/data/choose_config.py data/
cp -r ../../../examples/pytorch/vision/Face_Detection/data/config_qvga.py data/
```

Note: The last five commands copy scripts from elsewhere in this repository that help convert images to text files and back.

This copies `PartB_00007.jpg` to the `images/` directory. You can use your own images instead, and multiple images can be added to the `images/` directory; they will all be processed in a single execution.

Note that all of the following python scripts are run in the `SeeDot/faceDetection` directory.

To create the processed file used by `Predictor` from the set of images, run:

```
python scale_image.py --image_dir images/ --out_dir input/
```

The script `scale_image.py` reads all the images in the `images/` directory (the default for the `image_dir` field) and writes the files `X.csv` and `Y.csv` to the `input/` directory (the default for the `out_dir` field).

Now, copy the executable to this directory and run it using the commands below:
```
cp ../temp/Predictor/Predictor .
mkdir -p input/
mkdir -p output/
./Predictor fixed testing regression 18000 False
```

All arguments must be specified when running the executable. The arguments are:
```
Argument     Description

fixed        The encoding. This depends on the 'encoding' argument
             given to SeeDot.

testing      This corresponds to SeeDot's 'datasetType' argument.

regression   The type of problem ('classification' or 'regression').
             This argument is inferred by SeeDot automatically.

18000        This is equal to the value given to SeeDot's
             'numOutputs' argument.
```

The executable is copied from `temp/Predictor` because that is SeeDot's default output directory for the **x86** target.

The executable takes its input from the `input/` directory (hence the use of `input` as the default for the `out_dir` argument of `scale_image.py`).

`X.csv` contains the floating-point input values, and `Y.csv` contains the expected outputs: integer labels for classification, or floating-point values for regression. However, since we are only concerned with the predicted output here, the contents of `Y.csv` are irrelevant.

In the case of `face-2` and `face-4`, the input layer size is *76800* and the output has *18000* values (the number of columns in `X.csv` and `Y.csv`, respectively).

The output of `Predictor` is stored in `trace.txt`.
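Each line of `trace.txt` holds the outputs for one image as space-separated fixed-point integers; `eval_fromquant.py` recovers real values by multiplying them by `2**scale` (the `scaleForY` value from the generated `m3dump/scales.h`). A minimal sketch of that conversion, assuming this layout:

```python
def trace_line_to_floats(line, scale):
    """Convert one line of trace.txt (space-separated fixed-point
    integers) into real values at the given scale."""
    return [float(v) * 2 ** scale for v in line.strip().split() if v]

# e.g. three outputs at scale -8:
print(trace_line_to_floats("128 -64 32", -8))  # [0.5, -0.25, 0.125]
```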

The executable also dumps execution stats and accuracy results in the `output/` directory. This directory must exist, but its contents are not needed here.


Now we construct the bounding boxes from `trace.txt` by running:
```
python eval_fromquant.py --save_dir results/ --image_dir images/ --trace_file trace.txt
```

The images on which the prediction was carried out are read from the `images/` directory (the default for the `image_dir` field). The `image_dir` argument must be the same for both `scale_image.py` and `eval_fromquant.py`.

The images with bounding boxes drawn are stored in the `results/` directory (the default for the `save_dir` field).

The `trace_file` field takes the location of `trace.txt` (which is `./trace.txt` by default).


Copyright (c) Microsoft Corporation. All rights reserved. Licensed under the MIT license.

@ -0,0 +1,124 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT license.

import torch
import torch.nn as nn
import torch.utils.data as data
import torch.backends.cudnn as cudnn
import torchvision.transforms as transforms

import os
import time
import argparse
import numpy as np
from PIL import Image
import cv2

os.environ['IS_QVGA_MONO'] = '1'
from data.choose_config import cfg
cfg = cfg.cfg

from utils.augmentations import to_chw_bgr
from layers import *


parser = argparse.ArgumentParser(description='Face Detection from quantized model\'s output.')
parser.add_argument('--save_dir', type=str, default='results/',
                    help='Directory for detect result')
parser.add_argument('--thresh', default=0.45, type=float,
                    help='Final confidence threshold')
parser.add_argument('--image_dir', default="images/", type=str, help='Folder containing image(s)')
parser.add_argument('--trace_file', default="trace.txt", type=str, help='File containing output traces')
parser.add_argument('--scale', type=str, default='0', help='Scale for the output image (use scaleForY variable value from the generated m3dump/scales.h file here)')

args = parser.parse_args()

if not os.path.exists(args.save_dir):
    os.makedirs(args.save_dir)


def detect(loc, conf, img_path, thresh):
    features_maps = []
    for i in range(len(loc)):
        feat = []
        feat += [loc[i].size(1), loc[i].size(2)]
        features_maps += [feat]

    priorbox = PriorBox(torch.Size([240, 320]), features_maps, cfg)
    priors = priorbox.forward()
    softmax = nn.Softmax(dim=-1)

    loc = torch.cat([o.view(o.size(0), -1) for o in loc], 1)
    conf = torch.cat([o.view(o.size(0), -1) for o in conf], 1)

    output = detect_function(cfg,
                             loc.view(loc.size(0), -1, 4),
                             softmax(conf.view(conf.size(0), -1, 2)),
                             priors)
    detections = output.data

    img = cv2.imread(img_path, cv2.IMREAD_COLOR)
    img = cv2.resize(img, (1280, 960))

    scale = torch.Tensor([img.shape[1], img.shape[0],
                          img.shape[1], img.shape[0]])

    for i in range(detections.size(1)):
        j = 0
        while detections[0, i, j, 0] >= thresh:
            score = detections[0, i, j, 0]
            pt = (detections[0, i, j, 1:] * scale).cpu().numpy()
            left_up, right_bottom = (pt[0], pt[1]), (pt[2], pt[3])
            j += 1
            cv2.rectangle(img, left_up, right_bottom, (0, 0, 255), 5)
            conf = "{:.3f}".format(score)
            point = (int(left_up[0]), int(left_up[1] - 5))
            cv2.putText(img, conf, point, cv2.FONT_HERSHEY_COMPLEX,
                        1, (0, 255, 0), 1)

    cv2.imwrite(os.path.join(args.save_dir, os.path.basename(img_path)), img)


if __name__ == '__main__':
    img_path = args.image_dir
    img_list = [os.path.join(img_path, x)
                for x in os.listdir(img_path)]
    a = open(args.trace_file)
    l = a.readlines()
    preds = np.zeros((len(img_list), 18000))

    j = -1
    for i in l:
        j += 1
        nums = i.strip().split(' ')
        c = []
        for n in nums:
            c.append(float(n) * 2 ** (int(args.scale)))
        preds[j, :] = np.array(c)

    countr = 0
    for path in sorted(img_list):
        temp_preds = preds[countr]
        conf = [torch.Tensor(temp_preds[0:2400].reshape(1, 30, 40, 2)),
                torch.Tensor(temp_preds[2400:4800].reshape(1, 30, 40, 2)),
                torch.Tensor(temp_preds[4800:5400].reshape(1, 15, 20, 2)),
                torch.Tensor(temp_preds[5400:6000].reshape(1, 15, 20, 2))]

        loc = [torch.Tensor(temp_preds[6000:10800].reshape(1, 30, 40, 4)),
               torch.Tensor(temp_preds[10800:15600].reshape(1, 30, 40, 4)),
               torch.Tensor(temp_preds[15600:16800].reshape(1, 15, 20, 4)),
               torch.Tensor(temp_preds[16800:18000].reshape(1, 15, 20, 4))]

        detect(loc, conf, path, args.thresh)
        countr += 1
|
@ -0,0 +1,69 @@
|
|||
# Copyright (c) Microsoft Corporation. All rights reserved.
|
||||
# Licensed under the MIT license.
|
||||
|
||||
import torch
|
||||
import argparse
import cv2
import numpy as np
from PIL import Image
import os

os.environ['IS_QVGA_MONO'] = '1'
from data.choose_config import cfg

cfg = cfg.cfg

parser = argparse.ArgumentParser(description='Generating input to quantized face detection code')
parser.add_argument('--image_dir', default="images", type=str, help='Folder containing image(s)')
parser.add_argument('--out_dir', default="input", type=str, help='Folder containing the CSV files')
args = parser.parse_args()

if not os.path.exists(args.out_dir):
    os.makedirs(args.out_dir)

img_list = [os.path.join(args.image_dir, x)
            for x in os.listdir(args.image_dir)]

xoutfile = open(os.path.join(args.out_dir, "X.csv"), "w")

for image_path in sorted(img_list):
    img = Image.open(image_path)
    img = img.convert('RGB')
    img = np.array(img)
    scale = 1

    max_im_shrink_x = 320 / (img.shape[1])
    max_im_shrink_y = 240 / (img.shape[0])

    image = cv2.resize(img, None, None, fx=max_im_shrink_x,
                       fy=max_im_shrink_y, interpolation=cv2.INTER_LINEAR)

    if len(image.shape) == 3:
        image = np.swapaxes(image, 1, 2)
        image = np.swapaxes(image, 1, 0)
    # RGB to BGR
    x = image[[2, 1, 0], :, :]

    x = x.astype('float32')
    x -= cfg.img_mean
    x = x[[2, 1, 0], :, :]
    x = 0.299 * x[0] + 0.587 * x[1] + 0.114 * x[2]
    x /= scale
    x = np.rint(x).astype(int)

    for i in range(240):
        for j in range(320):
            if i == 239 and j == 319:
                xoutfile.write(str(x[i, j]) + "\n")
            else:
                xoutfile.write(str(x[i, j]) + ', ')
xoutfile.close()

youtfile = open(os.path.join(args.out_dir, "Y.csv"), "w")
for _ in range(len(img_list)):
    for i in range(18000):
        if i == 17999:
            youtfile.write("0\n")
        else:
            youtfile.write("0, ")
youtfile.close()
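The per-pixel arithmetic above (mean subtraction in BGR order, then a luma-weighted grayscale) can be isolated into a small sketch. The `img_mean` values below are hypothetical stand-ins for `cfg.img_mean`:

```python
import numpy as np

# Hypothetical stand-in for cfg.img_mean (per-channel B, G, R means).
img_mean = np.array([104.0, 117.0, 123.0], dtype='float32').reshape(3, 1, 1)

def preprocess(image):
    # image: 240x320x3 RGB uint8 array (already resized to QVGA).
    x = np.transpose(image, (2, 0, 1))        # HWC -> CHW
    x = x[[2, 1, 0], :, :].astype('float32')  # RGB -> BGR
    x -= img_mean                             # subtract per-channel mean
    x = x[[2, 1, 0], :, :]                    # back to RGB order
    # Standard luma weights collapse the channels to one grayscale plane.
    x = 0.299 * x[0] + 0.587 * x[1] + 0.114 * x[2]
    return np.rint(x).astype(int)             # integer pixels for the CSV

out = preprocess(np.zeros((240, 320, 3), dtype='uint8'))
```

Each image thus becomes a single 240x320 row of integers in `X.csv`.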

@@ -0,0 +1,128 @@
#! /usr/bin/env python

# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT license.

# This file is a fixer that fixes the inaccuracy created
# during multiple training instances of the same model.

import numpy as np
import os, sys
import argparse
import re


def parse():
    parser = argparse.ArgumentParser(description='Modify SeeDot input file')
    parser.add_argument('--seedot_file', type=str, metavar='',
                        help='path to .sd file (including file name)')
    parser.add_argument('--model_dir', type=str, metavar='',
                        help='path to model files directory')
    parser.add_argument('--dataset_dir', type=str, metavar='',
                        help='path to data files directory (the directory with train.npy and test.npy)')
    parser.add_argument("-n", "--numOutputs", type=int, metavar='',
                        help='The number of outputs that the model under consideration produces', default=1)
    parser.add_argument('--normalise_data', action='store_true',
                        help='Normalise the input train and test files.')

    return parser.parse_args()


def readModelWeights(model_dir, dataset_dir, numOutputs, normalise_data):
    filelist = os.listdir(os.path.join(os.getcwd(), model_dir))
    cur_dir = os.getcwd()
    os.chdir(model_dir)
    filelist = [x for x in filelist if x[-4:] == '.npy']
    weight_min_max_dict = {}
    for filename in filelist:
        f = np.load(filename).flatten()
        if (len(f) == 1):
            m1 = 1.0 / (1.0 + np.exp(-1 * f[0]))
            weight_min_max_dict[filename[:-4]] = [m1]
        else:
            m1 = np.min(f)
            m2 = np.max(f)
            weight_min_max_dict[filename[:-4]] = [m1, m2]

    os.chdir(cur_dir)
    os.chdir(dataset_dir)

    train = np.load("train.npy")
    Xtrain = train[:, numOutputs:]

    test = np.load("test.npy")
    Xtest = test[:, numOutputs:]

    if normalise_data:
        mean = np.mean(Xtrain, 0)
        std = np.std(Xtrain, 0)
        std[std[:] < 0.000001] = 1

        Xtrain = (Xtrain - mean) / std
        Xtest = (Xtest - mean) / std

    m1 = np.min(Xtrain)
    m2 = np.max(Xtrain)

    m1 = min(m1, np.min(Xtest))
    m2 = max(m2, np.max(Xtest))
    weight_min_max_dict['X'] = [m1, m2]

    if normalise_data:
        train[:, numOutputs:] = Xtrain
        test[:, numOutputs:] = Xtest

        np.save("train.npy", train)
        np.save("test.npy", test)

    os.chdir(cur_dir)

    return weight_min_max_dict


def getVar(line, weights_dict):
    replace = False
    new_line = None
    if line.count('=') == 1:
        left, right = line.split('=')
        left = left.strip()
        var = left.split(' ')[-1].split('\t')[-1]
        right = right.strip()
        if var in weights_dict.keys():
            replace = True
            weights = weights_dict[var]
            if len(weights) == 1:
                new_line = "let " + var + " = " + "%.20f" % (weights[0]) + " in"
            else:
                shape = line[line.find('('):line.find(')') + 1]
                new_line = "let " + var + " = " + shape + " in [" +\
                    "%.20f" % (weights[0]) + ", " + "%.20f" % (weights[1]) + "] in"
    return replace, new_line


def writeToInputDotSD(file, dir):
    os.chdir(dir)
    f = open("input.sd", "w")

    for i in range(len(file)):
        f.write(file[i] + "\n")
    f.close()


def run(args):
    input_file = open(args.seedot_file).read().split("\n")

    model_weights_dict = readModelWeights(args.model_dir, args.dataset_dir, args.numOutputs, args.normalise_data)

    for i in range(len(input_file)):
        line = input_file[i]
        replace, new_line = getVar(line, model_weights_dict)
        if replace:
            input_file[i] = new_line
    writeToInputDotSD(input_file, args.model_dir)


if __name__ == '__main__':
    args = parse()
    run(args)
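To make the string surgery in `getVar` concrete, here is a minimal standalone sketch of the two-value case: the declared shape `(...)` is kept from the original `let` binding while the `[min, max]` range is replaced with freshly computed values. The variable name, shape, and weights below are hypothetical:

```python
def rewrite_binding(line, weights):
    # Extract the bound variable name and keep the declared shape '(...)',
    # then splice in a new [min, max] range, mirroring getVar() above.
    var = line.split('=')[0].strip().split(' ')[-1]
    shape = line[line.find('('):line.find(')') + 1]
    return ("let " + var + " = " + shape + " in ["
            + "%.20f" % weights[0] + ", " + "%.20f" % weights[1] + "] in")

new_line = rewrite_binding("let W = (10, 5) in [-0.3, 0.9] in", [-0.5, 1.25])
```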

@@ -0,0 +1,73 @@
bonsai,int8,cifar-binary,63.61,-5
bonsai,int8,cr-binary,68.399,-5
bonsai,int8,cr-multiclass,8.564,-6
bonsai,int8,curet-multiclass,22.024,-6
bonsai,int8,letter-multiclass,40.14,-4
bonsai,int8,mnist-binary,83.88,-5
bonsai,int8,mnist-multiclass,77.97,-5
bonsai,int8,usps-binary,63.528,-6
bonsai,int8,usps-multiclass,83.906,-5
bonsai,int8,ward-binary,79.13,-4
bonsai,int16,cifar-binary,75.49,-10
bonsai,int16,cr-binary,74.019,-11
bonsai,int16,cr-multiclass,10.339,-12
bonsai,int16,curet-multiclass,43.835,-11
bonsai,int16,letter-multiclass,64.76,-10
bonsai,int16,mnist-binary,95.94,-10
bonsai,int16,mnist-multiclass,93.36,-11
bonsai,int16,usps-binary,94.619,-11
bonsai,int16,usps-multiclass,91.679,-11
bonsai,int16,ward-binary,95.132,-8
bonsai,int32,cifar-binary,75.63,-19
bonsai,int32,cr-binary,73.913,-23
bonsai,int32,cr-multiclass,10.809,-18
bonsai,int32,curet-multiclass,45.189,-24
bonsai,int32,letter-multiclass,64.92,-24
bonsai,int32,mnist-binary,96.01,-20
bonsai,int32,mnist-multiclass,93.43,-25
bonsai,int32,usps-binary,94.519,-15
bonsai,int32,usps-multiclass,91.978,-21
bonsai,int32,ward-binary,95.236,-20
bonsai,float,cifar-binary,75.48,
bonsai,float,cr-binary,73.86,
bonsai,float,cr-multiclass,10.757,
bonsai,float,curet-multiclass,45.545,
bonsai,float,letter-multiclass,64.88,
bonsai,float,mnist-binary,95.98,
bonsai,float,mnist-multiclass,93.38,
bonsai,float,usps-binary,94.619,
bonsai,float,usps-multiclass,91.978,
bonsai,float,ward-binary,95.287,
protonn,int16,cifar-binary,75.4,-6
protonn,int16,cr-binary,70.785,-2
protonn,int16,cr-multiclass,31.593,-6
protonn,int16,curet-multiclass,50.463,-2
protonn,int16,letter-multiclass,76.72,-6
protonn,int16,mnist-binary,94.28,-6
protonn,int16,mnist-multiclass,91.44,-6
protonn,int16,usps-binary,94.17,-6
protonn,int16,usps-multiclass,91.978,-1
protonn,int16,ward-binary,94.2,-2
protonn,int32,cifar-binary,76.46,-13
protonn,int32,cr-binary,73.171,-14
protonn,int32,cr-multiclass,32.846,-10
protonn,int32,curet-multiclass,54.455,-15
protonn,int32,letter-multiclass,84.06,-17
protonn,int32,mnist-binary,95.22,-20
protonn,int32,mnist-multiclass,92.03,-19
protonn,int32,usps-binary,94.32,-13
protonn,int32,usps-multiclass,92.576,-13
protonn,int32,ward-binary,94.925,-14
protonn,float,cifar-binary,76.45,
protonn,float,cr-binary,72.906,
protonn,float,cr-multiclass,32.742,
protonn,float,curet-multiclass,54.526,
protonn,float,letter-multiclass,84.02,
protonn,float,mnist-binary,95.21,
protonn,float,mnist-multiclass,92.06,
protonn,float,usps-binary,94.27,
protonn,float,usps-multiclass,92.526,
protonn,float,ward-binary,94.873,
lenet,int16,cifar-multiclass,,
lenet,int32,cifar-multiclass,,
lenet,float32,cifar-multiclass,,

@@ -0,0 +1,7 @@
antlr4-python3-runtime==4.7
bokeh==2.1.1
numpy==1.16.4
scikit-learn==0.19.2
scipy==1.1.0
onnx==1.8.0
tqdm==4.56.0

@@ -0,0 +1,334 @@
## Ignore Visual Studio temporary files, build results, and
## files generated by popular Visual Studio add-ons.
##
## Get latest from https://github.com/github/gitignore/blob/master/VisualStudio.gitignore

# User-specific files
*.rsuser
*.suo
*.user
*.userosscache
*.sln.docstates

# User-specific files (MonoDevelop/Xamarin Studio)
*.userprefs

# Build results
[Dd]ebug/
[Dd]ebugPublic/
[Rr]elease/
[Rr]eleases/
x64/
x86/
bld/
[Bb]in/
[Oo]bj/
[Ll]og/

# Visual Studio 2015/2017 cache/options directory
.vs/
# Uncomment if you have tasks that create the project's static files in wwwroot
#wwwroot/

# Visual Studio 2017 auto generated files
Generated\ Files/

# MSTest test Results
[Tt]est[Rr]esult*/
[Bb]uild[Ll]og.*

# NUNIT
*.VisualState.xml
TestResult.xml

# Build Results of an ATL Project
[Dd]ebugPS/
[Rr]eleasePS/
dlldata.c

# Benchmark Results
BenchmarkDotNet.Artifacts/

# .NET Core
project.lock.json
project.fragment.lock.json
artifacts/

# StyleCop
StyleCopReport.xml

# Files built by Visual Studio
*_i.c
*_p.c
*_h.h
*.ilk
*.meta
*.obj
*.iobj
*.pch
*.pdb
*.ipdb
*.pgc
*.pgd
*.rsp
*.sbr
*.tlb
*.tli
*.tlh
*.tmp
*.tmp_proj
*_wpftmp.csproj
*.log
*.vspscc
*.vssscc
.builds
*.pidb
*.svclog
*.scc

# Chutzpah Test files
_Chutzpah*

# Visual C++ cache files
ipch/
*.aps
*.ncb
*.opendb
*.opensdf
*.sdf
*.cachefile
*.VC.db
*.VC.VC.opendb

# Visual Studio profiler
*.psess
*.vsp
*.vspx
*.sap

# Visual Studio Trace Files
*.e2e

# TFS 2012 Local Workspace
$tf/

# Guidance Automation Toolkit
*.gpState

# ReSharper is a .NET coding add-in
_ReSharper*/
*.[Rr]e[Ss]harper
*.DotSettings.user

# JustCode is a .NET coding add-in
.JustCode

# TeamCity is a build add-in
_TeamCity*

# DotCover is a Code Coverage Tool
*.dotCover

# AxoCover is a Code Coverage Tool
.axoCover/*
!.axoCover/settings.json

# Visual Studio code coverage results
*.coverage
*.coveragexml

# NCrunch
_NCrunch_*
.*crunch*.local.xml
nCrunchTemp_*

# MightyMoose
*.mm.*
AutoTest.Net/

# Web workbench (sass)
.sass-cache/

# Installshield output folder
[Ee]xpress/

# DocProject is a documentation generator add-in
DocProject/buildhelp/
DocProject/Help/*.HxT
DocProject/Help/*.HxC
DocProject/Help/*.hhc
DocProject/Help/*.hhk
DocProject/Help/*.hhp
DocProject/Help/Html2
DocProject/Help/html

# Click-Once directory
publish/

# Publish Web Output
*.[Pp]ublish.xml
*.azurePubxml
# Note: Comment the next line if you want to checkin your web deploy settings,
# but database connection strings (with potential passwords) will be unencrypted
*.pubxml
*.publishproj

# Microsoft Azure Web App publish settings. Comment the next line if you want to
# checkin your Azure Web App publish settings, but sensitive information contained
# in these scripts will be unencrypted
PublishScripts/

# NuGet Packages
*.nupkg
# The packages folder can be ignored because of Package Restore
**/[Pp]ackages/*
# except build/, which is used as an MSBuild target.
!**/[Pp]ackages/build/
# Uncomment if necessary however generally it will be regenerated when needed
#!**/[Pp]ackages/repositories.config
# NuGet v3's project.json files produces more ignorable files
*.nuget.props
*.nuget.targets

# Microsoft Azure Build Output
csx/
*.build.csdef

# Microsoft Azure Emulator
ecf/
rcf/

# Windows Store app package directories and files
AppPackages/
BundleArtifacts/
Package.StoreAssociation.xml
_pkginfo.txt
*.appx

# Visual Studio cache files
# files ending in .cache can be ignored
*.[Cc]ache
# but keep track of directories ending in .cache
!*.[Cc]ache/

# Others
ClientBin/
~$*
*~
*.dbmdl
*.dbproj.schemaview
*.jfm
*.pfx
*.publishsettings
orleans.codegen.cs

# Including strong name files can present a security risk
# (https://github.com/github/gitignore/pull/2483#issue-259490424)
#*.snk

# Since there are multiple workflows, uncomment next line to ignore bower_components
# (https://github.com/github/gitignore/pull/1529#issuecomment-104372622)
#bower_components/

# RIA/Silverlight projects
Generated_Code/

# Backup & report files from converting an old project file
# to a newer Visual Studio version. Backup files are not needed,
# because we have git ;-)
_UpgradeReport_Files/
Backup*/
UpgradeLog*.XML
UpgradeLog*.htm
ServiceFabricBackup/
*.rptproj.bak

# SQL Server files
*.mdf
*.ldf
*.ndf

# Business Intelligence projects
*.rdl.data
*.bim.layout
*.bim_*.settings
*.rptproj.rsuser

# Microsoft Fakes
FakesAssemblies/

# GhostDoc plugin setting file
*.GhostDoc.xml

# Node.js Tools for Visual Studio
.ntvs_analysis.dat
node_modules/

# Visual Studio 6 build log
*.plg

# Visual Studio 6 workspace options file
*.opt

# Visual Studio 6 auto-generated workspace file (contains which files were open etc.)
*.vbw

# Visual Studio LightSwitch build output
**/*.HTMLClient/GeneratedArtifacts
**/*.DesktopClient/GeneratedArtifacts
**/*.DesktopClient/ModelManifest.xml
**/*.Server/GeneratedArtifacts
**/*.Server/ModelManifest.xml
_Pvt_Extensions

# Paket dependency manager
.paket/paket.exe
paket-files/

# FAKE - F# Make
.fake/

# JetBrains Rider
.idea/
*.sln.iml

# CodeRush personal settings
.cr/personal

# Python Tools for Visual Studio (PTVS)
__pycache__/
*.pyc

# Cake - Uncomment if you are using it
# tools/**
# !tools/packages.config

# Tabs Studio
*.tss

# Telerik's JustMock configuration file
*.jmconfig

# BizTalk build output
*.btp.cs
*.btm.cs
*.odx.cs
*.xsd.cs

# OpenCover UI analysis results
OpenCover/

# Azure Stream Analytics local run output
ASALocalRun/

# MSBuild Binary and Structured Log
*.binlog

# NVidia Nsight GPU debugger configuration file
*.nvuser

# MFractors (Xamarin productivity tool) working folder
.mfractor/

# Local History for Visual Studio
.localhistory/

@@ -0,0 +1,5 @@
input/
output/

info.txt
input.sd

@@ -2,29 +2,38 @@
 # Licensed under the MIT license.
 
 CC=g++
-CFLAGS= -Wall -p -g -fPIC -O3 -std=c++11
+CFLAGS=-Wall -Wno-narrowing -p -g -fPIC -O3 -std=c++11 -pthread
 
-PREDICTOR_INCLUDES = bonsai_float_model.h \
-                     datatypes.h \
-                     library.h predictors.h \
-                     profile.h \
-                     protonn_float_model.h \
-                     seedot_fixed_model.h
-
-PREDICTOR_OBJS = bonsai_float.o library.o \
-                 main.o profile.o \
-                 protonn_float.o \
-                 seedot_fixed.o
+PREDICTOR_INCLUDES=datatypes.h \
+                   library_fixed.h \
+                   library_float.h \
+                   model_fixed.h \
+                   model_float.h \
+                   predictors.h \
+                   profile.h \
+                   vars_fixed.h \
+                   vars_float.h \
+
+PREDICTOR_OBJS=debug.o \
+               library_fixed.o \
+               library_float.o \
+               main.o \
+               profile.o \
+               seedot_fixed.o \
+               seedot_float.o \
 
 all: Predictor
 
-Predictor: bonsai_float.o library.o main.o profile.o protonn_float.o seedot_fixed.o
+Predictor: $(PREDICTOR_OBJS)
 	$(CC) -o $@ $^ $(CFLAGS)
 
-bonsai_float.o: bonsai_float.cpp $(PREDICTOR_INCLUDES)
+debug.o: debug.cpp $(PREDICTOR_INCLUDES)
 	$(CC) -c -o $@ $(CFLAGS) $<
 
-library.o: library.cpp $(PREDICTOR_INCLUDES)
+library_fixed.o: library_fixed.cpp $(PREDICTOR_INCLUDES)
 	$(CC) -c -o $@ $(CFLAGS) $<
 
+library_float.o: library_float.cpp $(PREDICTOR_INCLUDES)
+	$(CC) -c -o $@ $(CFLAGS) $<
+
 main.o: main.cpp $(PREDICTOR_INCLUDES)

@@ -33,10 +42,10 @@ main.o: main.cpp $(PREDICTOR_INCLUDES)
 profile.o: profile.cpp $(PREDICTOR_INCLUDES)
 	$(CC) -c -o $@ $(CFLAGS) $<
 
-protonn_float.o: protonn_float.cpp $(PREDICTOR_INCLUDES)
+seedot_fixed.o: seedot_fixed.cpp $(PREDICTOR_INCLUDES)
 	$(CC) -c -o $@ $(CFLAGS) $<
 
-seedot_fixed.o: seedot_fixed.cpp $(PREDICTOR_INCLUDES)
+seedot_float.o: seedot_float.cpp $(PREDICTOR_INCLUDES)
 	$(CC) -c -o $@ $(CFLAGS) $<
 
 clean:

@@ -0,0 +1,146 @@
<?xml version="1.0" encoding="utf-8"?>
<Project DefaultTargets="Build" ToolsVersion="15.0" xmlns="http://schemas.microsoft.com/developer/msbuild/2003">
  <ItemGroup Label="ProjectConfigurations">
    <ProjectConfiguration Include="Debug|Win32">
      <Configuration>Debug</Configuration>
      <Platform>Win32</Platform>
    </ProjectConfiguration>
    <ProjectConfiguration Include="Release|Win32">
      <Configuration>Release</Configuration>
      <Platform>Win32</Platform>
    </ProjectConfiguration>
    <ProjectConfiguration Include="Debug|x64">
      <Configuration>Debug</Configuration>
      <Platform>x64</Platform>
    </ProjectConfiguration>
    <ProjectConfiguration Include="Release|x64">
      <Configuration>Release</Configuration>
      <Platform>x64</Platform>
    </ProjectConfiguration>
  </ItemGroup>
  <PropertyGroup Label="Globals">
    <VCProjectVersion>15.0</VCProjectVersion>
    <ProjectGuid>{FCF52402-0D2F-43E7-BB43-213252444552}</ProjectGuid>
    <RootNamespace>Predictor</RootNamespace>
    <WindowsTargetPlatformVersion>10.0</WindowsTargetPlatformVersion>
  </PropertyGroup>
  <Import Project="$(VCTargetsPath)\Microsoft.Cpp.Default.props" />
  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Debug|Win32'" Label="Configuration">
    <ConfigurationType>Application</ConfigurationType>
    <UseDebugLibraries>true</UseDebugLibraries>
    <PlatformToolset>v142</PlatformToolset>
    <CharacterSet>MultiByte</CharacterSet>
  </PropertyGroup>
  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Release|Win32'" Label="Configuration">
    <ConfigurationType>Application</ConfigurationType>
    <UseDebugLibraries>false</UseDebugLibraries>
    <PlatformToolset>v142</PlatformToolset>
    <WholeProgramOptimization>true</WholeProgramOptimization>
    <CharacterSet>MultiByte</CharacterSet>
  </PropertyGroup>
  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Debug|x64'" Label="Configuration">
    <ConfigurationType>Application</ConfigurationType>
    <UseDebugLibraries>true</UseDebugLibraries>
    <PlatformToolset>v142</PlatformToolset>
    <CharacterSet>MultiByte</CharacterSet>
  </PropertyGroup>
  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Release|x64'" Label="Configuration">
    <ConfigurationType>Application</ConfigurationType>
    <UseDebugLibraries>false</UseDebugLibraries>
    <PlatformToolset>v142</PlatformToolset>
    <WholeProgramOptimization>true</WholeProgramOptimization>
    <CharacterSet>MultiByte</CharacterSet>
  </PropertyGroup>
  <Import Project="$(VCTargetsPath)\Microsoft.Cpp.props" />
  <ImportGroup Label="ExtensionSettings">
  </ImportGroup>
  <ImportGroup Label="Shared">
  </ImportGroup>
  <ImportGroup Label="PropertySheets" Condition="'$(Configuration)|$(Platform)'=='Debug|Win32'">
    <Import Project="$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props" Condition="exists('$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props')" Label="LocalAppDataPlatform" />
  </ImportGroup>
  <ImportGroup Label="PropertySheets" Condition="'$(Configuration)|$(Platform)'=='Release|Win32'">
    <Import Project="$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props" Condition="exists('$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props')" Label="LocalAppDataPlatform" />
  </ImportGroup>
  <ImportGroup Label="PropertySheets" Condition="'$(Configuration)|$(Platform)'=='Debug|x64'">
    <Import Project="$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props" Condition="exists('$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props')" Label="LocalAppDataPlatform" />
  </ImportGroup>
  <ImportGroup Label="PropertySheets" Condition="'$(Configuration)|$(Platform)'=='Release|x64'">
    <Import Project="$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props" Condition="exists('$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props')" Label="LocalAppDataPlatform" />
  </ImportGroup>
  <PropertyGroup Label="UserMacros" />
  <PropertyGroup />
  <ItemDefinitionGroup Condition="'$(Configuration)|$(Platform)'=='Debug|Win32'">
    <ClCompile>
      <WarningLevel>Level3</WarningLevel>
      <Optimization>Disabled</Optimization>
      <SDLCheck>true</SDLCheck>
      <ConformanceMode>true</ConformanceMode>
    </ClCompile>
  </ItemDefinitionGroup>
  <ItemDefinitionGroup Condition="'$(Configuration)|$(Platform)'=='Debug|x64'">
    <ClCompile>
      <WarningLevel>Level3</WarningLevel>
      <Optimization>Disabled</Optimization>
      <SDLCheck>true</SDLCheck>
      <ConformanceMode>true</ConformanceMode>
    </ClCompile>
    <Link>
      <SubSystem>Console</SubSystem>
      <StackReserveSize>20971520</StackReserveSize>
    </Link>
  </ItemDefinitionGroup>
  <ItemDefinitionGroup Condition="'$(Configuration)|$(Platform)'=='Release|Win32'">
    <ClCompile>
      <WarningLevel>Level3</WarningLevel>
      <Optimization>MaxSpeed</Optimization>
      <FunctionLevelLinking>true</FunctionLevelLinking>
      <IntrinsicFunctions>true</IntrinsicFunctions>
      <SDLCheck>true</SDLCheck>
      <ConformanceMode>true</ConformanceMode>
    </ClCompile>
    <Link>
      <EnableCOMDATFolding>true</EnableCOMDATFolding>
      <OptimizeReferences>true</OptimizeReferences>
    </Link>
  </ItemDefinitionGroup>
  <ItemDefinitionGroup Condition="'$(Configuration)|$(Platform)'=='Release|x64'">
    <ClCompile>
      <WarningLevel>Level3</WarningLevel>
      <Optimization>MaxSpeed</Optimization>
      <FunctionLevelLinking>true</FunctionLevelLinking>
      <IntrinsicFunctions>true</IntrinsicFunctions>
      <SDLCheck>true</SDLCheck>
      <ConformanceMode>true</ConformanceMode>
    </ClCompile>
    <Link>
      <EnableCOMDATFolding>true</EnableCOMDATFolding>
      <OptimizeReferences>true</OptimizeReferences>
      <SubSystem>Console</SubSystem>
      <StackReserveSize>10485760</StackReserveSize>
    </Link>
  </ItemDefinitionGroup>
  <ItemGroup>
    <ClCompile Include="debug.cpp" />
    <ClCompile Include="library_fixed.cpp" />
    <ClCompile Include="library_float.cpp" />
    <ClCompile Include="main.cpp" />
    <ClCompile Include="profile.cpp" />
    <ClCompile Include="seedot_fixed.cpp" />
    <ClCompile Include="seedot_float.cpp" />
  </ItemGroup>
  <ItemGroup>
    <ClInclude Include="library_fixed.h" />
    <ClInclude Include="datatypes.h" />
    <ClInclude Include="library_float.h" />
    <ClInclude Include="model_fixed.h" />
    <ClInclude Include="model_float.h" />
    <ClInclude Include="predictors.h" />
    <ClInclude Include="profile.h" />
    <ClInclude Include="vars_fixed.h" />
    <ClInclude Include="vars_float.h" />
  </ItemGroup>
  <Import Project="$(VCTargetsPath)\Microsoft.Cpp.targets" />
  <ImportGroup Label="ExtensionTargets">
  </ImportGroup>
</Project>

@@ -0,0 +1,69 @@
<?xml version="1.0" encoding="utf-8"?>
<Project ToolsVersion="4.0" xmlns="http://schemas.microsoft.com/developer/msbuild/2003">
  <ItemGroup>
    <Filter Include="Source Files">
      <UniqueIdentifier>{4FC737F1-C7A5-4376-A066-2A32D752A2FF}</UniqueIdentifier>
      <Extensions>cpp;c;cc;cxx;def;odl;idl;hpj;bat;asm;asmx</Extensions>
    </Filter>
    <Filter Include="Header Files">
      <UniqueIdentifier>{93995380-89BD-4b04-88EB-625FBE52EBFB}</UniqueIdentifier>
      <Extensions>h;hh;hpp;hxx;hm;inl;inc;ipp;xsd</Extensions>
    </Filter>
    <Filter Include="Resource Files">
      <UniqueIdentifier>{67DA6AB6-F800-4c08-8B7A-83BB121AAD01}</UniqueIdentifier>
      <Extensions>rc;ico;cur;bmp;dlg;rc2;rct;bin;rgs;gif;jpg;jpeg;jpe;resx;tiff;tif;png;wav;mfcribbon-ms</Extensions>
    </Filter>
  </ItemGroup>
  <ItemGroup>
    <ClCompile Include="main.cpp">
      <Filter>Source Files</Filter>
    </ClCompile>
    <ClCompile Include="profile.cpp">
      <Filter>Source Files</Filter>
    </ClCompile>
    <ClCompile Include="seedot_fixed.cpp">
      <Filter>Source Files</Filter>
    </ClCompile>
    <ClCompile Include="library_fixed.cpp">
      <Filter>Source Files</Filter>
    </ClCompile>
    <ClCompile Include="library_float.cpp">
      <Filter>Source Files</Filter>
    </ClCompile>
    <ClCompile Include="seedot_float.cpp">
      <Filter>Source Files</Filter>
    </ClCompile>
    <ClCompile Include="debug.cpp">
      <Filter>Source Files</Filter>
    </ClCompile>
  </ItemGroup>
  <ItemGroup>
    <ClInclude Include="datatypes.h">
      <Filter>Header Files</Filter>
    </ClInclude>
    <ClInclude Include="profile.h">
      <Filter>Header Files</Filter>
    </ClInclude>
    <ClInclude Include="predictors.h">
      <Filter>Header Files</Filter>
    </ClInclude>
    <ClInclude Include="library_fixed.h">
      <Filter>Header Files</Filter>
    </ClInclude>
    <ClInclude Include="library_float.h">
      <Filter>Header Files</Filter>
    </ClInclude>
    <ClInclude Include="model_fixed.h">
      <Filter>Header Files</Filter>
    </ClInclude>
    <ClInclude Include="model_float.h">
      <Filter>Header Files</Filter>
    </ClInclude>
    <ClInclude Include="vars_fixed.h">
      <Filter>Header Files</Filter>
    </ClInclude>
    <ClInclude Include="vars_float.h">
      <Filter>Header Files</Filter>
    </ClInclude>
  </ItemGroup>
</Project>

@@ -1,143 +0,0 @@
// Copyright (c) Microsoft Corporation. All rights reserved.
// Licensed under the MIT license.

#include <iostream>
#include <cstring>

#include "datatypes.h"
#include "predictors.h"
#include "bonsai_float_model.h"

#define TANH 0

using namespace std;
using namespace bonsai_float;

int bonsaiFloat(float *X) {
	MYINT ite_idx, ite_val, index;

	float ZX[d];
	memset(ZX, 0, sizeof(float) * d);

	ite_idx = 0;
	ite_val = 0;
	// Dimensionality reduction
	for (MYINT i = 0; i < D; i++) {
		float input = X[i];

#if B_SPARSE_Z
		index = Zidx[ite_idx];
		while (index != 0) {
			ZX[index - 1] += Zval[ite_val] * input;
			ite_idx++;
			ite_val++;
			index = Zidx[ite_idx];
		}
		ite_idx++;
#else
		for (MYINT j = 0; j < d; j++) {
			ZX[j] += Z[j][i] * input;
		}
#endif
	}

	for (MYINT i = 0; i < d; i++)
		ZX[i] -= mean[i];

	MYINT currNode = 0;
	float WZX[c], VZX[c], score[c];

	memset(score, 0, sizeof(float) * c);

	while (currNode < internalNodes) {

		memset(WZX, 0, sizeof(float) * c);
		memset(VZX, 0, sizeof(float) * c);

		// Accumulating score at each node
		for (MYINT i = 0; i < d; i++) {
			for (MYINT j = currNode * c; j < (currNode + 1) * c; j++) {
				WZX[j % c] += W[j][i] * ZX[i];
				VZX[j % c] += V[j][i] * ZX[i];
			}
		}

		for (MYINT i = 0; i < c; i++) {
			float t;
			if (VZX[i] > tanh_limit)
				t = tanh_limit;
			else if (VZX[i] < -tanh_limit)
				t = -tanh_limit;
			else
				t = VZX[i];

#if TANH
			score[i] += WZX[i] * tanh(VZX[i]);
#else
			score[i] += WZX[i] * t;
#endif
		}

		// Computing theta value for branching into a child node
		float val = 0;
		for (MYINT i = 0; i < d; i++)
			val += T[currNode][i] * ZX[i];

		if (val > 0)
			currNode = 2 * currNode + 1;
		else
			currNode = 2 * currNode + 2;
	}

	memset(WZX, 0, sizeof(float) * c);
	memset(VZX, 0, sizeof(float) * c);

	// Accumulating score for the last node
	for (MYINT i = 0; i < d; i++) {
		for (MYINT j = currNode * c; j < (currNode + 1) * c; j++) {
			WZX[j % c] += W[j][i] * ZX[i];
			VZX[j % c] += V[j][i] * ZX[i];
		}
	}

	for (MYINT i = 0; i < c; i++) {
		float t;
		if (VZX[i] > tanh_limit)
			t = tanh_limit;
		else if (VZX[i] < -tanh_limit)
			t = -tanh_limit;
		else
			t = VZX[i];

#if TANH
		score[i] += WZX[i] * tanh(VZX[i]);
#else
		score[i] += WZX[i] * t;
#endif
	}

	MYINT classID;

	// Finding the class ID
	// If binary classification, the sign of the score is used
	// If multiclass classification, argmax of score is used
	if (c <= 2) {
		if (score[0] > 0)
			classID = 1;
		else
			classID = 0;
	}
	else {
		float max = score[0];
		MYINT maxI = 0;
		for (MYINT i = 1; i < c; i++) {
			if (score[i] > max) {
				max = score[i];
				maxI = i;
			}
		}
		classID = maxI;
	}

	return classID;
}
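The class-ID rule at the end of `bonsaiFloat` (sign of the single score for binary models, argmax over scores for multiclass) can be sketched on its own:

```python
# Minimal sketch of Bonsai's final decision rule: binary models (c <= 2)
# use the sign of the single accumulated score, multiclass models take
# the argmax over per-class scores.
def class_id(score, c):
    if c <= 2:
        return 1 if score[0] > 0 else 0
    best = 0
    for i in range(1, len(score)):
        if score[i] > score[best]:
            best = i
    return best
```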
File diffs hidden because one or more lines are too long.

@@ -4,15 +4,22 @@
#pragma once

#define INT16
//#define INT32


#ifdef INT16
typedef int16_t MYINT;
#endif

#ifdef INT32
typedef int32_t MYINT;
#endif

typedef int16_t MYITE;
typedef uint16_t MYUINT;

const int scaleForX = -12;

const bool debugMode = false;

const bool logProgramOutput = false;

const int scalesForX[16] = {-12, -12, -12, -12, -12, -12, -12, -12, -12, -12, -12, -12};

const int scaleForY = 0;

const int scalesForY[16] = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};

//#define SATURATE
//#define FASTAPPROX
//#define FLOATEXP
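A sketch of how these scale constants are usually interpreted. The convention assumed here (a real value r is stored as the integer p = round(r * 2^-scale), so r ≈ p * 2^scale; scaleForX = -12 thus gives 12 fractional bits in an int16) matches SeeDot's fixed-point scheme as I understand it; treat it as illustrative:

```python
# Assumed convention: real value r <-> int16 p with r ~= p * 2**scale.
def to_fixed(r, scale=-12):
    p = int(round(r * 2.0 ** (-scale)))
    assert -32768 <= p <= 32767, "value does not fit int16 at this scale"
    return p

def to_float(p, scale=-12):
    return p * 2.0 ** scale
```

Under this reading, a more negative scale buys precision at the cost of a smaller representable range.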

@@ -0,0 +1,13 @@
// Copyright (c) Microsoft Corporation. All rights reserved.
// Licensed under the MIT license.

#include <iostream>

#include "datatypes.h"
#include "profile.h"
#include "vars_fixed.h"
#include "vars_float.h"

using namespace std;

void debug() {}
@ -1,540 +0,0 @@
|
|||
// Copyright (c) Microsoft Corporation. All rights reserved.
|
||||
// Licensed under the MIT license.
|
||||
|
||||
#include <iostream>
|
||||
|
||||
#include "datatypes.h"
|
||||
#include "library.h"
|
||||
|
||||
// This file contains implementations of the linear algebra operators supported by SeeDot.
|
||||
// Each function takes the scaling factors as arguments along with the pointers to the operands.
|
||||
|
||||
// C = A + B
|
||||
void MatAdd(MYINT *A, MYINT *B, MYINT *C, MYINT I, MYINT J, MYINT shrA, MYINT shrB, MYINT shrC) {
|
||||
for (MYINT i = 0; i < I; i++) {
|
||||
for (MYINT j = 0; j < J; j++) {
|
||||
MYINT a = A[i * J + j];
|
||||
MYINT b = B[i * J + j];
|
||||
|
||||
a = a / shrA;
|
||||
b = b / shrB;
|
||||
|
||||
MYINT c = a + b;
|
||||
c = c / shrC;
|
||||
|
||||
C[i * J + j] = c;
|
||||
}
|
||||
}
|
||||
return;
|
||||
}
|
||||
|
||||
// C = A - B
|
||||
void MatSub(MYINT *A, const MYINT *B, MYINT *C, MYINT I, MYINT J, MYINT shrA, MYINT shrB, MYINT shrC) {
|
||||
for (MYINT i = 0; i < I; i++) {
|
||||
for (MYINT j = 0; j < J; j++) {
|
||||
MYINT a = A[i * J + j];
|
||||
MYINT b = B[i * J + j];
|
||||
|
||||
a = a / shrA;
|
||||
b = b / shrB;
|
||||
|
||||
MYINT c = a - b;
|
||||
c = c / shrC;
|
||||
|
||||
C[i * J + j] = c;
|
||||
}
|
||||
}
|
||||
return;
|
||||
}
|
||||
|
||||
// C = A * B
|
||||
void MatMulNN(MYINT *A, MYINT *B, MYINT *C, MYINT *tmp, MYINT I, MYINT K, MYINT J, MYINT shrA, MYINT shrB, MYINT H1, MYINT H2) {
|
||||
|
||||
for (MYINT i = 0; i < I; i++) {
|
||||
for (MYINT j = 0; j < J; j++) {
|
||||
for (MYINT k = 0; k < K; k++) {
|
||||
MYINT a = A[i * K + k];
|
||||
MYINT b = B[k * J + j];
|
||||
|
||||
a = a / shrA;
|
||||
b = b / shrB;
|
||||
|
||||
tmp[k] = a * b;
|
||||
}
|
||||
|
||||
MYINT count = K, depth = 0;
|
||||
bool shr = true;
|
||||
|
||||
while (depth < (H1 + H2)) {
|
||||
if (depth >= H1)
|
||||
shr = false;
|
||||
|
||||
for (MYINT p = 0; p < (K / 2 + 1); p++) {
|
||||
MYINT sum;
|
||||
if (p < (count >> 1))
|
||||
sum = tmp[2 * p] + tmp[(2 * p) + 1];
|
||||
else if ((p == (count >> 1)) && ((count & 1) == 1))
|
||||
sum = tmp[2 * p];
|
||||
else
|
||||
sum = 0;
|
||||
|
||||
if (shr)
|
||||
tmp[p] = sum / 2;
|
||||
else
|
||||
tmp[p] = sum;
|
||||
}
|
||||
count = (count + 1) >> 1;
|
||||
|
||||
depth++;
|
||||
}
|
||||
|
||||
C[i * J + j] = tmp[0];
|
||||
}
|
||||
}
|
||||
return;
|
||||
}
|
||||
|
||||
// C = A * B
|
||||
void MatMulCN(const MYINT *A, MYINT *B, MYINT *C, MYINT *tmp, MYINT I, MYINT K, MYINT J, MYINT shrA, MYINT shrB, MYINT H1, MYINT H2) {
|
||||
|
||||
for (MYINT i = 0; i < I; i++) {
|
||||
for (MYINT j = 0; j < J; j++) {
|
||||
for (MYINT k = 0; k < K; k++) {
|
||||
MYINT a = A[i * K + k];
|
||||
MYINT b = B[k * J + j];
|
||||
|
||||
a = a / shrA;
|
||||
b = b / shrB;
|
||||
|
||||
tmp[k] = a * b;
|
||||
}
|
||||
|
||||
MYINT count = K, depth = 0;
|
||||
bool shr = true;
|
||||
|
||||
while (depth < (H1 + H2)) {
|
||||
if (depth >= H1)
|
||||
shr = false;
|
||||
|
||||
for (MYINT p = 0; p < (K / 2 + 1); p++) {
|
||||
MYINT sum;
|
||||
if (p < (count >> 1))
|
||||
sum = tmp[2 * p] + tmp[(2 * p) + 1];
|
||||
else if ((p == (count >> 1)) && ((count & 1) == 1))
|
||||
sum = tmp[2 * p];
|
||||
else
|
||||
sum = 0;
|
||||
|
||||
if (shr)
|
||||
tmp[p] = sum / 2;
|
||||
else
|
||||
tmp[p] = sum;
|
||||
}
|
||||
count = (count + 1) >> 1;
|
||||
|
||||
depth++;
|
||||
}
|
||||
|
||||
C[i * J + j] = tmp[0];
|
||||
}
|
||||
}
|
||||
return;
|
||||
}
|
||||
|
||||
// C = A * B
|
||||
void MatMulNC(MYINT *A, const MYINT *B, MYINT *C, MYINT *tmp, MYINT I, MYINT K, MYINT J, MYINT shrA, MYINT shrB, MYINT H1, MYINT H2) {
|
||||
|
||||
for (MYINT i = 0; i < I; i++) {
|
||||
for (MYINT j = 0; j < J; j++) {
|
||||
for (MYINT k = 0; k < K; k++) {
|
||||
MYINT a = A[i * K + k];
|
||||
MYINT b = B[k * J + j];
|
||||
|
||||
a = a / shrA;
|
||||
b = b / shrB;
|
||||
|
||||
tmp[k] = a * b;
|
||||
}
|
||||
|
||||
MYINT count = K, depth = 0;
|
||||
bool shr = true;
|
||||
|
||||
while (depth < (H1 + H2)) {
|
||||
if (depth >= H1)
|
||||
shr = false;
|
||||
|
||||
for (MYINT p = 0; p < (K / 2 + 1); p++) {
|
||||
MYINT sum;
|
||||
if (p < (count >> 1))
|
||||
sum = tmp[2 * p] + tmp[(2 * p) + 1];
|
||||
else if ((p == (count >> 1)) && ((count & 1) == 1))
|
||||
sum = tmp[2 * p];
|
||||
else
|
||||
sum = 0;
|
||||
|
||||
if (shr)
|
||||
tmp[p] = sum / 2;
|
||||
else
|
||||
tmp[p] = sum;
|
||||
}
|
||||
count = (count + 1) >> 1;
|
||||
|
||||
depth++;
|
||||
}
|
||||
|
||||
C[i * J + j] = tmp[0];
|
||||
}
|
||||
}
|
||||
return;
|
||||
}
|
||||
|
||||
// C = A * B
|
||||
void MatMulCC(const MYINT *A, const MYINT *B, MYINT *C, MYINT *tmp, MYINT I, MYINT K, MYINT J, MYINT shrA, MYINT shrB, MYINT H1, MYINT H2) {
|
||||
|
||||
for (MYINT i = 0; i < I; i++) {
|
||||
for (MYINT j = 0; j < J; j++) {
|
||||
for (MYINT k = 0; k < K; k++) {
|
||||
MYINT a = A[i * K + k];
|
||||
MYINT b = B[k * J + j];
|
||||
|
||||
a = a / shrA;
|
||||
b = b / shrB;
|
||||
|
||||
tmp[k] = a * b;
|
||||
}
|
||||
|
||||
MYINT count = K, depth = 0;
|
||||
bool shr = true;
|
||||
|
||||
while (depth < (H1 + H2)) {
|
||||
if (depth >= H1)
|
||||
shr = false;
|
||||
|
||||
for (MYINT p = 0; p < (K / 2 + 1); p++) {
|
||||
MYINT sum;
|
||||
if (p < (count >> 1))
|
||||
sum = tmp[2 * p] + tmp[(2 * p) + 1];
|
||||
else if ((p == (count >> 1)) && ((count & 1) == 1))
|
||||
sum = tmp[2 * p];
|
||||
else
|
||||
sum = 0;
|
||||
|
||||
if (shr)
|
||||
tmp[p] = sum / 2;
|
||||
else
|
||||
tmp[p] = sum;
|
||||
}
|
||||
count = (count + 1) >> 1;
|
||||
|
||||
depth++;
|
||||
}
|
||||
|
||||
C[i * J + j] = tmp[0];
|
||||
}
|
||||
}
|
||||
return;
|
||||
}
|
||||
|
||||
// C = A |*| B
|
||||
void SparseMatMul(const MYINT *Aidx, const MYINT *Aval, MYINT **B, MYINT *C, MYINT K, MYINT shrA, MYINT shrB, MYINT shrC) {
|
||||
|
||||
MYINT ite_idx = 0, ite_val = 0;
|
||||
for (MYINT k = 0; k < K; k++) {
|
||||
// MYINT b = getIntFeature(k);
|
||||
MYINT b = B[k * 1][0];
|
||||
b = b / shrB;
|
||||
|
||||
MYINT idx = Aidx[ite_idx];
|
||||
while (idx != 0) {
|
||||
MYINT a = Aval[ite_val];
|
||||
a = a / shrA;
|
||||
|
||||
MYINT c = a * b;
|
||||
c = c / shrC;
|
||||
|
||||
C[idx - 1] += c;
|
||||
|
||||
ite_idx++;
|
||||
ite_val++;
|
||||
|
||||
idx = Aidx[ite_idx];
|
||||
}
|
||||
ite_idx++;
|
||||
}
|
||||
|
||||
return;
|
||||
}
|
||||
|
||||
// C = A <*> B
|
||||
void MulCir(MYINT *A, MYINT *B, MYINT *C, MYINT I, MYINT J, MYINT shrA, MYINT shrB) {
|
||||
for (MYINT i = 0; i < I; i++) {
|
||||
for (MYINT j = 0; j < J; j++) {
|
||||
MYINT a = A[i * J + j];
|
||||
MYINT b = B[i * J + j];
|
||||
|
||||
a = a / shrA;
|
||||
b = b / shrB;
|
||||
|
||||
C[i * J + j] = a * b;
|
||||
}
|
||||
}
|
||||
return;
|
||||
}
|
||||
|
||||
// A = tanh(A)
|
||||
void TanH(MYINT *A, MYINT I, MYINT J, MYINT tanh_limit) {
|
||||
for (MYINT i = 0; i < I; i++) {
|
||||
for (MYINT j = 0; j < J; j++) {
|
||||
MYINT x = A[i * J + j], y;
|
||||
|
||||
if (x >= tanh_limit)
|
||||
y = tanh_limit;
|
||||
else if (x <= -tanh_limit)
|
||||
y = -tanh_limit;
|
||||
else
|
||||
y = x;
|
||||
|
||||
A[i * J + j] = y;
|
||||
}
|
||||
}
|
||||
return;
|
||||
}
|
||||
|
||||
// index = argmax(A)
|
||||
void ArgMax(MYINT *A, MYINT I, MYINT J, MYINT *index) {
|
||||
|
||||
MYINT max = A[0], maxIndex = 0;
|
||||
MYINT counter = 0;
|
||||
for (MYINT i = 0; i < I; i++) {
|
||||
for (MYINT j = 0; j < J; j++) {
|
||||
MYINT x = A[i * J + j];
|
||||
|
||||
if (max < x) {
|
||||
maxIndex = counter;
|
||||
max = x;
|
||||
}
|
||||
|
||||
counter++;
|
||||
}
|
||||
}
|
||||
|
||||
*index = maxIndex;
|
||||
|
||||
return;
|
||||
}
|
||||
|
||||
// A = A^T
|
||||
void Transpose(MYINT *A, MYINT *B, MYINT I, MYINT J) {
|
||||
for (MYINT i = 0; i < I; i++) {
|
||||
for (MYINT j = 0; j < J; j++) {
|
||||
B[i * J + j] = A[j * I + i];
|
||||
}
|
||||
}
|
||||
return;
|
||||
}
|
||||
|
||||
// C = a * B
|
||||
void ScalarMul(MYINT *A, MYINT *B, MYINT *C, MYINT I, MYINT J, MYINT shrA, MYINT shrB) {
|
||||
|
||||
MYINT a = *A;
|
||||
a = a / shrA;
|
||||
|
||||
for (MYINT i = 0; i < I; i++) {
|
||||
for (MYINT j = 0; j < J; j++) {
|
||||
MYINT b = B[i * J + j];
|
||||
b = b / shrB;
|
||||
|
||||
C[i * J + j] = a * b;
|
||||
}
|
||||
}
|
||||
|
||||
return;
|
||||
}
|
||||
|
||||
// C = A # B
|
||||
// A[N][H][W][CI], B[HF][WF][CI][CO], C[N][H][W][CO]
|
||||
void Conv(MYINT *A, const MYINT *B, MYINT *C, MYINT *tmp, MYINT N, MYINT H, MYINT W, MYINT CI, MYINT HF, MYINT WF, MYINT CO, MYINT shrA, MYINT shrB, MYINT H1, MYINT H2) {
|
||||
MYINT padH = (HF - 1) / 2;
|
||||
MYINT padW = (WF - 1) / 2;
|
||||
|
||||
for (MYINT n = 0; n < N; n++) {
|
||||
for (MYINT h = 0; h < H; h++) {
|
||||
for (MYINT w = 0; w < W; w++) {
|
||||
for (MYINT co = 0; co < CO; co++) {
|
||||
|
||||
MYINT counter = 0;
|
||||
for (MYINT hf = 0; hf < HF; hf++) {
|
||||
for (MYINT wf = 0; wf < WF; wf++) {
|
||||
for (MYINT ci = 0; ci < CI; ci++) {
|
||||
MYINT a = (((((h + hf) < padH) || ((h + hf) >= (H + padH))) || (((w + wf) < padW) || ((w + wf) >= (W + padW)))) ? 0 : A[n * H * W * CI + ((h + hf) - padH) * W * CI + ((w + wf) - padW) * CI + ci]);
|
||||
a = a / shrA;
|
||||
|
||||
MYINT b = B[hf * WF * CI * CO + wf * CI * CO + ci * CO + co];
|
||||
b = b / shrB;
|
||||
|
||||
tmp[counter] = a * b;
|
||||
counter++;
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
MYINT totalEle = HF * WF * CI;
|
||||
MYINT count = HF * WF * CI, depth = 0;
|
||||
bool shr = true;
|
||||
|
||||
while (depth < (H1 + H2)) {
|
||||
if (depth >= H1)
|
||||
shr = false;
|
||||
|
||||
for (MYINT p = 0; p < (totalEle / 2 + 1); p++) {
|
||||
MYINT sum;
|
||||
if (p < (count >> 1))
|
||||
sum = tmp[2 * p] + tmp[(2 * p) + 1];
|
||||
else if ((p == (count >> 1)) && ((count & 1) == 1))
|
||||
sum = tmp[2 * p];
|
||||
else
|
||||
sum = 0;
|
||||
|
||||
if (shr)
|
||||
tmp[p] = sum / 2;
|
||||
else
|
||||
tmp[p] = sum;
|
||||
}
|
||||
count = (count + 1) >> 1;
|
||||
|
||||
depth++;
|
||||
}
|
||||
|
||||
C[n * H * W * CO + h * W * CO + w * CO + co] = tmp[0];
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
return;
|
||||
}
|
||||
|
||||
// A = A <+> B
|
||||
// A[N][H][W][C], B[C]
|
||||
void AddOrSubCir4D(MYINT *A, const MYINT *B, MYINT N, MYINT H, MYINT W, MYINT C, MYINT shrA, MYINT shrB, MYINT shrC, bool add) {
|
||||
|
||||
for (MYINT n = 0; n < N; n++) {
|
||||
for (MYINT h = 0; h < H; h++) {
|
||||
for (MYINT w = 0; w < W; w++) {
|
||||
for (MYINT c = 0; c < C; c++) {
|
||||
MYINT a = A[n * H * W * C + h * W * C + w * C + c];
|
||||
a = a / shrA;
|
||||
|
||||
MYINT b = B[c];
|
||||
b = b / shrB;
|
||||
|
||||
MYINT res;
|
||||
if (add)
|
||||
res = a + b;
|
||||
else
|
||||
res = a - b;
|
||||
|
||||
res = res / shrC;
|
||||
|
||||
A[n * H * W * C + h * W * C + w * C + c] = res;
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
return;
|
||||
}
|
||||
|
||||
// A = A <+> B
|
||||
// A[N][H][W][C], B[C]
|
||||
void AddOrSubCir2D(MYINT *A, const MYINT *B, MYINT H, MYINT W, MYINT shrA, MYINT shrB, MYINT shrC, bool add) {
|
||||
|
||||
for (MYINT h = 0; h < H; h++) {
|
||||
for (MYINT w = 0; w < W; w++) {
|
||||
MYINT a = A[h * W + w];
|
||||
a = a / shrA;
|
||||
|
||||
MYINT b = B[w];
|
||||
b = b / shrB;
|
||||
|
||||
MYINT res;
|
||||
if (add)
|
||||
res = a + b;
|
||||
else
|
||||
res = a - b;
|
||||
|
||||
res = res / shrC;
|
||||
|
||||
A[h * W + w] = res;
|
||||
}
|
||||
}
|
||||
|
||||
return;
|
||||
}
|
||||
|
||||
// A = relu(A)
|
||||
// A[N][H][W][C]
|
||||
void Relu4D(MYINT *A, MYINT N, MYINT H, MYINT W, MYINT C) {
|
||||
|
||||
for (MYINT n = 0; n < N; n++) {
|
||||
for (MYINT h = 0; h < H; h++) {
|
||||
for (MYINT w = 0; w < W; w++) {
|
||||
for (MYINT c = 0; c < C; c++) {
|
||||
MYINT a = A[n * H * W * C + h * W * C + w * C + c];
|
||||
if (a < 0)
|
||||
a = 0;
|
||||
|
||||
A[n * H * W * C + h * W * C + w * C + c] = a;
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
return;
|
||||
}
|
||||
|
||||
// A = relu(A)
|
||||
// A[N][H][W][C]
|
||||
void Relu2D(MYINT *A, MYINT H, MYINT W) {
|
||||
|
||||
for (MYINT h = 0; h < H; h++) {
|
||||
for (MYINT w = 0; w < W; w++) {
|
||||
MYINT a = A[h * W + w];
|
||||
if (a < 0)
|
||||
a = 0;
|
||||
|
||||
A[h * W + w] = a;
|
||||
}
|
||||
}
|
||||
|
||||
return;
|
||||
}
|
||||
|
||||
// B = maxpool(A)
|
||||
// A[N][H][W][C], B[N][H][W][C]
|
||||
void Maxpool(MYINT *A, MYINT *B, MYINT N, MYINT H, MYINT W, MYINT C, MYINT stride) {
|
||||
MYINT HO = H / stride;
|
||||
MYINT WO = W / stride;
|
||||
|
||||
for (MYINT n = 0; n < N; n++) {
|
||||
for (MYINT ho = 0; ho < HO; ho++) {
|
||||
for (MYINT wo = 0; wo < WO; wo++) {
|
||||
for (MYINT c = 0; c < C; c++) {
|
||||
|
||||
MYINT max = A[n * H * W * C + (stride * ho) * W * C + (stride * wo) * C + c];
|
||||
for (MYINT hs = 0; hs < stride; hs++) {
|
||||
for (MYINT ws = 0; ws < stride; ws++) {
|
||||
MYINT a = A[n * H * W * C + ((stride * ho) + hs) * W * C + ((stride * wo) + ws) * C + c];
|
||||
if (a > max)
|
||||
max = a;
|
||||
}
|
||||
}
|
||||
|
||||
B[n * HO * WO * C + ho * WO * C + wo * C + c] = max;
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
return;
|
||||
}
|
|
@ -1,40 +0,0 @@
|
|||
// Copyright (c) Microsoft Corporation. All rights reserved.
|
||||
// Licensed under the MIT license.
|
||||
|
||||
#pragma once
|
||||
|
||||
void MatAdd(MYINT *A, MYINT *B, MYINT *C, MYINT I, MYINT J, MYINT shrA, MYINT shrB, MYINT shrC);
|
||||
|
||||
void MatSub(MYINT *A, const MYINT *B, MYINT *C, MYINT I, MYINT J, MYINT shrA, MYINT shrB, MYINT shrC);
|
||||
|
||||
void MatMulNN(MYINT *A, MYINT *B, MYINT *C, MYINT *tmp, MYINT I, MYINT K, MYINT J, MYINT shrA, MYINT shrB, MYINT H1, MYINT H2);
|
||||
|
||||
void MatMulCN(const MYINT *A, MYINT *B, MYINT *C, MYINT *tmp, MYINT I, MYINT K, MYINT J, MYINT shrA, MYINT shrB, MYINT H1, MYINT H2);
|
||||
|
||||
void MatMulNC(MYINT *A, const MYINT *B, MYINT *C, MYINT *tmp, MYINT I, MYINT K, MYINT J, MYINT shrA, MYINT shrB, MYINT H1, MYINT H2);
|
||||
|
||||
void MatMulCC(const MYINT *A, const MYINT *B, MYINT *C, MYINT *tmp, MYINT I, MYINT K, MYINT J, MYINT shrA, MYINT shrB, MYINT H1, MYINT H2);
|
||||
|
||||
void SparseMatMul(const MYINT *Aidx, const MYINT *Aval, MYINT **B, MYINT *C, MYINT K, MYINT shrA, MYINT shrB, MYINT shrC);
|
||||
|
||||
void MulCir(MYINT *A, MYINT *B, MYINT *C, MYINT I, MYINT J, MYINT shrA, MYINT shrB);
|
||||
|
||||
void TanH(MYINT *A, MYINT I, MYINT J, MYINT tanh_limit);
|
||||
|
||||
void ArgMax(MYINT *A, MYINT I, MYINT J, MYINT *index);
|
||||
|
||||
void Transpose(MYINT *A, MYINT *B, MYINT I, MYINT J);
|
||||
|
||||
void ScalarMul(MYINT *A, MYINT *B, MYINT *C, MYINT I, MYINT J, MYINT shrA, MYINT shrB);
|
||||
|
||||
void Conv(MYINT *A, const MYINT *B, MYINT *C, MYINT *tmp, MYINT N, MYINT H, MYINT W, MYINT CI, MYINT HF, MYINT WF, MYINT CO, MYINT shrA, MYINT shrB, MYINT H1, MYINT H2);
|
||||
|
||||
void AddOrSubCir4D(MYINT *A, const MYINT *B, MYINT N, MYINT H, MYINT W, MYINT C, MYINT shrA, MYINT shrB, MYINT shrC, bool add);
|
||||
|
||||
void AddOrSubCir2D(MYINT *A, const MYINT *B, MYINT H, MYINT W, MYINT shrA, MYINT shrB, MYINT shrC, bool add);
|
||||
|
||||
void Relu4D(MYINT *A, MYINT N, MYINT H, MYINT W, MYINT C);
|
||||
|
||||
void Relu2D(MYINT *A, MYINT H, MYINT W);
|
||||
|
||||
void Maxpool(MYINT *A, MYINT *B, MYINT N, MYINT H, MYINT W, MYINT C, MYINT stride);
|
File diff suppressed because it is too large
File diff suppressed because it is too large
@ -0,0 +1,975 @@
|
|||
// Copyright (c) Microsoft Corporation. All rights reserved.
|
||||
// Licensed under the MIT license.
|
||||
|
||||
#include <iostream>
|
||||
#include <cmath>
|
||||
|
||||
#include "datatypes.h"
|
||||
#include "library_float.h"
|
||||
#include "profile.h"
|
||||
|
||||
// This file contains floating point implementations of operations supported by SeeDot.
|
||||
|
||||
// C = A + B
|
||||
void MatAddNN(float* A, float* B, float* C, MYINT I, MYINT J, MYINT shrA, MYINT shrB, MYINT shrC) {
|
||||
for (MYITE i = 0; i < I; i++) {
|
||||
for (MYITE j = 0; j < J; j++) {
|
||||
float a = A[i * J + j];
|
||||
float b = B[i * J + j];
|
||||
|
||||
float c = a + b;
|
||||
|
||||
C[i * J + j] = c;
|
||||
}
|
||||
}
|
||||
return;
|
||||
}
|
||||
|
||||
// C = A + B
|
||||
void MatAddCN(const float* A, float* B, float* C, MYINT I, MYINT J, MYINT shrA, MYINT shrB, MYINT shrC) {
|
||||
for (MYITE i = 0; i < I; i++) {
|
||||
for (MYITE j = 0; j < J; j++) {
|
||||
float a = A[i * J + j];
|
||||
float b = B[i * J + j];
|
||||
|
||||
float c = a + b;
|
||||
|
||||
C[i * J + j] = c;
|
||||
}
|
||||
}
|
||||
return;
|
||||
}
|
||||
|
||||
// C = A + B
|
||||
void MatAddNC(float* A, const float* B, float* C, MYINT I, MYINT J, MYINT shrA, MYINT shrB, MYINT shrC) {
|
||||
for (MYITE i = 0; i < I; i++) {
|
||||
for (MYITE j = 0; j < J; j++) {
|
||||
float a = A[i * J + j];
|
||||
float b = B[i * J + j];
|
||||
|
||||
float c = a + b;
|
||||
|
||||
C[i * J + j] = c;
|
||||
}
|
||||
}
|
||||
return;
|
||||
}
|
||||
|
||||
// C = A + B
|
||||
void MatAddCC(const float* A, const float* B, float* C, MYINT I, MYINT J, MYINT shrA, MYINT shrB, MYINT shrC) {
|
||||
for (MYITE i = 0; i < I; i++) {
|
||||
for (MYITE j = 0; j < J; j++) {
|
||||
float a = A[i * J + j];
|
||||
float b = B[i * J + j];
|
||||
|
||||
float c = a + b;
|
||||
|
||||
C[i * J + j] = c;
|
||||
}
|
||||
}
|
||||
return;
|
||||
}
|
||||
|
||||
// C = a + B
|
||||
void MatAddBroadCastA(float* A, float* B, float* C, MYINT I, MYINT J, MYINT shrA, MYINT shrB, MYINT shrC) {
|
||||
for (MYITE i = 0; i < I; i++) {
|
||||
for (MYITE j = 0; j < J; j++) {
|
||||
float a = *A;
|
||||
float b = B[i * J + j];
|
||||
|
||||
float c = a + b;
|
||||
|
||||
C[i * J + j] = c;
|
||||
}
|
||||
}
|
||||
return;
|
||||
}
|
||||
|
||||
// C = A + b
|
||||
void MatAddBroadCastB(float* A, float* B, float* C, MYINT I, MYINT J, MYINT shrA, MYINT shrB, MYINT shrC) {
|
||||
for (MYITE i = 0; i < I; i++) {
|
||||
for (MYITE j = 0; j < J; j++) {
|
||||
float a = A[i * J + j];
|
||||
float b = *B;
|
||||
|
||||
float c = a + b;
|
||||
|
||||
C[i * J + j] = c;
|
||||
}
|
||||
}
|
||||
return;
|
||||
}
|
||||
|
||||
// C = A + B
|
||||
void MatAdd4(float* A, float* B, float* X, MYINT N, MYINT H, MYINT W, MYINT C, MYINT shrA, MYINT shrB, MYINT shrC) {
|
||||
for (MYITE n = 0; n < N; n++) {
|
||||
for (MYITE h = 0; h < H; h++) {
|
||||
for (MYITE w = 0; w < W; w++) {
|
||||
for (MYITE c = 0; c < C; c++) {
|
||||
float a = A[n * H * W * C + h * W * C + w * C + c];
|
||||
float b = B[n * H * W * C + h * W * C + w * C + c];
|
||||
|
||||
float x = a + b;
|
||||
|
||||
X[n * H * W * C + h * W * C + w * C + c] = x;
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
return;
|
||||
}
|
||||
|
||||
// C = A - B
|
||||
void MatSub(float* A, const float* B, float* C, MYINT I, MYINT J, MYINT shrA, MYINT shrB, MYINT shrC) {
|
||||
for (MYITE i = 0; i < I; i++) {
|
||||
for (MYITE j = 0; j < J; j++) {
|
||||
float a = A[i * J + j];
|
||||
float b = B[i * J + j];
|
||||
|
||||
float c = a - b;
|
||||
|
||||
C[i * J + j] = c;
|
||||
}
|
||||
}
|
||||
return;
|
||||
}
|
||||
|
||||
// C = a - B
|
||||
void MatSubBroadCastA(float* A, float* B, float* C, MYINT I, MYINT J, MYINT shrA, MYINT shrB, MYINT shrC) {
|
||||
for (MYITE i = 0; i < I; i++) {
|
||||
for (MYITE j = 0; j < J; j++) {
|
||||
float a = *A;
|
||||
float b = B[i * J + j];
|
||||
|
||||
float c = a - b;
|
||||
|
||||
C[i * J + j] = c;
|
||||
}
|
||||
}
|
||||
return;
|
||||
}
|
||||
|
||||
// C = A - b
|
||||
void MatSubBroadCastB(float* A, float* B, float* C, MYINT I, MYINT J, MYINT shrA, MYINT shrB, MYINT shrC) {
|
||||
for (MYITE i = 0; i < I; i++) {
|
||||
for (MYITE j = 0; j < J; j++) {
|
||||
float a = A[i * J + j];
|
||||
float b = *B;
|
||||
|
||||
float c = a - b;
|
||||
|
||||
C[i * J + j] = c;
|
||||
}
|
||||
}
|
||||
return;
|
||||
}
|
||||
|
||||
// C = A * B
|
||||
void MatMulNN(float* A, float* B, float* C, float* tmp, MYINT I, MYINT K, MYINT J, MYINT shrA, MYINT shrB, MYINT H1, MYINT H2) {
|
||||
for (MYITE i = 0; i < I; i++) {
|
||||
for (MYITE j = 0; j < J; j++) {
|
||||
for (MYITE k = 0; k < K; k++) {
|
||||
float a = A[i * K + k];
|
||||
float b = B[k * J + j];
|
||||
|
||||
tmp[k] = a * b;
|
||||
}
|
||||
|
||||
MYITE count = K, depth = 0;
|
||||
bool shr = true;
|
||||
|
||||
while (depth < (H1 + H2)) {
|
||||
if (depth >= H1) {
|
||||
shr = false;
|
||||
}
|
||||
|
||||
for (MYITE p = 0; p < (K / 2 + 1); p++) {
|
||||
float sum;
|
||||
if (p < (count >> 1)) {
|
||||
sum = tmp[2 * p] + tmp[(2 * p) + 1];
|
||||
} else if ((p == (count >> 1)) && ((count & 1) == 1)) {
|
||||
sum = tmp[2 * p];
|
||||
} else {
|
||||
sum = 0;
|
||||
}
|
||||
|
||||
if (shr) {
|
||||
tmp[p] = sum;
|
||||
} else {
|
||||
tmp[p] = sum;
|
||||
}
|
||||
}
|
||||
|
||||
count = (count + 1) >> 1;
|
||||
depth++;
|
||||
}
|
||||
|
||||
C[i * J + j] = tmp[0];
|
||||
}
|
||||
}
|
||||
return;
|
||||
}
|
||||
|
||||
// C = A * B
|
||||
void MatMulCN(const float* A, float* B, float* C, float* tmp, MYINT I, MYINT K, MYINT J, MYINT shrA, MYINT shrB, MYINT H1, MYINT H2) {
|
||||
for (MYITE i = 0; i < I; i++) {
|
||||
for (MYITE j = 0; j < J; j++) {
|
||||
for (MYITE k = 0; k < K; k++) {
|
||||
float a = A[i * K + k];
|
||||
float b = B[k * J + j];
|
||||
|
||||
tmp[k] = a * b;
|
||||
}
|
||||
|
||||
MYITE count = K, depth = 0;
|
||||
bool shr = true;
|
||||
|
||||
while (depth < (H1 + H2)) {
|
||||
if (depth >= H1) {
|
||||
shr = false;
|
||||
}
|
||||
|
||||
for (MYITE p = 0; p < (K / 2 + 1); p++) {
|
||||
float sum;
|
||||
if (p < (count >> 1)) {
|
||||
sum = tmp[2 * p] + tmp[(2 * p) + 1];
|
||||
} else if ((p == (count >> 1)) && ((count & 1) == 1)) {
|
||||
sum = tmp[2 * p];
|
||||
} else {
|
||||
sum = 0;
|
||||
}
|
||||
|
||||
if (shr) {
|
||||
tmp[p] = sum;
|
||||
} else {
|
||||
tmp[p] = sum;
|
||||
}
|
||||
}
|
||||
|
||||
count = (count + 1) >> 1;
|
||||
depth++;
|
||||
}
|
||||
|
||||
C[i * J + j] = tmp[0];
|
||||
}
|
||||
}
|
||||
return;
|
||||
}
|
||||
|
||||
// C = A * B
|
||||
void MatMulNC(float* A, const float* B, float* C, float* tmp, MYINT I, MYINT K, MYINT J, MYINT shrA, MYINT shrB, MYINT H1, MYINT H2) {
|
||||
for (MYITE i = 0; i < I; i++) {
|
||||
for (MYITE j = 0; j < J; j++) {
|
||||
for (MYITE k = 0; k < K; k++) {
|
||||
float a = A[i * K + k];
|
||||
float b = B[k * J + j];
|
||||
|
||||
tmp[k] = a * b;
|
||||
}
|
||||
|
||||
MYITE count = K, depth = 0;
|
||||
bool shr = true;
|
||||
|
||||
while (depth < (H1 + H2)) {
|
||||
if (depth >= H1) {
|
||||
shr = false;
|
||||
}
|
||||
|
||||
for (MYITE p = 0; p < (K / 2 + 1); p++) {
|
||||
float sum;
|
||||
if (p < (count >> 1)) {
|
||||
sum = tmp[2 * p] + tmp[(2 * p) + 1];
|
||||
} else if ((p == (count >> 1)) && ((count & 1) == 1)) {
|
||||
sum = tmp[2 * p];
|
||||
} else {
|
||||
sum = 0;
|
||||
}
|
||||
|
||||
if (shr) {
|
||||
tmp[p] = sum;
|
||||
} else {
|
||||
tmp[p] = sum;
|
||||
}
|
||||
}
|
||||
|
||||
count = (count + 1) >> 1;
|
||||
depth++;
|
||||
}
|
||||
|
||||
C[i * J + j] = tmp[0];
|
||||
}
|
||||
}
|
||||
return;
|
||||
}
|
||||
|
||||
// C = A * B
|
||||
void MatMulCC(const float* A, const float* B, float* C, float* tmp, MYINT I, MYINT K, MYINT J, MYINT shrA, MYINT shrB, MYINT H1, MYINT H2) {
|
||||
for (MYITE i = 0; i < I; i++) {
|
||||
for (MYITE j = 0; j < J; j++) {
|
||||
for (MYITE k = 0; k < K; k++) {
|
||||
float a = A[i * K + k];
|
||||
float b = B[k * J + j];
|
||||
|
||||
tmp[k] = a * b;
|
||||
}
|
||||
|
||||
MYITE count = K, depth = 0;
|
||||
bool shr = true;
|
||||
|
||||
while (depth < (H1 + H2)) {
|
||||
if (depth >= H1) {
|
||||
shr = false;
|
||||
}
|
||||
|
||||
for (MYITE p = 0; p < (K / 2 + 1); p++) {
|
||||
float sum;
|
||||
if (p < (count >> 1)) {
|
||||
sum = tmp[2 * p] + tmp[(2 * p) + 1];
|
||||
} else if ((p == (count >> 1)) && ((count & 1) == 1)) {
|
||||
sum = tmp[2 * p];
|
||||
} else {
|
||||
sum = 0;
|
||||
}
|
||||
|
||||
if (shr) {
|
||||
tmp[p] = sum;
|
||||
} else {
|
||||
tmp[p] = sum;
|
||||
}
|
||||
}
|
||||
|
||||
count = (count + 1) >> 1;
|
||||
depth++;
|
||||
}
|
||||
|
||||
C[i * J + j] = tmp[0];
|
||||
}
|
||||
}
|
||||
return;
|
||||
}
|
||||
|
||||
// C = A |*| B
|
||||
void SparseMatMulX(const MYINT* Aidx, const float* Aval, float** B, float* C, int16_t K, MYINT shrA, MYINT shrB, MYINT shrC) {
|
||||
MYITE ite_idx = 0, ite_val = 0;
|
||||
for (MYITE k = 0; k < K; k++) {
|
||||
float b = B[k * 1][0];
|
||||
|
||||
MYINT idx = Aidx[ite_idx];
|
||||
while (idx != 0) {
|
||||
float a = Aval[ite_val];
|
||||
|
||||
float c = a * b;
|
||||
|
||||
C[idx - 1] += c;
|
||||
|
||||
ite_idx++;
|
||||
ite_val++;
|
||||
|
||||
idx = Aidx[ite_idx];
|
||||
}
|
||||
ite_idx++;
|
||||
}
|
||||
return;
|
||||
}
|
||||
|
||||
// C = A |*| B
|
||||
void SparseMatMul(const MYINT* Aidx, const float* Aval, float* B, float* C, int16_t K, MYINT shrA, MYINT shrB, MYINT shrC) {
|
||||
MYITE ite_idx = 0, ite_val = 0;
|
||||
for (MYITE k = 0; k < K; k++) {
|
||||
float b = B[k];
|
||||
|
||||
MYINT idx = Aidx[ite_idx];
|
||||
while (idx != 0) {
|
||||
float a = Aval[ite_val];
|
||||
|
||||
float c = a * b;
|
||||
|
||||
C[idx - 1] += c;
|
||||
|
||||
ite_idx++;
|
||||
ite_val++;
|
||||
|
||||
idx = Aidx[ite_idx];
|
||||
}
|
||||
ite_idx++;
|
||||
}
|
||||
|
||||
return;
|
||||
}
|
||||
|
||||
// C = A <*> B
|
||||
void MulCir(float* A, float* B, float* C, MYINT I, MYINT J, MYINT shrA, MYINT shrB) {
|
||||
for (MYITE i = 0; i < I; i++) {
|
||||
for (MYITE j = 0; j < J; j++) {
|
||||
float a = A[i * J + j];
|
||||
float b = B[i * J + j];
|
||||
|
||||
C[i * J + j] = a * b;
|
||||
}
|
||||
}
|
||||
return;
|
||||
}
|
||||
|
||||
// A = tanh(A)
|
||||
void TanH(float* A, MYINT I, MYINT J, float scale_in, float scale_out, float* B) {
|
||||
for (MYITE i = 0; i < I; i++) {
|
||||
for (MYITE j = 0; j < J; j++) {
|
||||
float x = A[i * J + j], y;
|
||||
|
||||
#ifdef FLOATEXP
|
||||
y = tanh(x);
|
||||
#else
|
||||
y = x > -1 ? x : -1;
|
||||
y = y < 1 ? y : 1;
|
||||
#endif
|
||||
|
||||
B[i * J + j] = y;
|
||||
}
|
||||
}
|
||||
return;
|
||||
}
|
||||
|
||||
// B = reverse(A, axis)
|
||||
void Reverse2(float* A, MYINT axis, MYINT I, MYINT J, float* B) {
|
||||
for (MYITE i = 0; i < I; i++) {
|
||||
for (MYITE j = 0; j < J; j++) {
|
||||
MYINT i_prime = (axis == 0 ? (I - 1 - i) : i);
|
||||
MYINT j_prime = (axis == 1 ? (J - 1 - j) : j);
|
||||
|
||||
B[i * J + j] = A[i_prime*J + j_prime];
|
||||
}
|
||||
}
|
||||
return;
|
||||
}
|
||||
|
||||
// index = argmax(A)
|
||||
void ArgMax(float* A, MYINT I, MYINT J, int* index) {
|
||||
float max = A[0];
|
||||
MYITE maxIndex = 0, counter = 0;
|
||||
for (MYITE i = 0; i < I; i++) {
|
||||
for (MYITE j = 0; j < J; j++) {
|
||||
float x = A[i * J + j];
|
||||
|
||||
if (max < x) {
|
||||
maxIndex = counter;
|
||||
max = x;
|
||||
}
|
||||
counter++;
|
||||
}
|
||||
}
|
||||
*index = maxIndex;
|
||||
return;
|
||||
}
|
||||
|
||||
// A = A^T
|
||||
void Transpose(float* A, float* B, MYINT I, MYINT J) {
|
||||
for (MYITE i = 0; i < I; i++) {
|
||||
for (MYITE j = 0; j < J; j++) {
|
||||
B[i * J + j] = A[j * I + i];
|
||||
}
|
||||
}
|
||||
return;
|
||||
}
|
||||
|
||||
// C = a * B
|
||||
void ScalarMul(float* A, float* B, float* C, MYINT I, MYINT J, MYINT shrA, MYINT shrB) {
|
||||
float a = *A;
|
||||
for (MYITE i = 0; i < I; i++) {
|
||||
for (MYITE j = 0; j < J; j++) {
|
||||
float b = B[i * J + j];
|
||||
C[i * J + j] = a * b;
|
||||
}
|
||||
}
|
||||
return;
|
||||
}
|
||||
|
||||
// C = MBConv(A, params)
|
||||
// A[N][H][W][Cin], C[N][Hout][Wout][Cout]
|
||||
// X[HF][W][Ct], T[Ct], U[max(Ct, Cin, HF*WF)]
|
||||
// F1[1][1][1][Cin][Ct], BN1W[Ct], BN1B[Ct]
|
||||
// F2[Ct][HF][WF][1][1], BN2W[Ct], BN2B[Ct]
|
||||
// F3[1][1][1][Ct][Cout], BN3W[Cout], BN3B[Cout]
|
||||
void MBConv(float* A, const float* F1, const float* BN1W, const float* BN1B, const float* F2, const float* BN2W, const float* BN2B, const float* F3, const float* BN3W, const float* BN3B, float* C, float* X, float* T, float* U, MYITE N, MYITE H, MYITE W, MYITE Cin, MYITE Ct, MYITE HF, MYITE WF, MYITE Cout, MYITE Hout, MYITE Wout, MYITE HPADL, MYITE HPADR, MYITE WPADL, MYITE WPADR, MYITE HSTR, MYITE WSTR, MYITE D1, MYITE D2, MYITE D3, MYINT SIX_1, MYINT SIX_2, MYINT shr1, MYINT shr2, MYINT shr3, MYINT shr4, MYINT shr5, MYINT shr6, MYINT shr7, MYINT shr8, MYINT shr9, MYINT shl1, MYINT shl2, MYINT shl3, MYINT shl4, MYINT shl5, MYINT shl6, MYINT shl7, MYINT shl8, MYINT shl9, std::string name) {
|
||||
MYITE HOffsetL = (HF / 2) - HPADL;
|
||||
MYITE WOffsetL = (WF / 2) - WPADL;
|
||||
MYITE HOffsetR = (HF / 2) - HPADR;
|
||||
MYITE WOffsetR = (WF / 2) - WPADR;
|
||||
|
||||
for (MYITE n = 0; n < N; n++) {
|
||||
MYITE margin = HOffsetL + (HF / 2 + 1) - HSTR > 0 ? HOffsetL + (HF/2 + 1) - HSTR : 0;
|
||||
MYITE nstart = HOffsetL - (HF / 2) < 0 ? 0 : HOffsetL - (HF / 2);
|
||||
for (MYITE i = nstart; i < margin; i++) {
|
||||
for (MYITE j = 0; j < W; j++) {
|
||||
for (MYITE k = 0; k < Ct; k++) {
|
||||
for (MYITE l = 0; l < Cin; l++) {
|
||||
U[l] = A[n * H * W * Cin + i * W * Cin + j * Cin + l] * F1[l * Ct + k];
|
||||
}
|
||||
MYITE totalEle = Cin;
|
||||
MYITE count = Cin;
|
||||
MYITE depth = 0;
|
||||
|
||||
while (depth < D1) {
|
||||
for (MYITE p = 0; p < (totalEle / 2 + 1); p++) {
|
||||
if (p < count / 2) {
|
||||
U[p] = U[2 * p] + U[(2 * p) + 1];
|
||||
} else if ((p == (count / 2)) && ((count % 2) == 1)) {
|
||||
U[p] = U[2 * p];
|
||||
} else {
|
||||
U[p] = 0;
|
||||
}
|
||||
}
|
||||
|
||||
count = (count + 1) / 2;
|
||||
depth++;
|
||||
}
|
||||
|
||||
float ar = U[0] + BN1B[k];
|
||||
X[i * W * Ct + j * Ct + k] = (ar) * BN1W[k];
|
||||
Profile2(&ar, 1, 1, name + "t1");
|
||||
X[i * W * Ct + j * Ct + k] = X[i * W * Ct + j * Ct + k] < 0.0 ? 0.0 : X[i * W * Ct + j * Ct + k];
|
||||
					X[i * W * Ct + j * Ct + k] = X[i * W * Ct + j * Ct + k] > 6.0 ? 6.0 : X[i * W * Ct + j * Ct + k];
				}
			}
		}

		for (MYITE h = HOffsetL, hout = 0; h < H - HOffsetR; hout++, h += HSTR) {
			for (MYITE i = 0; i < HSTR; i++) {
				for (MYITE j = 0; j < W; j++) {
					for (MYITE k = 0; k < Ct; k++) {
						MYITE iRed = (i + margin + hout * HSTR) % HF, iFull = i + margin + hout * HSTR;
						X[iRed * W * Ct + j * Ct + k] = 0.0;
						for (MYITE l = 0; l < Cin; l++) {
							float a = iFull < H ? A[n * H * W * Cin + iFull * W * Cin + j * Cin + l] : 0.0;
							U[l] = a * F1[l * Ct + k];
						}
						MYITE totalEle = Cin;
						MYITE count = Cin;
						MYITE depth = 0;

						while (depth < D1) {
							for (MYITE p = 0; p < (totalEle / 2 + 1); p++) {
								if (p < count / 2) {
									U[p] = U[2 * p] + U[(2 * p) + 1];
								} else if ((p == (count / 2)) && ((count % 2) == 1)) {
									U[p] = U[2 * p];
								} else {
									U[p] = 0;
								}
							}

							count = (count + 1) / 2;
							depth++;
						}

						float ar = U[0] + BN1B[k];
						X[iRed * W * Ct + j * Ct + k] = (ar) * BN1W[k];
						Profile2(&ar, 1, 1, name + "t1");
						X[iRed * W * Ct + j * Ct + k] = X[iRed * W * Ct + j * Ct + k] < 0.0 ? 0.0 : X[iRed * W * Ct + j * Ct + k];
						X[iRed * W * Ct + j * Ct + k] = X[iRed * W * Ct + j * Ct + k] > 6.0 ? 6.0 : X[iRed * W * Ct + j * Ct + k];
					}
				}
			}

			for (MYITE w = WOffsetL, wout = 0; w < W - WOffsetR; w += WSTR, wout++) {
				for (MYITE g = 0; g < Ct; g++) {
					MYITE counter = 0;
					for (MYITE hf = -(HF / 2); hf <= (HF / 2); hf++) {
						for (MYITE wf = -(WF / 2); wf <= (WF / 2); wf++) {
							float x = (((h + hf) < 0) || ((h + hf) >= H) || ((w + wf) < 0) || ((w + wf) >= W)) ? 0.0 : X[((h + hf) % HF) * W * Ct + (w + wf) * Ct + g];
							float b = F2[g * HF * WF + (hf + HF / 2) * WF + (wf + WF / 2)];
							U[counter] = x * b;
							counter++;
						}
					}
					MYITE totalEle = HF * WF;
					MYITE count = HF * WF;
					MYITE depth = 0;

					while (depth < D2) {
						for (MYITE p = 0; p < (totalEle / 2 + 1); p++) {
							if (p < count / 2) {
								U[p] = U[2 * p] + U[(2 * p) + 1];
							} else if ((p == (count / 2)) && ((count % 2) == 1)) {
								U[p] = U[2 * p];
							} else {
								U[p] = 0;
							}
						}

						count = (count + 1) / 2;
						depth++;
					}

					float ar = U[0] + BN2B[g];
					T[g] = (ar) * BN2W[g];
					Profile2(&ar, 1, 1, name + "t3");
					T[g] = T[g] < 0.0 ? 0.0 : T[g];
					T[g] = T[g] > 6.0 ? 6.0 : T[g];
				}

				for (MYITE i = 0; i < Cout; i++) {
					for (MYITE g = 0; g < Ct; g++) {
						U[g] = T[g] * F3[g * Cout + i];
					}
					MYITE totalEle = Ct;
					MYITE count = Ct;
					MYITE depth = 0;

					while (depth < D3) {
						for (MYITE p = 0; p < (totalEle / 2 + 1); p++) {
							if (p < count / 2) {
								U[p] = U[2 * p] + U[(2 * p) + 1];
							} else if ((p == (count / 2)) && ((count % 2) == 1)) {
								U[p] = U[2 * p];
							} else {
								U[p] = 0;
							}
						}

						count = (count + 1) / 2;
						depth++;
					}

					float ar = U[0] + BN3B[i];
					C[n * Hout * Wout * Cout + hout * Wout * Cout + wout * Cout + i] = (ar) * BN3W[i];
					Profile2(&ar, 1, 1, name + "t5");
				}
			}
		}
	}
}

// C = conv(A, B, <params>)
// A[N][H][W][CIN], B[G][HF][WF][CINF][COUTF], C[N][HOUT][WOUT][COUTF*G]
void Convolution(float* A, const float* B, float* C, float* tmp, MYINT N, MYINT H, MYINT W, MYINT CIN, MYINT HF, MYINT WF, MYINT CINF, MYINT COUTF, MYINT HOUT, MYINT WOUT, MYINT HPADL, MYINT HPADR, MYINT WPADL, MYINT WPADR, MYINT HSTR, MYINT WSTR, MYINT HDL, MYINT WDL, MYINT G, MYINT shrA, MYINT shrB, MYINT H1, MYINT H2) {
	MYITE HOffsetL = HDL * (HF / 2) - HPADL;
	MYITE WOffsetL = WDL * (WF / 2) - WPADL;
	MYITE HOffsetR = HDL * (HF / 2) - HPADR;
	MYITE WOffsetR = WDL * (WF / 2) - WPADR;

	for (MYITE n = 0; n < N; n++) {
		for (MYITE h = HOffsetL, hout = 0; h < H - HOffsetR; h += HSTR, hout++) {
			for (MYITE w = WOffsetL, wout = 0; w < W - WOffsetR; w += WSTR, wout++) {
				for (MYITE g = 0; g < G; g++) {
					for (MYITE co = 0; co < COUTF; co++) {
						MYITE counter = 0;
						for (MYITE hf = -(HF / 2); hf <= HF / 2; hf++) {
							for (MYITE wf = -(WF / 2); wf <= WF / 2; wf++) {
								for (MYITE ci = 0; ci < CINF; ci++) {
									float a = (((h + HDL * hf) < 0) || ((h + HDL * hf) >= H) || ((w + WDL * wf) < 0) || ((w + WDL * wf) >= W)) ? 0 : A[n * H * W * CIN + (h + HDL * hf) * W * CIN + (w + WDL * wf) * CIN + (ci + g * CINF)];
									float b = B[g * HF * WF * CINF * COUTF + (hf + HF / 2) * WF * CINF * COUTF + (wf + WF / 2) * CINF * COUTF + ci * COUTF + co];

									tmp[counter] = a * b;
									counter++;
								}
							}
						}

						MYITE totalEle = HF * WF * CINF;
						MYITE count = HF * WF * CINF, depth = 0;

						while (depth < (H1 + H2)) {
							for (MYITE p = 0; p < (totalEle / 2 + 1); p++) {
								float sum;
								if (p < (count >> 1)) {
									sum = tmp[2 * p] + tmp[(2 * p) + 1];
								} else if ((p == (count >> 1)) && ((count & 1) == 1)) {
									sum = tmp[2 * p];
								} else {
									sum = 0;
								}

								// Unlike the fixed-point version, no rescaling is needed here,
								// so the partial sum is stored unchanged at every tree depth.
								tmp[p] = sum;
							}

							count = (count + 1) >> 1;
							depth++;
						}

						C[n * HOUT * WOUT * (COUTF * G) + hout * WOUT * (COUTF * G) + wout * (COUTF * G) + (co + g * COUTF)] = tmp[0];
					}
				}
			}
		}
	}
}

// C = A # B
// A[N][H][W][CI], B[HF][WF][CI][CO], C[N][H][W][CO]
void Conv(float* A, const float* B, float* C, float* tmp, MYINT N, MYINT H, MYINT W, MYINT CI, MYINT HF, MYINT WF, MYINT CO, MYINT shrA, MYINT shrB, MYINT H1, MYINT H2) {
	MYITE padH = (HF - 1) / 2;
	MYITE padW = (WF - 1) / 2;

	for (MYITE n = 0; n < N; n++) {
		for (MYITE h = 0; h < H; h++) {
			for (MYITE w = 0; w < W; w++) {
				for (MYITE co = 0; co < CO; co++) {
					MYITE counter = 0;
					for (MYITE hf = 0; hf < HF; hf++) {
						for (MYITE wf = 0; wf < WF; wf++) {
							for (MYITE ci = 0; ci < CI; ci++) {
								float a = (((((h + hf) < padH) || ((h + hf) >= (H + padH))) || (((w + wf) < padW) || ((w + wf) >= (W + padW)))) ? 0 : A[n * H * W * CI + ((h + hf) - padH) * W * CI + ((w + wf) - padW) * CI + ci]);
								float b = B[hf * WF * CI * CO + wf * CI * CO + ci * CO + co];

								tmp[counter] = a * b;
								counter++;
							}
						}
					}

					MYITE totalEle = HF * WF * CI;
					MYITE count = HF * WF * CI, depth = 0;

					while (depth < (H1 + H2)) {
						for (MYITE p = 0; p < (totalEle / 2 + 1); p++) {
							float sum;
							if (p < (count >> 1)) {
								sum = tmp[2 * p] + tmp[(2 * p) + 1];
							} else if ((p == (count >> 1)) && ((count & 1) == 1)) {
								sum = tmp[2 * p];
							} else {
								sum = 0;
							}

							// Unlike the fixed-point version, no rescaling is needed here,
							// so the partial sum is stored unchanged at every tree depth.
							tmp[p] = sum;
						}

						count = (count + 1) >> 1;
						depth++;
					}

					C[n * H * W * CO + h * W * CO + w * CO + co] = tmp[0];
				}
			}
		}
	}

	return;
}

// X = A <+> B
// A[N][H][W][C], B[C]
void AddOrSubCir4D(float* A, const float* B, float* X, MYINT N, MYINT H, MYINT W, MYINT C, MYINT shrA, MYINT shrB, MYINT shrC, bool add) {
	for (MYITE n = 0; n < N; n++) {
		for (MYITE h = 0; h < H; h++) {
			for (MYITE w = 0; w < W; w++) {
				for (MYITE c = 0; c < C; c++) {
					float a = A[n * H * W * C + h * W * C + w * C + c];
					float b = B[c];

					float res;
					if (add) {
						res = a + b;
					} else {
						res = a - b;
					}

					X[n * H * W * C + h * W * C + w * C + c] = res;
				}
			}
		}
	}
	return;
}

// X = A <+> B
// A[H][W], B[W]
void AddOrSubCir2D(float* A, const float* B, float* X, MYINT H, MYINT W, MYINT shrA, MYINT shrB, MYINT shrC, bool add) {
	for (MYITE h = 0; h < H; h++) {
		for (MYITE w = 0; w < W; w++) {
			float a = A[h * W + w];
			float b = B[w];

			float res;
			if (add) {
				res = a + b;
			} else {
				res = a - b;
			}

			X[h * W + w] = res;
		}
	}
	return;
}

// A = relu(A)
// A[N][H][W][C]
void Relu4D(float* A, MYINT N, MYINT H, MYINT W, MYINT C) {
	for (MYITE n = 0; n < N; n++) {
		for (MYITE h = 0; h < H; h++) {
			for (MYITE w = 0; w < W; w++) {
				for (MYITE c = 0; c < C; c++) {
					float a = A[n * H * W * C + h * W * C + w * C + c];
					if (a < 0) {
						a = 0;
					}

					A[n * H * W * C + h * W * C + w * C + c] = a;
				}
			}
		}
	}
	return;
}

// B = relu6(A)
// A[N][H][W][C]
void Relu6(float* A, float* B, MYINT N, MYINT H, MYINT W, MYINT C, MYINT six, MYINT div) {
	for (MYITE n = 0; n < N; n++) {
		for (MYITE h = 0; h < H; h++) {
			for (MYITE w = 0; w < W; w++) {
				for (MYITE c = 0; c < C; c++) {
					float a = A[n * H * W * C + h * W * C + w * C + c];
					if (a < 0) {
						a = 0;
					}
					if (a > 6) {
						a = 6;
					}

					B[n * H * W * C + h * W * C + w * C + c] = a;
				}
			}
		}
	}
	return;
}

// A = relu(A)
// A[H][W]
void Relu2D(float* A, MYINT H, MYINT W) {
	for (MYITE h = 0; h < H; h++) {
		for (MYITE w = 0; w < W; w++) {
			float a = A[h * W + w];
			if (a < 0) {
				a = 0;
			}

			A[h * W + w] = a;
		}
	}
	return;
}

// B = maxpool(A)
// A[N][H][W][C], B[N][H][W][C]
void Maxpool(float* A, float* B, MYINT N, MYINT H, MYINT W, MYINT C, MYINT FH, MYINT FW, MYINT strideH, MYINT strideW, MYINT HPADL, MYINT HPADR, MYINT WPADL, MYINT WPADR) {
	MYITE HO = H / strideH;
	MYITE WO = W / strideW;

	for (MYITE n = 0; n < N; n++) {
		for (MYITE ho = 0; ho < HO; ho++) {
			for (MYITE wo = 0; wo < WO; wo++) {
				for (MYITE c = 0; c < C; c++) {
					float max = A[n * H * W * C + (strideH * ho) * W * C + (strideW * wo) * C + c];
					for (MYITE hs = 0; hs < FH; hs++) {
						for (MYITE ws = 0; ws < FW; ws++) {
							float a = A[n * H * W * C + ((strideH * ho) + hs) * W * C + ((strideW * wo) + ws) * C + c];
							if (a > max) {
								max = a;
							}
						}
					}

					B[n * HO * WO * C + ho * WO * C + wo * C + c] = max;
				}
			}
		}
	}
	return;
}

void NormaliseL2(float* A, float* B, MYINT N, MYINT H, MYINT W, MYINT C, MYINT scaleA, MYINT shrA) {
	for (MYITE n = 0; n < N; n++) {
		for (MYITE h = 0; h < H; h++) {
			for (MYITE w = 0; w < W; w++) {
				// Calculate the sum of squares over the channel dimension.
				float sumSquare = 0;
				for (MYITE c = 0; c < C; c++) {
					float tmp = A[n * H * W * C + h * W * C + w * C + c];
					sumSquare += tmp * tmp;
				}

				// Guard against division by zero before taking the inverse square root.
				if (sumSquare == 0) {
					sumSquare = 1e-5;
				}

				float inverseNorm = 1 / sqrt(sumSquare);

				// Multiply all elements by 1 / sqrt(sumSquare).
				for (MYITE c = 0; c < C; c++) {
					B[n * H * W * C + h * W * C + w * C + c] = A[n * H * W * C + h * W * C + w * C + c] * inverseNorm;
				}
			}
		}
	}
	return;
}

// B = exp(A)
void Exp(float* A, MYINT I, MYINT J, MYINT shrA, MYINT shrB, float* B) {
	for (MYITE i = 0; i < I; i++) {
		for (MYITE j = 0; j < J; j++) {
			float x = A[i * J + j];

			updateRangeOfExp(-x);

			B[i * J + j] = exp(x);
		}
	}
	return;
}

// B = sigmoid(A)
void Sigmoid(float* A, MYINT I, MYINT J, float div, float add, float sigmoid_limit, MYINT scale_in, MYINT scale_out, float* B) {
	for (MYITE i = 0; i < I; i++) {
		for (MYITE j = 0; j < J; j++) {
			float x = A[i * J + j], y;

#ifdef FLOATEXP
			y = 1 / (1 + exp(-x));
#else
			y = (x + 1) / 2;
			y = y > 0 ? y : 0;
			y = y < 1 ? y : 1;
#endif
			B[i * J + j] = y;
		}
	}
	return;
}

// A = AdjustScaleShr(A)
void AdjustScaleShr(float* A, MYINT I, MYINT J, MYINT scale) {
	return;
}

// A = AdjustScaleShl(A)
void AdjustScaleShl(float* A, MYINT I, MYINT J, MYINT scale) {
	return;
}

@@ -0,0 +1,68 @@
// Copyright (c) Microsoft Corporation. All rights reserved.
// Licensed under the MIT license.

#pragma once

// This file contains declarations for floating point versions of all operators supported by SeeDot.
// Please refer to library_fixed.h for a description of each operator.

void MatAddNN(float* A, float* B, float* C, MYINT I, MYINT J, MYINT shrA, MYINT shrB, MYINT shrC);
void MatAddCN(const float* A, float* B, float* C, MYINT I, MYINT J, MYINT shrA, MYINT shrB, MYINT shrC);
void MatAddNC(float* A, const float* B, float* C, MYINT I, MYINT J, MYINT shrA, MYINT shrB, MYINT shrC);
void MatAddCC(const float* A, const float* B, float* C, MYINT I, MYINT J, MYINT shrA, MYINT shrB, MYINT shrC);

void MatAddBroadCastA(float* A, float* B, float* C, MYINT I, MYINT J, MYINT shrA, MYINT shrB, MYINT shrC);
void MatAddBroadCastB(float* A, float* B, float* C, MYINT I, MYINT J, MYINT shrA, MYINT shrB, MYINT shrC);

void MatAdd4(float* A, float* B, float* X, MYINT N, MYINT H, MYINT W, MYINT C, MYINT shrA, MYINT shrB, MYINT shrC);

void MatSub(float* A, const float* B, float* C, MYINT I, MYINT J, MYINT shrA, MYINT shrB, MYINT shrC);
void MatSubBroadCastA(float* A, float* B, float* C, MYINT I, MYINT J, MYINT shrA, MYINT shrB, MYINT shrC);
void MatSubBroadCastB(float* A, float* B, float* C, MYINT I, MYINT J, MYINT shrA, MYINT shrB, MYINT shrC);

void MatMulNN(float* A, float* B, float* C, float* tmp, MYINT I, MYINT K, MYINT J, MYINT shrA, MYINT shrB, MYINT H1, MYINT H2);
void MatMulCN(const float* A, float* B, float* C, float* tmp, MYINT I, MYINT K, MYINT J, MYINT shrA, MYINT shrB, MYINT H1, MYINT H2);
void MatMulNC(float* A, const float* B, float* C, float* tmp, MYINT I, MYINT K, MYINT J, MYINT shrA, MYINT shrB, MYINT H1, MYINT H2);
void MatMulCC(const float* A, const float* B, float* C, float* tmp, MYINT I, MYINT K, MYINT J, MYINT shrA, MYINT shrB, MYINT H1, MYINT H2);

void SparseMatMulX(const MYINT* Aidx, const float* Aval, float** B, float* C, int16_t K, MYINT shrA, MYINT shrB, MYINT shrC);
void SparseMatMul(const MYINT* Aidx, const float* Aval, float* B, float* C, int16_t K, MYINT shrA, MYINT shrB, MYINT shrC);

void MulCir(float* A, float* B, float* C, MYINT I, MYINT J, MYINT shrA, MYINT shrB);

void TanH(float* A, MYINT I, MYINT J, float scale_in, float scale_out, float* B);

void ArgMax(float* A, MYINT I, MYINT J, int* index);

void Transpose(float* A, float* B, MYINT I, MYINT J);

void ScalarMul(float* A, float* B, float* C, MYINT I, MYINT J, MYINT shrA, MYINT shrB);

void MBConv(float* A, const float* F1, const float* BN1W, const float* BN1B, const float* F2, const float* BN2W, const float* BN2B, const float* F3, const float* BN3W, const float* BN3B, float* C, float* X, float* T, float* U, MYITE N, MYITE H, MYITE W, MYITE Cin, MYITE Ct, MYITE HF, MYITE WF, MYITE Cout, MYITE Hout, MYITE Wout, MYITE HPADL, MYITE HPADR, MYITE WPADL, MYITE WPADR, MYITE HSTR, MYITE WSTR, MYITE D1, MYITE D2, MYITE D3, MYINT SIX_1, MYINT SIX_2, MYINT shr1, MYINT shr2, MYINT shr3, MYINT shr4, MYINT shr5, MYINT shr6, MYINT shr7, MYINT shr8, MYINT shr9, MYINT shl1, MYINT shl2, MYINT shl3, MYINT shl4, MYINT shl5, MYINT shl6, MYINT shl7, MYINT shl8, MYINT shl9, std::string name);

void Conv(float* A, const float* B, float* C, float* tmp, MYINT N, MYINT H, MYINT W, MYINT CI, MYINT HF, MYINT WF, MYINT CO, MYINT shrA, MYINT shrB, MYINT H1, MYINT H2);

void Convolution(float* A, const float* B, float* C, float* tmp, MYINT N, MYINT H, MYINT W, MYINT CIN, MYINT HF, MYINT WF, MYINT CINF, MYINT COUTF, MYINT HOUT, MYINT WOUT, MYINT HPADL, MYINT HPADR, MYINT WPADL, MYINT WPADR, MYINT HSTR, MYINT WSTR, MYINT HDL, MYINT WDL, MYINT G, MYINT shrA, MYINT shrB, MYINT H1, MYINT H2);

void AddOrSubCir4D(float* A, const float* B, float* X, MYINT N, MYINT H, MYINT W, MYINT C, MYINT shrA, MYINT shrB, MYINT shrC, bool add);

void AddOrSubCir2D(float* A, const float* B, float* X, MYINT H, MYINT W, MYINT shrA, MYINT shrB, MYINT shrC, bool add);

void Relu4D(float* A, MYINT N, MYINT H, MYINT W, MYINT C);

void Relu2D(float* A, MYINT H, MYINT W);

void Relu6(float* A, float* B, MYINT N, MYINT H, MYINT W, MYINT C, MYINT six, MYINT div);

void Maxpool(float* A, float* B, MYINT N, MYINT H, MYINT W, MYINT C, MYINT FH, MYINT FW, MYINT strideH, MYINT strideW, MYINT HPADL, MYINT HPADR, MYINT WPADL, MYINT WPADR);

void Exp(float* A, MYINT I, MYINT J, MYINT shrA, MYINT shrB, float* B);

void Sigmoid(float* A, MYINT I, MYINT J, float div, float add, float sigmoid_limit, MYINT scale_in, MYINT scale_out, float* B);

void AdjustScaleShr(float* A, MYINT I, MYINT J, MYINT scale);
void AdjustScaleShl(float* A, MYINT I, MYINT J, MYINT scale);

void Reverse2(float* A, MYINT axis, MYINT I, MYINT J, float* B);

void NormaliseL2(float* A, float* B, MYINT N, MYINT H, MYINT W, MYINT C, MYINT scaleA, MYINT shrA);

@@ -5,7 +5,12 @@
#include <fstream>
#include <sstream>
#include <vector>
#include <list>
#include <cstring>
#include <cmath>
#include <cstdlib>
#include <thread>
#include <algorithm>

#include "datatypes.h"
#include "predictors.h"

@@ -13,102 +18,180 @@

using namespace std;

/*
 * This file is the driver for the x86 version of the code. It reads the floating point data from the csv files, parses them
 * and translates them into integers, and then puts them through multiple generated inference codes and evaluates each result.
 */

enum Version
{
	Fixed,
	Float
};
enum DatasetType
{
	Training,
	Testing
};
enum ProblemType
{
	Classification,
	Regression
};

bool profilingEnabled = false;

// Split the CSV row into multiple values.
vector<string> readCSVLine(string line) {
	vector<string> tokens;

	stringstream stream(line);
	string str;

	while (getline(stream, str, ',')) {
		tokens.push_back(str);
	}

	return tokens;
}

// Read the input 'X'.
vector<string> getFeatures(string line) {
	static int featuresLength = -1;

	vector<string> features = readCSVLine(line);

	if (featuresLength == -1) {
		featuresLength = (int)features.size();
	}

	if ((int)features.size() != featuresLength) {
		throw "Number of row entries in X is inconsistent";
	}

	return features;
}

// Read the ground truth label/value 'Y'.
vector<string> getLabel(string line) {
	static int labelLength = -1;

	vector<string> labels = readCSVLine(line);

	if (labelLength == -1) {
		labelLength = (int)labels.size();
	}

	if ((int)labels.size() != labelLength) {
		throw "Number of row entries in Y is inconsistent";
	}

	return labels;
}

// Take in the input floating point datapoint, convert it to a fixed point integer and store it.
void populateFixedVector(MYINT** features_int, vector<string> features, int scale) {
	int features_size = (int)features.size();

	for (int i = 0; i < features_size; i++) {
		double f = (double)(atof(features.at(i).c_str()));
		double f_int = ldexp(f, -scale);
		features_int[i][0] = (MYINT)(f_int);
	}

	return;
}

// Take in the input floating point datapoint and store it.
void populateFloatVector(float** features_float, vector<string> features) {
	int features_size = (int)features.size();
	for (int i = 0; i < features_size; i++) {
		features_float[i][0] = (float)(atof(features.at(i).c_str()));
	}
	return;
}

// Multi-threading is used to speed up exploration.
// Each thread, which invokes the following method, is responsible for taking in one datapoint
// and running it through all the generated codes.
// The number of threads launched equals the number of datapoints in the given dataset.
void launchThread(int features_size, MYINT** features_int, MYINT*** features_intV, float** features_float, int counter, float* float_res, int* res, int** resV) {
	seedotFixed(features_int, res);
	seedotFloat(features_float, float_res);

	for (int i = 0; i < switches; i++) {
		seedotFixedSwitch(i, features_intV[i], resV[i]);
	}

	for (int i = 0; i < features_size; i++) {
		delete features_int[i];
		delete features_float[i];
		for (int j = 0; j < switches; j++) {
			delete features_intV[j][i];
		}
	}
	delete[] features_int;
	delete[] features_float;
	for (int j = 0; j < switches; j++) {
		delete[] features_intV[j];
	}
	delete[] features_intV;
}

int main(int argc, char* argv[]) {
	float epsilon = 0.00001;
	if (argc == 1) {
		cout << "No arguments supplied" << endl;
		return 1;
	}

	// Parse the arguments.
	Version version;
	if (strcmp(argv[1], "fixed") == 0) {
		version = Fixed;
	} else if (strcmp(argv[1], "float") == 0) {
		version = Float;
	} else {
		cout << "Argument mismatch for version\n";
		return 1;
	}
	string versionStr = argv[1];

	DatasetType datasetType;
	if (strcmp(argv[2], "training") == 0) {
		datasetType = Training;
	} else if (strcmp(argv[2], "testing") == 0) {
		datasetType = Testing;
	} else {
		cout << "Argument mismatch for dataset type\n";
		return 1;
	}
	string datasetTypeStr = argv[2];

	ProblemType problem;
	if (strcmp(argv[3], "classification") == 0) {
		problem = Classification;
	} else if (strcmp(argv[3], "regression") == 0) {
		problem = Regression;
	} else {
		cout << "Argument mismatch for problem type\n";
		return 1;
	}
	string problemTypeStr = argv[3];

	int numOutputs = atoi(argv[4]);

	// Read the dataset.
	string inputDir = "input/";

	ifstream featuresFile(inputDir + "X.csv");
	ifstream lablesFile(inputDir + "Y.csv");

	if (featuresFile.good() == false || lablesFile.good() == false) {
		throw "Input files don't exist";
	}

	// Create output directory and files.
	string outputDir = "output/" + versionStr;

	string outputFile = outputDir + "/prediction-info-" + datasetTypeStr + ".txt";
	string statsFile = outputDir + "/stats-" + datasetTypeStr + ".txt";

@ -116,88 +199,312 @@ int main(int argc, char *argv[]) {
|
|||
ofstream output(outputFile);
|
||||
ofstream stats(statsFile);
|
||||
|
||||
int correct = 0, total = 0;
|
||||
|
||||
bool alloc = false;
|
||||
int features_size = -1;
|
||||
MYINT **features_int = NULL;
|
||||
float *features_float = NULL;
|
||||
MYINT** features_int = NULL;
|
||||
vector<MYINT**> features_intV(switches, NULL);
|
||||
float** features_float = NULL;
|
||||
|
||||
// Initialize variables used for profiling
|
||||
// Initialize variables used for profiling.
|
||||
initializeProfiling();
|
||||
|
||||
string line1, line2;
|
||||
while (getline(featuresFile, line1) && getline(lablesFile, line2)) {
|
||||
// Read the feature vector and class ID
|
||||
vector<string> features = getFeatures(line1);
|
||||
int label = getLabel(line2);
|
||||
// Following variables are used for storing the results of the inference.
|
||||
vector<float*> vector_float_res;
|
||||
vector<int32_t*> vector_int_res;
|
||||
vector<int32_t**> vector_int_resV;
|
||||
vector<int32_t*> labelsInt;
|
||||
vector<float*> labelsFloat;
|
||||
list<thread> threads;
|
||||
|
||||
// Allocate memory to store the feature vector as arrays
|
||||
MYINT*** features_intV_copy;
|
||||
|
||||
string line1, line2;
|
||||
int counter = 0;
|
||||
|
||||
if (version == Float) {
|
||||
profilingEnabled = true;
|
||||
}
|
||||
|
||||
// Each iteration takes care of one datapoint.
|
||||
while (getline(featuresFile, line1) && getline(lablesFile, line2)) {
|
||||
// Read the feature vector and class ID.
|
||||
vector<string> features = getFeatures(line1);
|
||||
vector<string> labelString = getLabel(line2);
|
||||
int32_t* labelInt = new int32_t[numOutputs];
|
||||
float* labelFloat = new float[numOutputs];
|
||||
|
||||
if (problem == Classification) {
|
||||
for (int i = 0; i < numOutputs; i++) {
|
||||
labelInt[i] = atoi(labelString[i].c_str());
|
||||
}
|
||||
} else if (problem == Regression) {
|
||||
for (int i = 0; i < numOutputs; i++) {
|
||||
labelFloat[i] = atof(labelString[i].c_str());
|
||||
}
|
||||
}
|
||||
|
||||
// Allocate memory to store the feature vector as arrays.
|
||||
if (alloc == false) {
|
||||
features_size = (int)features.size();
|
||||
|
||||
if (version == Fixed) {
|
||||
features_int = new MYINT*[features_size];
|
||||
for (int i = 0; i < features_size; i++)
|
||||
features_int[i] = new MYINT[1];
|
||||
features_int = new MYINT* [features_size];
|
||||
for (int i = 0; i < features_size; i++) {
|
||||
features_int[i] = new MYINT[1];
|
||||
}
|
||||
|
||||
for (int i = 0; i < switches; i++) {
|
||||
features_intV[i] = new MYINT* [features_size];
|
||||
for (int j = 0; j < features_size; j++) {
|
||||
features_intV[i][j] = new MYINT[1];
|
||||
}
|
||||
}
|
||||
|
||||
features_float = new float* [features_size];
|
||||
for (int i = 0; i < features_size; i++) {
|
||||
features_float[i] = new float[1];
|
||||
}
|
||||
else
|
||||
features_float = new float[features_size];
|
||||
|
||||
alloc = true;
|
||||
}
|
||||
|
||||
// Populate the array using the feature vector
|
||||
if (version == Fixed)
|
||||
for (int i = 0; i < features_size; i++) {
|
||||
#ifdef INT16
|
||||
features_int[i][0] = (MYINT)(atol(features.at(i).c_str()));
|
||||
#endif
|
||||
#ifdef INT32
|
||||
features_int[i][0] = (MYINT)(atoll(features.at(i).c_str()));
|
||||
#endif
|
||||
// Populate the array using the feature vector.
|
||||
if (debugMode || version == Fixed) {
|
||||
populateFixedVector(features_int, features, scaleForX);
|
||||
for (int i = 0; i < switches; i++) {
|
||||
populateFixedVector(features_intV[i], features, scalesForX[i]);
|
||||
}
|
||||
else
|
||||
for (int i = 0; i < features_size; i++)
|
||||
features_float[i] = (float)(atof(features.at(i).c_str()));
|
||||
|
||||
// Invoke the predictor function
|
||||
int res = -1;
|
||||
if (algo == Bonsai && version == Fixed)
|
||||
res = seedotFixed(features_int);
|
||||
else if (algo == Bonsai && version == Float)
|
||||
res = bonsaiFloat(features_float);
|
||||
else if (algo == Protonn && version == Fixed)
|
||||
res = seedotFixed(features_int);
|
||||
else if (algo == Protonn && version == Float)
|
||||
res = protonnFloat(features_float);
|
||||
|
||||
if ((res) == label) {
|
||||
correct++;
|
||||
}
|
||||
else {
|
||||
output << "Incorrect prediction for input " << total + 1 << ". Predicted " << res + 1 << " Expected " << label << endl;
|
||||
populateFloatVector(features_float, features);
|
||||
} else {
|
||||
populateFloatVector(features_float, features);
|
||||
}
|
||||
|
||||
total++;
|
||||
// Invoke the predictor function.
|
||||
int* fixed_res = NULL;
|
||||
float* float_res = NULL;
|
||||
vector <int> resV(switches, -1);
|
||||
|
||||
if (debugMode) {
|
||||
float_res = new float[numOutputs];
|
||||
seedotFloat(features_float, float_res);
|
||||
fixed_res = new int32_t[numOutputs];
|
||||
seedotFixed(features_int, fixed_res);
|
||||
//debug();
|
||||
vector_float_res.push_back(float_res);
|
||||
vector_int_res.push_back(fixed_res);
|
||||
if (problem == Classification) {
|
||||
labelsInt.push_back(labelInt);
|
||||
} else if (problem == Regression) {
|
||||
labelsFloat.push_back(labelFloat);
|
||||
}
|
||||
vector_int_resV.push_back(NULL);
|
||||
} else {
|
||||
// There are several codes generated which are built simultaneously.
|
||||
if (version == Fixed) {
|
||||
vector_float_res.push_back(new float[numOutputs]);
|
||||
vector_int_res.push_back(new int32_t[numOutputs]);
|
||||
// Populating labels for each generated code.
|
||||
if (problem == Classification) {
|
||||
labelsInt.push_back(labelInt);
|
||||
} else if (problem == Regression) {
|
||||
labelsFloat.push_back(labelFloat);
|
||||
}
|
||||
int** switchRes = new int* [switches];
|
||||
// Instantiating vectors for storing inference results for each generated code.
|
||||
for (int i = 0; i < switches; i++) {
|
||||
switchRes[i] = new int[numOutputs];
|
||||
}
|
||||
vector_int_resV.push_back(switchRes);
|
||||
// Instantiating vectors for storing features, integer and float.
|
||||
MYINT** features_int_copy = new MYINT* [features_size];
|
||||
for (int i = 0; i < features_size; i++) {
|
||||
features_int_copy[i] = new MYINT[1];
|
||||
features_int_copy[i][0] = features_int[i][0];
|
||||
}
|
||||
float** features_float_copy = new float* [features_size];
|
||||
for (int i = 0; i < features_size; i++) {
|
||||
features_float_copy[i] = new float[1];
|
||||
features_float_copy[i][0] = features_float[i][0];
|
||||
}
|
||||
features_intV_copy = new MYINT** [switches];
|
||||
for (int j = 0; j < switches; j++) {
|
||||
features_intV_copy[j] = new MYINT* [features_size];
|
||||
for (int i = 0; i < features_size; i++) {
|
||||
features_intV_copy[j][i] = new MYINT[1];
|
||||
features_intV_copy[j][i][0] = features_intV[j][i][0];
|
||||
}
|
||||
}
|
||||
// Launching one thread which processes one datapoint.
|
||||
if (threads.size() < 64) {
|
||||
threads.push_back(thread(launchThread, features_size, features_int_copy, features_intV_copy, features_float_copy, counter, vector_float_res.back(), vector_int_res.back(), vector_int_resV.back()));
|
||||
} else {
|
||||
threads.front().join();
|
||||
threads.pop_front();
|
||||
threads.push_back(thread(launchThread, features_size, features_int_copy, features_intV_copy, features_float_copy, counter, vector_float_res.back(), vector_int_res.back(), vector_int_resV.back()));
|
||||
}
|
||||
} else if (version == Float) {
|
||||
float_res = new float[numOutputs];
|
||||
seedotFloat(features_float, float_res);
|
||||
vector_float_res.push_back(float_res);
|
||||
vector_int_res.push_back(new int[numOutputs]);
|
||||
if (problem == Classification) {
|
||||
labelsInt.push_back(labelInt);
|
||||
} else if (problem == Regression) {
|
||||
labelsFloat.push_back(labelFloat);
|
||||
}
|
||||
vector_int_resV.push_back(NULL);
|
||||
}
|
||||
}
|
||||
|
||||
if (!logProgramOutput) {
|
||||
output << "Inputs handled = " << counter + 1 << endl;
|
||||
}
|
||||
|
||||
flushProfile();
|
||||
counter++;
|
||||
}
|
||||
|
||||
	// Deallocate memory.
	if (version == Fixed) {
		for (int i = 0; i < features_size; i++)
			delete[] features_int[i];
		delete[] features_int;
		for (list<thread>::iterator it = threads.begin(); it != threads.end(); it++) {
			it->join();
		}

	float disagreements = 0.0, reduced_disagreements = 0.0;

	// correct and disagreements are used for Classification problems' accuracy metrics;
	// errors and ferrors are used for Regression problems' error metrics.

	vector<int> correctV(switches, 0), totalV(switches, 0);
	vector<int> disagreementsV(switches, 0), reduced_disagreementsV(switches, 0);

	vector<float> errors(0, 0), ferrors(0, 0);
	vector<vector<float>> errorsV(switches, vector<float>(0, 0)), ferrorsV(switches, vector<float>(0, 0));

	ofstream trace("trace.txt");

	int correct = 0, total = 0;
	for (int i = 0; i < counter; i++) {
		int* fixed_res = vector_int_res[i];
		float* float_res = vector_float_res[i];
		int** resV = vector_int_resV[i];

		if (problem == Classification) {
			for (int j = 0; j < numOutputs; j++) {
				float res;
				if (version == Float) {
					res = float_res[j];
				} else {
					res = (float) fixed_res[j];
				}

				if (res != float_res[j]) {
					if (float_res[j] == labelsInt[i][j]) {
						reduced_disagreements++;
					}
					disagreements++;
				}

				if (res == labelsInt[i][j]) {
					correct++;
				} else {
					if (logProgramOutput) {
						output << "Main: Incorrect prediction for input " << total + 1 << " element " << j << ". Predicted " << res << " Expected " << labelsInt[i][j] << endl;
					}
				}
				total++;

				for (int k = 0; k < switches; k++) {
					if (version == Float) {
						throw "Multiple codes not expected in floating-point execution";
					}

					if (resV[k][j] != float_res[j]) {
						if (float_res[j] == labelsInt[i][j]) {
							reduced_disagreementsV[k]++;
						}
						disagreementsV[k]++;
					}

					if (resV[k][j] == labelsInt[i][j]) {
						correctV[k]++;
					} else {
						if (logProgramOutput) {
							output << "Sub " << k << ": Incorrect prediction for input " << total + 1 << " element " << j << ". Predicted " << resV[k][j] << " Expected " << labelsInt[i][j] << endl;
						}
					}
					totalV[k]++;
				}
			}
		} else {
			for (int j = 0; j < numOutputs; j++) {
				float res;
				if (version == Float) {
					res = float_res[j];
				} else {
					res = ((float) fixed_res[j]) / ldexp(1.0, -scaleForY);
				}

				trace << res << " ";

				float error = 100.0 * fabs(res - labelsFloat[i][j]);
				float ferror = 100.0 * fabs(res - float_res[j]);
				errors.push_back(error);
				ferrors.push_back(ferror);
				total++;

				for (int k = 0; k < switches; k++) {
					if (version == Float) {
						throw "Multiple codes not expected in floating-point execution";
					}
					float normRes = ((float) resV[k][j]) / ldexp(1.0, -scalesForY[k]);
					float error = 100.0 * fabs(normRes - labelsFloat[i][j]);
					float ferror = 100.0 * fabs(normRes - float_res[j]);
					errorsV[k].push_back(error);
					ferrorsV[k].push_back(ferror);
					totalV[k]++;
				}
			}
		}

		// Clearing memory.
		delete[] vector_int_res[i];
		delete[] vector_float_res[i];
		for (int k = 0; k < switches; k++) {
			delete[] vector_int_resV[i][k];
		}
		delete[] vector_int_resV[i];

		trace << endl;
	}

	trace.close();

	// Deallocate memory.
	for (int i = 0; i < features_size; i++) {
		delete[] features_int[i];
	}
	delete[] features_int;

	for (int i = 0; i < features_size; i++) {
		delete[] features_float[i];
	}
	delete[] features_float;

	for (int i = 0; i < switches; i++) {
		for (int j = 0; j < features_size; j++) {
			delete[] features_intV[i][j];
		}
		delete[] features_intV[i];
	}
	else
		delete features_float;

	float accuracy = (float) correct / total * 100.0f;

	cout.precision(3);
	cout << fixed;
	cout << "\n\n#test points = " << total << endl;
	cout << "Correct predictions = " << correct << endl;
	cout << "Accuracy = " << accuracy << "\n\n";
	if ((argc == 6) && (string(argv[5]) == "False"))
	{
		cout.precision(3);
		cout << fixed;
		cout << "\n\n#test points = " << total << endl;
		cout << "Correct predictions = " << correct << endl;
		cout << "Accuracy = " << accuracy << "\n\n";
	}

	output.precision(3);
	output << fixed;

@@ -208,11 +515,49 @@ int main(int argc, char *argv[]) {

	stats.precision(3);
	stats << fixed;
	stats << accuracy << "\n";
	stats << "default" << "\n";
	if (problem == Classification) {
		stats << accuracy << "\n";
		stats << ((float) disagreements) / numOutputs << "\n";
		stats << ((float) reduced_disagreements) / numOutputs << "\n";
	} else if (problem == Regression) {
		sort(errors.begin(), errors.end());
		sort(ferrors.begin(), ferrors.end());
		int index = 0.95 * errors.size() - 1;
		index = index > 0 ? index : 0;
		stats << errors[index] << "\n";
		stats << ferrors[index] << "\n";
		stats << "0.000\n";
	}

	if (version == Fixed) {
		for (int i = 0; i < switches; i++) {
			stats << i + 1 << "\n";
			if (problem == Classification) {
				stats << (float) correctV[i] / totalV[i] * 100.0f << "\n";
				stats << ((float) disagreementsV[i]) / numOutputs << "\n";
				stats << ((float) reduced_disagreementsV[i]) / numOutputs << "\n";
			} else if (problem == Regression) {
				sort(errorsV[i].begin(), errorsV[i].end());
				sort(ferrorsV[i].begin(), ferrorsV[i].end());
				int index = 0.95 * errorsV[i].size() - 1;
				index = index > 0 ? index : 0;
				stats << errorsV[i][index] << "\n";
				stats << ferrorsV[i][index] << "\n";
				stats << "0.000\n";
			}
		}
	}

	stats.close();

	if (datasetType == Training)
	if (version == Float) {
		dumpProfile();
	}

	if (datasetType == Training) {
		dumpRange(outputDir + "/profile.txt");
	}

	return 0;
}
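The regression path above reports a 95th-percentile absolute error: the per-output errors are sorted and the value at index `0.95 * n - 1` (clamped to 0) is written to the stats file. A minimal sketch of that metric (the function name is illustrative):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// 95th-percentile error, mirroring the stats computation: sort the
// collected errors and read the entry at index 0.95*n - 1, clamped to 0
// so a single-element vector still works.
float percentile95(std::vector<float> errors) {
	std::sort(errors.begin(), errors.end());
	int index = 0.95 * errors.size() - 1;
	index = index > 0 ? index : 0;
	return errors[index];
}
```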
The file diff is hidden because one or more lines are too long.
@@ -3,8 +3,8 @@

#pragma once

int seedotFixed(MYINT **X);
void seedotFixed(MYINT** X, int32_t* res);
void seedotFloat(float** X, float* res);
void seedotFixedSwitch(int i, MYINT** X, int32_t* res);

int bonsaiFloat(float *X);
int lenetFloat(float *X);
int protonnFloat(float *X);
extern const int switches;
@@ -4,7 +4,13 @@
#include <iostream>
#include <fstream>
#include <limits>
#include <cmath>
#include <unordered_map>
#include <algorithm>
#include <vector>
#include <cfloat>

#include "datatypes.h"
#include "profile.h"

using namespace std;

@@ -23,18 +29,22 @@ void initializeProfiling() {
}

void updateRange(float x) {
	if (x < m_all)
	if (x < m_all) {
		m_all = x;
	if (x > M_all)
	}
	if (x > M_all) {
		M_all = x;
	}
	return;
}

void updateRangeOfExp(float x) {
	if (x < m_exp)
	if (x < m_exp) {
		m_exp = x;
	if (x > M_exp)
	}
	if (x > M_exp) {
		M_exp = x;
	}
	return;
}


@@ -48,3 +58,221 @@ void dumpRange(string outputFile) {

	return;
}

unordered_map<string, float> min_all;
unordered_map<string, float> max_all;

unordered_map<string, float> min_temp;
unordered_map<string, float> max_temp;

unordered_map<string, vector<float>> all_values;
unordered_map<string, pair<float, float>> statistics;

bool range_exceeded = false;

void dumpProfile() {
	if (!profilingEnabled) {
		return;
	}
	if (min_all.size() == 0) {
		return;
	}
	ofstream outfile("dump.profile");
	auto min_i = min_all.begin();
	while (min_i != min_all.end()) {
		string key = min_i->first;
		outfile << key << "," << min_all[key] << "," << max_all[key] << endl;
		min_i++;
	}
	outfile.close();
}

void flushProfile() {
	if (!profilingEnabled) {
		return;
	}
	if (range_exceeded == false) {
		for (auto it = min_temp.begin(); it != min_temp.end(); it++) {
			string name = it->first;
			if (min_all.find(name) == min_all.end()) {
				min_all[name] = min_temp[name];
				max_all[name] = max_temp[name];
			} else {
				min_all[name] = min_all[name] < min_temp[name] ? min_all[name] : min_temp[name];
				max_all[name] = max_all[name] > max_temp[name] ? max_all[name] : max_temp[name];
			}
			min_temp[name] = FLT_MAX;
			max_temp[name] = -FLT_MAX;
		}
	} else {
		for (auto it = min_temp.begin(); it != min_temp.end(); it++) {
			string name = it->first;
			min_temp[name] = FLT_MAX;
			max_temp[name] = -FLT_MAX;
		}
		range_exceeded = false;
	}
}

void checkRange2(float* A, int I, int J) {
	if (!profilingEnabled) {
		return;
	}
	for (int i = 0; i < I; i++) {
		for (int j = 0; j < J; j++) {
			if (fabs(A[i * J + j]) >= 32) {
				range_exceeded = true;
			}
		}
	}
}

void Profile4(float* A, int I, int J, int K, int L, string name) {
	if (!profilingEnabled) {
		return;
	}
	if (min_temp.find(name) == min_temp.end()) {
		min_temp[name] = FLT_MAX;
		max_temp[name] = -FLT_MAX;
		all_values[name] = vector<float>();
	}
	for (int i = 0; i < I; i++) {
		for (int j = 0; j < J; j++) {
			for (int k = 0; k < K; k++) {
				for (int l = 0; l < L; l++) {
					min_temp[name] = min_temp[name] < A[i * J * K * L + j * K * L + k * L + l] ? min_temp[name] : A[i * J * K * L + j * K * L + k * L + l];
					max_temp[name] = max_temp[name] > A[i * J * K * L + j * K * L + k * L + l] ? max_temp[name] : A[i * J * K * L + j * K * L + k * L + l];
					all_values[name].push_back(A[i * J * K * L + j * K * L + k * L + l]);
				}
			}
		}
	}
}

void Profile3(float* A, int I, int J, int K, string name) {
	if (!profilingEnabled) {
		return;
	}
	if (min_temp.find(name) == min_temp.end()) {
		min_temp[name] = FLT_MAX;
		max_temp[name] = -FLT_MAX;
		all_values[name] = vector<float>();
	}
	for (int i = 0; i < I; i++) {
		for (int j = 0; j < J; j++) {
			for (int k = 0; k < K; k++) {
				min_temp[name] = min_temp[name] < A[i * J * K + j * K + k] ? min_temp[name] : A[i * J * K + j * K + k];
				max_temp[name] = max_temp[name] > A[i * J * K + j * K + k] ? max_temp[name] : A[i * J * K + j * K + k];
				all_values[name].push_back(A[i * J * K + j * K + k]);
			}
		}
	}
}

void Profile2(float* A, int I, int J, string name) {
	if (!profilingEnabled) {
		return;
	}
	if (min_temp.find(name) == min_temp.end()) {
		min_temp[name] = FLT_MAX;
		max_temp[name] = -FLT_MAX;
		all_values[name] = vector<float>();
	}
	for (int i = 0; i < I; i++) {
		for (int j = 0; j < J; j++) {
			min_temp[name] = min_temp[name] < A[i * J + j] ? min_temp[name] : A[i * J + j];
			max_temp[name] = max_temp[name] > A[i * J + j] ? max_temp[name] : A[i * J + j];
			all_values[name].push_back(A[i * J + j]);
		}
	}
}

void diff(float* A, MYINT* B, MYINT scale, MYINT I, MYINT J) {
	float min = numeric_limits<float>::max(), max = 0, sum = 0;
	float min_relative = numeric_limits<float>::max(), max_relative = 0, sum_relative = 0;
	int count = 0;

	for (MYINT i = 0; i < I; i++) {
		for (MYINT j = 0; j < J; j++) {
			float a = A[i * J + j];

			MYINT b = B[i * J + j];
			float b_float = float(ldexp(double(b), scale));

			float diff = abs(a - b_float);
			float diff_relative = diff / abs(a);

			if (diff < min) {
				min = diff;
			}
			if (diff > max) {
				max = diff;
			}

			if (diff_relative < min_relative) {
				min_relative = diff_relative;
			}
			if (diff_relative > max_relative) {
				max_relative = diff_relative;
			}

			sum += diff;
			sum_relative += diff_relative;

			count++;
		}
	}

	float avg = sum / count;
	float avg_relative = sum_relative / count;

	cout << max << "\t" << avg << "\t" << min << "\t" << max_relative << "\t" << avg_relative << "\t" << min_relative << endl;

	return;
}

void diff(float* A, MYINT* B, MYINT scale, MYINT I, MYINT J, MYINT K) {
	float min = numeric_limits<float>::max(), max = 0, sum = 0;
	float min_relative = numeric_limits<float>::max(), max_relative = 0, sum_relative = 0;
	int count = 0;

	for (MYINT i = 0; i < I; i++) {
		for (MYINT j = 0; j < J; j++) {
			for (MYINT k = 0; k < K; k++) {
				float a = A[i * J * K + j * K + k];

				MYINT b = B[i * J * K + j * K + k];
				float b_float = float(ldexp(double(b), scale));

				float diff = abs(a - b_float);
				float diff_relative = diff / abs(a);

				if (diff < min) {
					min = diff;
				}
				if (diff > max) {
					max = diff;
				}

				if (diff_relative < min_relative) {
					min_relative = diff_relative;
				}
				if (diff_relative > max_relative) {
					max_relative = diff_relative;
				}

				sum += diff;
				sum_relative += diff_relative;

				count++;
			}
		}
	}

	float avg = sum / count;
	float avg_relative = sum_relative / count;

	cout << max << "\t" << avg << "\t" << min << "\t" << max_relative << "\t" << avg_relative << "\t" << min_relative << endl;

	return;
}
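The diff() helpers above recover the real value of a fixed-point integer with ldexp: real = b * 2^scale, where scale is typically negative so the integer stores the value multiplied by 2^-scale. A small sketch of the round-trip conversion (the helper names are illustrative, not part of the library):

```cpp
#include <cassert>
#include <cmath>

// fixed -> float: the integer b carries an implicit factor of 2^scale.
float fixedToFloat(int b, int scale) {
	return static_cast<float>(ldexp(static_cast<double>(b), scale));
}

// float -> fixed: multiply by 2^-scale and truncate to an integer.
int floatToFixed(float x, int scale) {
	return static_cast<int>(ldexp(static_cast<double>(x), -scale));
}
```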
@@ -3,11 +3,36 @@

#pragma once

#include <string>

// Used by old SeeDot. Invoked before the inference code is executed to
// reset the variables used for profiling to their default values.
void initializeProfiling();

// Methods used by old SeeDot to capture the range of a variable in floating-point mode.
// Used only for exponentiation in old SeeDot.
void updateRange(float x);
void updateRangeOfExp(float x);

// Used by old SeeDot to store the range of the exponentiation variable.
void dumpRange(std::string outputFile);

// Stores the range of all variables into a file.
void dumpProfile();
// Merges the per-datapoint ranges of variables into the global ranges,
// provided the range of the exponentiation variables is within an acceptable threshold.
void flushProfile();

void debug();

// Checks whether the range of the variable exceeds a threshold, beyond which the datapoint is not considered for profiling.
// Please check the OOPSLA'20 paper, Section 5.4, for details.
void checkRange2(float* A, int I, int J);

// Methods used to capture the range of 4-D, 3-D and 2-D variables in floating-point mode, which is used for data-driven scaling.
void Profile4(float* A, int I, int J, int K, int L, std::string name);
void Profile3(float* A, int I, int J, int K, std::string name);
void Profile2(float* A, int I, int J, std::string name);

// Used to capture the difference between corresponding variables in floating-point and fixed-point mode.
void diff(float* A, MYINT* B, MYINT scale, MYINT I, MYINT J);
void diff(float* A, MYINT* B, MYINT scale, MYINT I, MYINT J, MYINT K);

extern bool profilingEnabled;
@@ -1,137 +0,0 @@
// Copyright (c) Microsoft Corporation. All rights reserved.
// Licensed under the MIT license.

#include <iostream>
#include <cstring>
#include <cmath>

#include "datatypes.h"
#include "predictors.h"
#include "protonn_float_model.h"

#define PROFILE 1

#if PROFILE
#include "profile.h"
#endif

using namespace std;
using namespace protonn_float;

int protonnFloat(float *X) {
	MYINT ite_idx, ite_val, index;

	float WX[d];
	memset(WX, 0, sizeof(float) * d);

	ite_idx = 0;
	ite_val = 0;
	// Dimensionality reduction
	for (MYINT i = 0; i < D; i++) {
#if PROFILE
		updateRange(X[i]);
#endif
		float input = X[i];

#if P_SPARSE_W
		index = Widx[ite_idx];
		while (index != 0) {
#if PROFILE
			updateRange(WX[index - 1]);
			updateRange(Wval[ite_val]);
			updateRange(input);
			updateRange(Wval[ite_val] * input);
			updateRange(WX[index - 1] + Wval[ite_val] * input);
#endif
			WX[index - 1] += Wval[ite_val] * input;
			ite_idx++;
			ite_val++;
			index = Widx[ite_idx];
		}
		ite_idx++;
#else
		for (MYINT j = 0; j < d; j++) {
#if PROFILE
			updateRange(WX[j]);
			updateRange(W[j][i]);
			updateRange(input);
			updateRange(W[j][i] * input);
			updateRange(WX[j] + W[j][i] * input);
#endif
			WX[j] += W[j][i] * input;
		}
#endif
	}


#if P_NORM == 0
#elif P_NORM == 1
	for (MYINT i = 0; i < d; i++) {
#if PROFILE
		updateRange(norm[i]);
		updateRange(-norm[i]);
#endif
		WX[i] -= norm[i];
	}
#endif

	float score[c];
	memset(score, 0, sizeof(float) * c);

	for (MYINT i = 0; i < p; i++) {

		// Norm of WX - B
		float v = 0;
		for (MYINT j = 0; j < d; j++) {
#if PROFILE
			updateRange(WX[j]);
			updateRange(B[j][i]);
			updateRange(WX[j] - B[j][i]);
#endif
			float t = WX[j] - B[j][i];
#if PROFILE
			updateRange(v);
			updateRange(t);
			updateRange(t);
			updateRange(t * t);
			updateRange(v + t * t);
#endif
			v += t * t;
		}

		// Prediction distribution
#if PROFILE
		updateRange(g2);
		updateRange(v);
		updateRange(-g2);
		updateRange(-g2 * v);
		updateRange(exp(-g2 * v));

		updateRangeOfExp(g2 * v);
#endif
		float e = exp(-g2 * v);

		for (MYINT j = 0; j < c; j++) {
#if PROFILE
			updateRange(score[j]);
			updateRange(Z[j][i]);
			updateRange(e);
			updateRange(Z[j][i] * e);
			updateRange(score[j] + Z[j][i] * e);
#endif
			score[j] += Z[j][i] * e;
		}
	}

	// Argmax of score
	float max = score[0];
	MYINT classID = 0;
	for (MYINT i = 1; i < c; i++) {
		if (score[i] > max) {
			max = score[i];
			classID = i;
		}
	}

	return classID;
}
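The dimensionality-reduction loop in protonnFloat above walks W in a sentinel-terminated sparse format: for each input feature i, Widx lists the 1-based output rows holding nonzero weights, terminated by a 0, while Wval holds the matching weights in the same order. A minimal sketch of that multiply with the profiling calls stripped out (the function itself is illustrative; the names mirror the model arrays):

```cpp
#include <cassert>

// out += W * x, where W is stored column-wise in the sentinel-terminated
// sparse layout (Widx: 1-based row indices per input feature, 0-terminated;
// Wval: the corresponding weights). D is the number of input features.
void sparseMatVec(const int* Widx, const float* Wval,
                  const float* x, float* out, int D) {
	int ite_idx = 0, ite_val = 0;
	for (int i = 0; i < D; i++) {
		int index = Widx[ite_idx];
		while (index != 0) {
			out[index - 1] += Wval[ite_val] * x[i];  // accumulate column i
			ite_idx++;
			ite_val++;
			index = Widx[ite_idx];
		}
		ite_idx++;  // skip the 0 sentinel for this feature
	}
}
```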
The file diff is hidden because one or more lines are too long.
@@ -7,155 +7,61 @@

#include "datatypes.h"
#include "predictors.h"
#include "library.h"
#include "seedot_fixed_model.h"
#include "profile.h"
#include "library_fixed.h"
#include "model_fixed.h"
#include "vars_fixed.h"

using namespace std;
using namespace bonsai_fixed;
using namespace seedot_fixed;
using namespace vars_fixed;

int seedotFixed(MYINT **X) {
	MYINT tmp6[30][1];
	MYINT tmp7[30][1];
	MYINT node0;
	MYINT tmp9[1][1];
	MYINT tmp8[30];
	MYINT tmp11[1][1];
	MYINT tmp10[30];
	MYINT tmp12[1][1];
	MYINT tmp14[1][1];
	MYINT tmp13[30];
	MYINT node1;
	MYINT tmp16[1][1];
	MYINT tmp15[30];
	MYINT tmp18[1][1];
	MYINT tmp17[30];
	MYINT tmp19[1][1];
	MYINT tmp20[1][1];
	MYINT tmp22[1][1];
	MYINT tmp21[30];
	MYINT node2;
	MYINT tmp24[1][1];
	MYINT tmp23[30];
	MYINT tmp26[1][1];
	MYINT tmp25[30];
	MYINT tmp27[1][1];
	MYINT tmp28[1][1];
	MYINT tmp30[1][1];
	MYINT tmp29[30];
	MYINT node3;
	MYINT tmp32[1][1];
	MYINT tmp31[30];
	MYINT tmp34[1][1];
	MYINT tmp33[30];
	MYINT tmp35[1][1];
	MYINT tmp36[1][1];
	MYINT tmp37;
MYINT vars_fixed::tmp6[20][1];
MYINT vars_fixed::tmp7[20][1];
MYINT vars_fixed::node0;
MYINT vars_fixed::tmp9[1][1];
MYINT vars_fixed::tmp8[20];
MYINT vars_fixed::tmp11[1][1];
MYINT vars_fixed::tmp10[20];
MYINT vars_fixed::tmp12[1][1];
MYINT vars_fixed::tmp14[1][1];
MYINT vars_fixed::tmp13[20];
MYINT vars_fixed::node1;
MYINT vars_fixed::tmp16[1][1];
MYINT vars_fixed::tmp15[20];
MYINT vars_fixed::tmp18[1][1];
MYINT vars_fixed::tmp17[20];
MYINT vars_fixed::tmp19[1][1];
MYINT vars_fixed::tmp20[1][1];
MYINT vars_fixed::tmp22[1][1];
MYINT vars_fixed::tmp21[20];
MYINT vars_fixed::node2;
MYINT vars_fixed::tmp24[1][1];
MYINT vars_fixed::tmp23[20];
MYINT vars_fixed::tmp26[1][1];
MYINT vars_fixed::tmp25[20];
MYINT vars_fixed::tmp27[1][1];
MYINT vars_fixed::tmp28[1][1];
MYINT vars_fixed::tmp30[1][1];
MYINT vars_fixed::tmp29[20];
MYINT vars_fixed::node3;
MYINT vars_fixed::tmp32[1][1];
MYINT vars_fixed::tmp31[20];
MYINT vars_fixed::tmp34[1][1];
MYINT vars_fixed::tmp33[20];
MYINT vars_fixed::tmp35[1][1];
MYINT vars_fixed::tmp36[1][1];
MYINT vars_fixed::tmp37;


	// Z |*| X
	memset(tmp6, 0, sizeof(MYINT) * 30);
	SparseMatMul(&Zidx[0], &Zval[0], X, &tmp6[0][0], 257, 128, 128, 2);

	// tmp6 - mean
	MatSub(&tmp6[0][0], &mean[0][0], &tmp7[0][0], 30, 1, 1, 4, 1);

	node0 = 0;

	// W * ZX
	MatMulCN(&W[node0][0][0], &tmp7[0][0], &tmp9[0][0], &tmp8[0], 1, 30, 1, 128, 128, 0, 5);

	// V * ZX
	MatMulCN(&V[node0][0][0], &tmp7[0][0], &tmp11[0][0], &tmp10[0], 1, 30, 1, 128, 64, 0, 5);

	// tanh(V0)
	TanH(&tmp11[0][0], 1, 1, 2048);

	// W0 <*> V0_tanh
	MulCir(&tmp9[0][0], &tmp11[0][0], &tmp12[0][0], 1, 1, 64, 32);

	// T * ZX
	MatMulCN(&T[node0][0][0], &tmp7[0][0], &tmp14[0][0], &tmp13[0], 1, 30, 1, 128, 128, 1, 4);

	node1 = ((tmp14[0][0] > 0) ? ((2 * node0) + 1) : ((2 * node0) + 2));

	// W * ZX
	MatMulCN(&W[node1][0][0], &tmp7[0][0], &tmp16[0][0], &tmp15[0], 1, 30, 1, 128, 128, 0, 5);

	// V * ZX
	MatMulCN(&V[node1][0][0], &tmp7[0][0], &tmp18[0][0], &tmp17[0], 1, 30, 1, 128, 64, 0, 5);

	// tanh(V1)
	TanH(&tmp18[0][0], 1, 1, 2048);

	// W1 <*> V1_tanh
	MulCir(&tmp16[0][0], &tmp18[0][0], &tmp19[0][0], 1, 1, 64, 32);

	// score0 + tmp19
	MatAdd(&tmp12[0][0], &tmp19[0][0], &tmp20[0][0], 1, 1, 1, 1, 1);

	// T * ZX
	MatMulCN(&T[node1][0][0], &tmp7[0][0], &tmp22[0][0], &tmp21[0], 1, 30, 1, 128, 128, 1, 4);

	node2 = ((tmp22[0][0] > 0) ? ((2 * node1) + 1) : ((2 * node1) + 2));

	// W * ZX
	MatMulCN(&W[node2][0][0], &tmp7[0][0], &tmp24[0][0], &tmp23[0], 1, 30, 1, 128, 128, 0, 5);

	// V * ZX
	MatMulCN(&V[node2][0][0], &tmp7[0][0], &tmp26[0][0], &tmp25[0], 1, 30, 1, 128, 64, 0, 5);

	// tanh(V2)
	TanH(&tmp26[0][0], 1, 1, 2048);

	// W2 <*> V2_tanh
	MulCir(&tmp24[0][0], &tmp26[0][0], &tmp27[0][0], 1, 1, 64, 32);

	// score1 + tmp27
	MatAdd(&tmp20[0][0], &tmp27[0][0], &tmp28[0][0], 1, 1, 1, 1, 1);

	// T * ZX
	MatMulCN(&T[node2][0][0], &tmp7[0][0], &tmp30[0][0], &tmp29[0], 1, 30, 1, 128, 128, 1, 4);

	node3 = ((tmp30[0][0] > 0) ? ((2 * node2) + 1) : ((2 * node2) + 2));

	// W * ZX
	MatMulCN(&W[node3][0][0], &tmp7[0][0], &tmp32[0][0], &tmp31[0], 1, 30, 1, 128, 128, 0, 5);

	// V * ZX
	MatMulCN(&V[node3][0][0], &tmp7[0][0], &tmp34[0][0], &tmp33[0], 1, 30, 1, 128, 64, 0, 5);

	// tanh(V3)
	TanH(&tmp34[0][0], 1, 1, 2048);

	// W3 <*> V3_tanh
	MulCir(&tmp32[0][0], &tmp34[0][0], &tmp35[0][0], 1, 1, 64, 32);

	// score2 + tmp35
	MatAdd(&tmp28[0][0], &tmp35[0][0], &tmp36[0][0], 1, 1, 1, 1, 1);

	// sgn(score3)
	tmp37 = ((tmp36[0][0] > 0) ? 1 : 0);

	return tmp37;
void seedotFixed(MYINT** X, int32_t* res) {
	res[0] = -1;
}

const int switches = 0;

void seedotFixedSwitch(int i, MYINT** X_temp, int32_t* res) {
	switch(i) {
		default: res[0] = -1;
		return;
	}
}
The file diff is hidden because one or more lines are too long.
@@ -0,0 +1,56 @@
// Copyright (c) Microsoft Corporation. All rights reserved.
// Licensed under the MIT license.

#include <iostream>

#include "datatypes.h"
#include "predictors.h"
#include "profile.h"
#include "library_float.h"
#include "model_float.h"
#include "vars_float.h"

using namespace std;
using namespace seedot_float;
using namespace vars_float;

float vars_float::tmp6[20][1];
float vars_float::tmp7[20][1];
MYINT vars_float::node0;
float vars_float::tmp9[1][1];
float vars_float::tmp8[20];
float vars_float::tmp11[1][1];
float vars_float::tmp10[20];
float vars_float::tmp12[1][1];
float vars_float::tmp14[1][1];
float vars_float::tmp13[20];
MYINT vars_float::node1;
float vars_float::tmp16[1][1];
float vars_float::tmp15[20];
float vars_float::tmp18[1][1];
float vars_float::tmp17[20];
float vars_float::tmp19[1][1];
float vars_float::tmp20[1][1];
float vars_float::tmp22[1][1];
float vars_float::tmp21[20];
MYINT vars_float::node2;
float vars_float::tmp24[1][1];
float vars_float::tmp23[20];
float vars_float::tmp26[1][1];
float vars_float::tmp25[20];
float vars_float::tmp27[1][1];
float vars_float::tmp28[1][1];
float vars_float::tmp30[1][1];
float vars_float::tmp29[20];
MYINT vars_float::node3;
float vars_float::tmp32[1][1];
float vars_float::tmp31[20];
float vars_float::tmp34[1][1];
float vars_float::tmp33[20];
float vars_float::tmp35[1][1];
float vars_float::tmp36[1][1];
MYINT vars_float::tmp37;

void seedotFloat(float** X, float* res) {
	res[0] = -1.0;
}
@@ -0,0 +1,46 @@
// Copyright (c) Microsoft Corporation. All rights reserved.
// Licensed under the MIT license.

#pragma once

#include "datatypes.h"

namespace vars_fixed
{
	extern MYINT tmp6[20][1];
	extern MYINT tmp7[20][1];
	extern MYINT node0;
	extern MYINT tmp9[1][1];
	extern MYINT tmp8[20];
	extern MYINT tmp11[1][1];
	extern MYINT tmp10[20];
	extern MYINT tmp12[1][1];
	extern MYINT tmp14[1][1];
	extern MYINT tmp13[20];
	extern MYINT node1;
	extern MYINT tmp16[1][1];
	extern MYINT tmp15[20];
	extern MYINT tmp18[1][1];
	extern MYINT tmp17[20];
	extern MYINT tmp19[1][1];
	extern MYINT tmp20[1][1];
	extern MYINT tmp22[1][1];
	extern MYINT tmp21[20];
	extern MYINT node2;
	extern MYINT tmp24[1][1];
	extern MYINT tmp23[20];
	extern MYINT tmp26[1][1];
	extern MYINT tmp25[20];
	extern MYINT tmp27[1][1];
	extern MYINT tmp28[1][1];
	extern MYINT tmp30[1][1];
	extern MYINT tmp29[20];
	extern MYINT node3;
	extern MYINT tmp32[1][1];
	extern MYINT tmp31[20];
	extern MYINT tmp34[1][1];
	extern MYINT tmp33[20];
	extern MYINT tmp35[1][1];
	extern MYINT tmp36[1][1];
	extern MYINT tmp37;
}
@@ -0,0 +1,46 @@
// Copyright (c) Microsoft Corporation. All rights reserved.
// Licensed under the MIT license.

#pragma once

#include "datatypes.h"

namespace vars_float
{
	extern float tmp6[20][1];
	extern float tmp7[20][1];
	extern MYINT node0;
	extern float tmp9[1][1];
	extern float tmp8[20];
	extern float tmp11[1][1];
	extern float tmp10[20];
	extern float tmp12[1][1];
	extern float tmp14[1][1];
	extern float tmp13[20];
	extern MYINT node1;
	extern float tmp16[1][1];
	extern float tmp15[20];
	extern float tmp18[1][1];
	extern float tmp17[20];
	extern float tmp19[1][1];
	extern float tmp20[1][1];
	extern float tmp22[1][1];
	extern float tmp21[20];
	extern MYINT node2;
	extern float tmp24[1][1];
	extern float tmp23[20];
	extern float tmp26[1][1];
	extern float tmp25[20];
	extern float tmp27[1][1];
	extern float tmp28[1][1];
	extern float tmp30[1][1];
	extern float tmp29[20];
	extern MYINT node3;
	extern float tmp32[1][1];
	extern float tmp31[20];
	extern float tmp34[1][1];
	extern float tmp33[20];
	extern float tmp35[1][1];
	extern float tmp36[1][1];
	extern MYINT tmp37;
}
@@ -0,0 +1,2 @@
input/
output/
@ -0,0 +1,211 @@
// Copyright (c) Microsoft Corporation. All rights reserved.
// Licensed under the MIT license.

using System;
using System.Text;
using System.IO.Ports;

/*
 * This class streams data to the connected Arduino device and reads back the predictions to compute the accuracy of the Arduino implementation.
 */
namespace Streamer
{
	public class DeviceInterface
	{
		// Note: The baud rate used here must match the baud rate specified in the .ino file
		int baud = 115200;

		SerialPort port = null;
		DataType dataType;

		/*
		 * Returns the serial port which is running the prediction code for the given algo and dataset, if any.
		 * Else, exits by throwing an exception.
		 */
		public DeviceInterface(string algo, string dataset, DataType dataType)
		{
			this.dataType = dataType;

			// Handshake messages used to establish a connection with the Arduino device
			string syncMsg = algo;
			string acknMsg = dataset;

			// Try to connect through each serial port
			foreach (string portName in SerialPort.GetPortNames())
			{
				try
				{
					port = new SerialPort(portName, baud);
					port.Open();
					port.ReadTimeout = 500;

					// Flush all the data in the buffer before synchronizing
					if (port.BytesToRead != 0)
					{
						var numBytes = port.BytesToRead;
						byte[] dummy = new byte[numBytes];
						port.Read(dummy, 0, numBytes);
					}

					// Convert the message into an array of bytes in ASCII format
					// and write the bytes into the serial buffer
					byte[] syncMsgBytes = Encoding.ASCII.GetBytes(syncMsg);
					port.Write(syncMsgBytes, 0, syncMsgBytes.Length);

					// Check if the device acknowledges with a reply containing acknMsg.
					// The reply contains a carriage return (\r) at the end; hence, Contains() is used instead of Equals().
					var reply = port.ReadLine();
					if (reply.Contains(acknMsg))
					{
						//System.Threading.Thread.Sleep(5000);
						return;
					}
				}
				catch (Exception)
				{
					try { port.Close(); } catch (Exception e) { Console.WriteLine(e.StackTrace); }
				}
			}

			throw new Exception("Couldn't find the device!");
		}

		/*
		 * Streams the feature vector to the device for prediction and returns the predicted class ID along with the time taken for prediction.
		 */
		public int PredictOnDevice(string features, out ulong predictionTime)
		{
			try
			{
				try
				{
					// Convert each feature to bytes and stream it to the device
					foreach (string featureStr in features.Split(new string[] { ", " }, StringSplitOptions.None))
					{
						string feature = featureStr;
						if (feature.Length == 0)
							throw new Exception("No features present in the data point");

						feature = ProcessFeature(feature);

						// Write the feature to the serial buffer
						byte[] bytes = Encoding.ASCII.GetBytes(feature);
						port.Write(bytes, 0, bytes.Length);

						//System.Threading.Thread.Sleep(1);
					}
				}
				catch (Exception e) { Console.WriteLine(e.StackTrace); }

				while (port.BytesToRead == 0) ;

				// Note: Prediction code identifies class indices from 0.
				// Hence, add one to the predicted label.
				var output = port.ReadLine();
				int classID = int.Parse(output);

				output = port.ReadLine();
				predictionTime = ulong.Parse(output);

				return classID;
			}
			catch (Exception e)
			{
				Console.WriteLine(e.StackTrace);

				if (port != null)
					port.Close();
			}

			throw new Exception("Unable to perform prediction on the device");
		}

		private string ProcessFeature(string feature)
		{
			if (dataType == DataType.Float)
				return ProcessFloatFeature(feature);
			else if (dataType == DataType.Int16)
				return ProcessInt16Feature(feature);
			else
				return ProcessInt32Feature(feature);
		}

		private string ProcessFloatFeature(string feature)
		{
			float max = 9999.000000f;
			int featureLength = 12;

			float value = Math.Abs(float.Parse(feature));
			if (value > max)
				throw new Exception("Float value greater than maximum: " + value);

			// If the feature is an integer, convert it to a float by adding a decimal point
			if (!feature.Contains("."))
				feature = feature + ".";

			// Extend the feature by adding trailing zeroes
			if (feature.Length < featureLength)
				feature = feature.PadRight(featureLength, '0');

			// Add a null terminator at the end
			feature = feature + '\0';

			if (feature.Length > (featureLength + 1))
				throw new Exception("Total length of feature greater than the limit: " + feature);

			return feature;
		}

		private string ProcessInt16Feature(string feature)
		{
			Int32 max = 9999999;
			int featureLength = 9;

			Int32 value = Math.Abs(Int32.Parse(feature));
			if (value > max)
				throw new Exception("Int16 value greater than maximum: " + value);

			// If the feature is an integer, convert it to a float by adding a decimal point
			if (!feature.Contains("."))
				feature = feature + ".";

			// Extend the feature by adding trailing zeroes
			if (feature.Length < featureLength)
				feature = feature.PadRight(featureLength, '0');

			// Add a null terminator at the end
			feature = feature + '\0';

			if (feature.Length > (featureLength + 1))
				throw new Exception("Total length of feature greater than the limit: " + feature);

			return feature;
		}

		private string ProcessInt32Feature(string feature)
		{
			Int64 max = 99999999999;
			int featureLength = 13;

			Int64 value = Math.Abs(Int64.Parse(feature));
			if (value > max)
				throw new Exception("Int32 value greater than maximum: " + value);

			// If the feature is an integer, convert it to a float by adding a decimal point
			if (!feature.Contains("."))
				feature = feature + ".";

			// Extend the feature by adding trailing zeroes
			if (feature.Length < featureLength)
				feature = feature.PadRight(featureLength, '0');

			// Add a null terminator at the end
			feature = feature + '\0';

			if (feature.Length > (featureLength + 1))
				throw new Exception("Total length of feature greater than the limit: " + feature);

			return feature;
		}
	}
}
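The three ProcessFeature variants above all frame a feature the same way: ensure a decimal point, right-pad with zeroes to a fixed width, and append a null terminator so the device can read a known number of bytes per feature. A minimal C++ sketch of that framing (the function name `formatFeature` is hypothetical, not part of the streamer):

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Pad a numeric feature string to a fixed width and append a null
// terminator, mirroring the framing used by the C# streamer above.
// featureLength = 12 matches the float path; the int paths use 9 and 13.
std::string formatFeature(std::string feature, int featureLength) {
	// Ensure a decimal point so the device always parses a float.
	if (feature.find('.') == std::string::npos)
		feature += '.';
	// Right-pad with zeroes up to the fixed width.
	while ((int)feature.size() < featureLength)
		feature += '0';
	// The null terminator marks the end of the feature on the wire.
	feature += '\0';
	if ((int)feature.size() > featureLength + 1)
		throw std::runtime_error("feature too long: " + feature);
	return feature;
}
```

The device side can then read exactly `featureLength + 1` bytes per feature without any delimiter scanning.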
@ -0,0 +1,122 @@
// Copyright (c) Microsoft Corporation. All rights reserved.
// Licensed under the MIT license.

using System;
using System.IO;

/*
 * Controller for the DeviceInterface class, which streams data to the Arduino device.
 */
namespace Streamer
{
	public enum DataType { Float, Int16, Int32 };

	class Program
	{
		DataType datatype = DataType.Float;

		static int Main(string[] args)
		{
			new Program().Run();
			return 0;
		}

		string[] X, Y;
		string outputFile;

		public void Run()
		{
			if (datatype == DataType.Float)
				Console.WriteLine("Reading data in FLOAT format");
			else if (datatype == DataType.Int16)
				Console.WriteLine("Reading data in INT16 format");
			else
				Console.WriteLine("Reading data in INT32 format");

			string projectDir = Directory.GetParent(Directory.GetCurrentDirectory()).Parent.FullName;
			string inputDir = Path.Combine(projectDir, "input");

			string outputDir = "output";
			Directory.CreateDirectory(Path.Combine(projectDir, outputDir));

			outputFile = Path.Combine(projectDir, outputDir, "prediction-info.txt");

			ReadDataset(Path.Combine(inputDir, "X.csv"), Path.Combine(inputDir, "Y.csv"));

			DeviceInterface device = new DeviceInterface("fixed", "point", datatype);

			PerformPrediction(device);

			return;
		}

		// Read and validate the dataset
		public void ReadDataset(string X_file, string Y_file)
		{
			X = File.ReadAllLines(X_file);
			Y = File.ReadAllLines(Y_file);

			if (X.Length != Y.Length)
				throw new Exception("Number of data points is not equal to the number of labels");

			int featuresLength = X[0].Split(new string[] { ", " }, StringSplitOptions.None).Length;
			int labelsLength = 1;

			// Validate the dataset
			for (int i = 0; i < X.Length; i++)
			{
				int X_length = X[i].Split(new string[] { ", " }, StringSplitOptions.None).Length;
				int Y_length = Y[i].Split(new string[] { ", " }, StringSplitOptions.None).Length;

				if (X_length != featuresLength || Y_length != labelsLength)
					throw new Exception("Number of features or number of labels is not consistent");

				Y[i] = Y[i].Split(new string[] { ", " }, StringSplitOptions.None)[0];
			}

			return;
		}

		public void PerformPrediction(DeviceInterface device)
		{
			int correct = 0, total = 0;
			ulong totalPredictionTime = 0;

			using (StreamWriter file = new StreamWriter(outputFile))
			{
				// Read each data point, predict on the device, and compare the class IDs
				for (int i = 0; i < X.Length; i++)
				{
					int classID = device.PredictOnDevice(X[i], out ulong predictionTime);
					var label = Y[i];

					if (classID.ToString().Equals(label))
					{
						Console.WriteLine((i + 1) + ": Correct prediction in " + predictionTime + " \u00b5sec");
						correct++;
					}
					else
					{
						Console.WriteLine((i + 1) + ": Incorrect prediction " + classID + "/" + label);
						//file.WriteLine("Incorrect prediction for input " + (i + 1));
						file.WriteLine("Incorrect prediction for input " + (total + 1) + ". Predicted " + classID + " Expected " + label);
					}

					totalPredictionTime += predictionTime;
					total++;
					Console.WriteLine("Accuracy: " + (((float)correct / total) * 100));
				}

				file.WriteLine("\n\n#test points = " + total);
				file.WriteLine("Correct predictions = " + correct);
				file.WriteLine("Accuracy = " + (((float)correct / total) * 100).ToString("0.000") + "\n");

				Console.WriteLine("\n\nCorrect: " + correct);
				Console.WriteLine("Accuracy: " + (((float)correct / total) * 100));
				Console.WriteLine("Average prediction time: " + ((float)totalPredictionTime / total) + " \u00b5sec\n");
			}

			return;
		}
	}
}

@ -0,0 +1,3 @@
predict.cpp
model.h
library.h
@ -97,31 +97,30 @@ void predictionTime() {

// In accuracy mode, this function reads an integer from the serial port.
// In prediction-time mode, this function reads an integer from the array X stored in the device's flash memory.
MYINT getIntFeature(MYINT i) {
int32_t getIntFeature(MYITE i) {
#ifdef ACCURACY
#ifdef INT16
	char buff[10];
	char buff[13];
	while (!Serial.available())
		;
	Serial.readBytes(buff, 10);
	return (MYINT)(atol(buff));
#endif
#ifdef INT32
	char buff[14];
	while (!Serial.available())
		;
	Serial.readBytes(buff, 14);
	return (MYINT)(atoll(buff));
#endif
	Serial.readBytes(buff, 13);
	double f = (float)(atof(buff));
#endif

#ifdef PREDICTIONTIME
#ifdef INT16
	return ((MYINT) pgm_read_word_near(&X[i]));
#endif
#ifdef INT32
	return ((MYINT) pgm_read_dword_near(&X[i]));
#ifdef XFLOAT
	double f = ((float) pgm_read_float_near(&X[i]));
#endif
#ifdef XINT8
	return ((int8_t) pgm_read_byte_near(&Xint[i]));
#endif
#ifdef XINT16
	return ((int16_t) pgm_read_word_near(&Xint[i]));
#endif
#endif

#ifdef XFLOAT
	double f_int = ldexp(f, -scaleOfX);
	return (int32_t)(f_int);
#endif
}

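The XFLOAT path above converts a float feature to its fixed-point representation with `ldexp(f, -scaleOfX)`, i.e. multiplication by 2^(-scaleOfX). A minimal sketch of that conversion and its inverse (the helper names and the scale value are illustrative, not taken from any generated model):

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// A real value x at scale s is stored as the integer part of x * 2^(-s),
// matching the ldexp(f, -scaleOfX) conversion in getIntFeature above.
int32_t toFixed(double x, int scale) {
	return (int32_t)std::ldexp(x, -scale);
}

// Recover an approximation of the real value from its fixed-point form.
double toFloat(int32_t q, int scale) {
	return std::ldexp((double)q, scale);
}
```

A negative scale (the common case here) widens the integer: with scale −4, the value 1.5 is stored as 24.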
@ -0,0 +1,6 @@
// Copyright (c) Microsoft Corporation. All rights reserved.
// Licensed under the MIT license.

// The datatype of the fixed-point representation is specified below.
#define INT16
#define XFLOAT

@ -15,22 +15,21 @@
//#define ACCURACY
#define PREDICTIONTIME

#include "compileConfig.h"

// The datatype of the fixed-point representation is specified below.
// The selection below should be equal to the Common.wordLength variable in Common.py
// Uncomment the below #define to choose the datatype. Default is INT16.


#define INT16
//#define INT32

#ifdef INT8
typedef int8_t MYINT;
typedef uint8_t MYUINT;
#endif

#ifdef INT16
typedef int16_t MYINT;
typedef uint16_t MYUINT;
#endif

#ifdef INT32
typedef int32_t MYINT;
typedef uint32_t MYUINT;
#endif

typedef uint16_t MYUINT;
typedef int16_t MYITE;

@ -1,142 +0,0 @@
// Copyright (c) Microsoft Corporation. All rights reserved.
// Licensed under the MIT license.

#include <Arduino.h>

#include "config.h"
#include "predict.h"
#include "model.h"

#define TANH 0

using namespace model;

int predict() {
	MYINT ite_idx, ite_val, index;

	float ZX[d];
	memset(ZX, 0, sizeof(float) * d);

	ite_idx = 0;
	ite_val = 0;
	// Dimensionality reduction
	for (MYINT i = 0; i < D; i++) {
		float input = getFloatFeature(i);

#if B_SPARSE_Z
		index = ((MYINT) pgm_read_word_near(&Zidx[ite_idx]));
		while (index != 0) {
			ZX[index - 1] += ((float) pgm_read_float_near(&Zval[ite_val])) * input;
			ite_idx++;
			ite_val++;
			index = ((MYINT) pgm_read_word_near(&Zidx[ite_idx]));
		}
		ite_idx++;
#else
		for (MYINT j = 0; j < d; j++) {
			ZX[j] += ((float) pgm_read_float_near(&Z[j][i])) * input;
		}
#endif
	}

	for (MYINT i = 0; i < d; i++)
		ZX[i] -= ((float) pgm_read_float_near(&mean[i]));

	MYINT currNode = 0;
	float WZX[c], VZX[c], score[c];

	memset(score, 0, sizeof(float) * c);

	while (currNode < internalNodes) {
		memset(WZX, 0, sizeof(float) * c);
		memset(VZX, 0, sizeof(float) * c);

		// Accumulating the score at each node
		for (MYINT i = 0; i < d; i++) {
			for (MYINT j = currNode * c; j < (currNode + 1) * c; j++) {
				WZX[j % c] += ((float) pgm_read_float_near(&W[j][i])) * ZX[i];
				VZX[j % c] += ((float) pgm_read_float_near(&V[j][i])) * ZX[i];
			}
		}

		for (MYINT i = 0; i < c; i++) {
			float t;
			if (VZX[i] > tanh_limit)
				t = tanh_limit;
			else if (VZX[i] < -tanh_limit)
				t = -tanh_limit;
			else
				t = VZX[i];

#if TANH
			score[i] += WZX[i] * tanh(VZX[i]);
#else
			score[i] += WZX[i] * t;
#endif
		}

		// Computing the theta value for branching into a child node
		float val = 0;
		for (MYINT i = 0; i < d; i++)
			val += ((float) pgm_read_float_near(&T[currNode][i])) * ZX[i];

		if (val > 0)
			currNode = 2 * currNode + 1;
		else
			currNode = 2 * currNode + 2;
	}

	memset(WZX, 0, sizeof(float) * c);
	memset(VZX, 0, sizeof(float) * c);

	// Accumulating the score for the last node
	for (MYINT i = 0; i < d; i++) {
		for (MYINT j = currNode * c; j < (currNode + 1) * c; j++) {
			WZX[j % c] += ((float) pgm_read_float_near(&W[j][i])) * ZX[i];
			VZX[j % c] += ((float) pgm_read_float_near(&V[j][i])) * ZX[i];
		}
	}

	for (MYINT i = 0; i < c; i++) {
		float t;
		if (VZX[i] > tanh_limit)
			t = tanh_limit;
		else if (VZX[i] < -tanh_limit)
			t = -tanh_limit;
		else
			t = VZX[i];

#if TANH
		score[i] += WZX[i] * tanh(VZX[i]);
#else
		score[i] += WZX[i] * t;
#endif
	}

	MYINT classID;

	// Finding the class ID:
	// if binary classification, the sign of the score is used;
	// if multiclass classification, the argmax of the scores is used.
	if (c <= 2) {
		if (score[0] > 0)
			classID = 1;
		else
			classID = 0;
	}
	else {
		float max = score[0];
		MYINT maxI = 0;
		for (MYINT i = 1; i < c; i++) {
			if (score[i] > max) {
				max = score[i];
				maxI = i;
			}
		}
		classID = maxI;
	}

	return classID;
}

@ -1,75 +0,0 @@
// Copyright (c) Microsoft Corporation. All rights reserved.
// Licensed under the MIT license.

#include <Arduino.h>

#include "config.h"
#include "predict.h"
#include "model.h"

using namespace model;

int predict() {
	MYINT ite_idx, ite_val, index;

	float WX[d];
	memset(WX, 0, sizeof(float) * d);

	ite_idx = 0;
	ite_val = 0;
	// Dimensionality reduction
	for (MYINT i = 0; i < D; i++) {
		float input = getFloatFeature(i);

#if P_SPARSE_W
		index = ((MYINT) pgm_read_word_near(&Widx[ite_idx]));
		while (index != 0) {
			WX[index - 1] += ((float) pgm_read_float_near(&Wval[ite_val])) * input;
			ite_idx++;
			ite_val++;
			index = ((MYINT) pgm_read_word_near(&Widx[ite_idx]));
		}
		ite_idx++;
#else
		for (MYINT j = 0; j < d; j++)
			WX[j] += ((float) pgm_read_float_near(&W[j][i])) * input;
#endif
	}

#if P_NORM == 0
#elif P_NORM == 1
	for (MYINT i = 0; i < d; i++)
		WX[i] -= ((float) pgm_read_float_near(&norm[i]));
#endif

	float score[c];
	memset(score, 0, sizeof(float) * c);

	for (MYINT i = 0; i < p; i++) {
		// Norm of WX - B
		float v = 0;
		for (MYINT j = 0; j < d; j++) {
			float t = WX[j] - ((float) pgm_read_float_near(&B[j][i]));
			v += t * t;
		}

		// Prediction distribution
		float e = exp(-g2 * v);

		for (MYINT j = 0; j < c; j++)
			score[j] += ((float) pgm_read_float_near(&Z[j][i])) * e;
	}

	// Argmax of score
	float max = score[0];
	MYINT classID = 0;
	for (MYINT i = 1; i < c; i++) {
		if (score[i] > max) {
			max = score[i];
			classID = i;
		}
	}

	return classID;
}

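The ProtoNN scoring loop above computes, for each prototype, a Gaussian similarity exp(-g2 * ||WX - B_i||^2) to the projected input and accumulates that prototype's label vector Z_i weighted by it. A minimal host-side sketch (note: for readability this sketch stores prototypes row-wise, whereas the flash arrays above are indexed `B[j][i]` / `Z[j][i]`; all values are illustrative):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// ProtoNN-style scoring: each prototype B[i] contributes its label
// vector Z[i] weighted by a Gaussian similarity to the projected input wx.
std::vector<float> protoNNScores(const std::vector<std::vector<float>>& B,
                                 const std::vector<std::vector<float>>& Z,
                                 const std::vector<float>& wx, float gamma2) {
	std::vector<float> score(Z[0].size(), 0.0f);
	for (size_t i = 0; i < B.size(); i++) {
		float v = 0.0f;                        // squared distance ||wx - B[i]||^2
		for (size_t j = 0; j < wx.size(); j++) {
			float t = wx[j] - B[i][j];
			v += t * t;
		}
		float e = std::exp(-gamma2 * v);       // Gaussian similarity
		for (size_t j = 0; j < score.size(); j++)
			score[j] += Z[i][j] * e;           // weighted label contribution
	}
	return score;
}
```

The predicted class is then the argmax over the score vector, exactly as in the loop that follows the accumulation above.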
@ -1,542 +0,0 @@
// Copyright (c) Microsoft Corporation. All rights reserved.
// Licensed under the MIT license.

#pragma once

#include <Arduino.h>

#include "config.h"
#include "predict.h"

// This file contains implementations of the linear algebra operators supported by SeeDot.
// Each function takes the scaling factors as arguments along with the pointers to the operands.

// C = A + B
inline __attribute__((always_inline)) void MatAdd(MYINT *A, MYINT *B, MYINT *C, MYINT I, MYINT J, MYINT shrA, MYINT shrB, MYINT shrC) {
	for (MYINT i = 0; i < I; i++) {
		for (MYINT j = 0; j < J; j++) {
			MYINT a = A[i * J + j];
			MYINT b = B[i * J + j];

			a = a / shrA;
			b = b / shrB;

			MYINT c = a + b;
			c = c / shrC;

			C[i * J + j] = c;
		}
	}
	return;
}

// C = A - B
inline __attribute__((always_inline)) void MatSub(MYINT *A, const MYINT *B, MYINT *C, MYINT I, MYINT J, MYINT shrA, MYINT shrB, MYINT shrC) {
	for (MYINT i = 0; i < I; i++) {
		for (MYINT j = 0; j < J; j++) {
			MYINT a = A[i * J + j];
			MYINT b = ((MYINT) pgm_read_word_near(&B[i * J + j]));

			a = a / shrA;
			b = b / shrB;

			MYINT c = a - b;
			c = c / shrC;

			C[i * J + j] = c;
		}
	}
	return;
}

// C = A * B
inline __attribute__((always_inline)) void MatMulNN(MYINT *A, MYINT *B, MYINT *C, MYINT *tmp, MYINT I, MYINT K, MYINT J, MYINT shrA, MYINT shrB, MYINT H1, MYINT H2) {
	for (MYINT i = 0; i < I; i++) {
		for (MYINT j = 0; j < J; j++) {
			for (MYINT k = 0; k < K; k++) {
				MYINT a = A[i * K + k];
				MYINT b = B[k * J + j];

				a = a / shrA;
				b = b / shrB;

				tmp[k] = a * b;
			}

			MYINT count = K, depth = 0;
			bool shr = true;

			while (depth < (H1 + H2)) {
				if (depth >= H1)
					shr = false;

				for (MYINT p = 0; p < (K / 2 + 1); p++) {
					MYINT sum;
					if (p < (count >> 1))
						sum = tmp[2 * p] + tmp[(2 * p) + 1];
					else if ((p == (count >> 1)) && ((count & 1) == 1))
						sum = tmp[2 * p];
					else
						sum = 0;

					if (shr)
						tmp[p] = sum / 2;
					else
						tmp[p] = sum;
				}
				count = (count + 1) >> 1;

				depth++;
			}

			C[i * J + j] = tmp[0];
		}
	}
	return;
}

// C = A * B
inline __attribute__((always_inline)) void MatMulCN(const MYINT *A, MYINT *B, MYINT *C, MYINT *tmp, MYINT I, MYINT K, MYINT J, MYINT shrA, MYINT shrB, MYINT H1, MYINT H2) {
	for (MYINT i = 0; i < I; i++) {
		for (MYINT j = 0; j < J; j++) {
			for (MYINT k = 0; k < K; k++) {
				MYINT a = ((MYINT) pgm_read_word_near(&A[i * K + k]));
				MYINT b = B[k * J + j];

				a = a / shrA;
				b = b / shrB;

				tmp[k] = a * b;
			}

			MYINT count = K, depth = 0;
			bool shr = true;

			while (depth < (H1 + H2)) {
				if (depth >= H1)
					shr = false;

				for (MYINT p = 0; p < (K / 2 + 1); p++) {
					MYINT sum;
					if (p < (count >> 1))
						sum = tmp[2 * p] + tmp[(2 * p) + 1];
					else if ((p == (count >> 1)) && ((count & 1) == 1))
						sum = tmp[2 * p];
					else
						sum = 0;

					if (shr)
						tmp[p] = sum / 2;
					else
						tmp[p] = sum;
				}
				count = (count + 1) >> 1;

				depth++;
			}

			C[i * J + j] = tmp[0];
		}
	}
	return;
}

// C = A * B
inline __attribute__((always_inline)) void MatMulNC(MYINT *A, const MYINT *B, MYINT *C, MYINT *tmp, MYINT I, MYINT K, MYINT J, MYINT shrA, MYINT shrB, MYINT H1, MYINT H2) {
	for (MYINT i = 0; i < I; i++) {
		for (MYINT j = 0; j < J; j++) {
			for (MYINT k = 0; k < K; k++) {
				MYINT a = A[i * K + k];
				MYINT b = ((MYINT) pgm_read_word_near(&B[k * J + j]));

				a = a / shrA;
				b = b / shrB;

				tmp[k] = a * b;
			}

			MYINT count = K, depth = 0;
			bool shr = true;

			while (depth < (H1 + H2)) {
				if (depth >= H1)
					shr = false;

				for (MYINT p = 0; p < (K / 2 + 1); p++) {
					MYINT sum;
					if (p < (count >> 1))
						sum = tmp[2 * p] + tmp[(2 * p) + 1];
					else if ((p == (count >> 1)) && ((count & 1) == 1))
						sum = tmp[2 * p];
					else
						sum = 0;

					if (shr)
						tmp[p] = sum / 2;
					else
						tmp[p] = sum;
				}
				count = (count + 1) >> 1;

				depth++;
			}

			C[i * J + j] = tmp[0];
		}
	}
	return;
}

// C = A * B
inline __attribute__((always_inline)) void MatMulCC(const MYINT *A, const MYINT *B, MYINT *C, MYINT *tmp, MYINT I, MYINT K, MYINT J, MYINT shrA, MYINT shrB, MYINT H1, MYINT H2) {
	for (MYINT i = 0; i < I; i++) {
		for (MYINT j = 0; j < J; j++) {
			for (MYINT k = 0; k < K; k++) {
				MYINT a = ((MYINT) pgm_read_word_near(&A[i * K + k]));
				MYINT b = ((MYINT) pgm_read_word_near(&B[k * J + j]));

				a = a / shrA;
				b = b / shrB;

				tmp[k] = a * b;
			}

			MYINT count = K, depth = 0;
			bool shr = true;

			while (depth < (H1 + H2)) {
				if (depth >= H1)
					shr = false;

				for (MYINT p = 0; p < (K / 2 + 1); p++) {
					MYINT sum;
					if (p < (count >> 1))
						sum = tmp[2 * p] + tmp[(2 * p) + 1];
					else if ((p == (count >> 1)) && ((count & 1) == 1))
						sum = tmp[2 * p];
					else
						sum = 0;

					if (shr)
						tmp[p] = sum / 2;
					else
						tmp[p] = sum;
				}
				count = (count + 1) >> 1;

				depth++;
			}

			C[i * J + j] = tmp[0];
		}
	}
	return;
}

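The four MatMul variants above share the same inner-product reduction: the K products are summed pairwise over ceil(log2 K) levels, and partial sums are halved at the first H1 levels (to keep the fixed-point values in range) while the remaining H2 levels add exactly. A standalone sketch of that tree-sum (hypothetical helper, vector-based for testability):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Tree-sum in the style of the MatMul*/Conv kernels above: pairwise
// adds over H1 + H2 levels; partial sums are divided by 2 at the first
// H1 levels for overflow control, then added exactly at the last H2.
int16_t treeSum(std::vector<int16_t> tmp, int H1, int H2) {
	int count = (int)tmp.size();
	for (int depth = 0; depth < H1 + H2; depth++) {
		bool shr = depth < H1;
		for (int p = 0; p < (count + 1) / 2; p++) {
			// Pair up elements; a leftover odd element passes through.
			int sum = (2 * p + 1 < count) ? tmp[2 * p] + tmp[2 * p + 1]
			                              : tmp[2 * p];
			tmp[p] = (int16_t)(shr ? sum / 2 : sum);
		}
		count = (count + 1) / 2;
	}
	return tmp[0];
}
```

With H1 halvings the result carries an extra scale of 2^H1, which the compiler accounts for when it assigns the output scale.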
// C = A |*| B
inline __attribute__((always_inline)) void SparseMatMul(const MYINT *Aidx, const MYINT *Aval, MYINT *C, MYINT K, MYINT shrA, MYINT shrB, MYINT shrC) {
	MYINT ite_idx = 0, ite_val = 0;
	for (MYINT k = 0; k < K; k++) {
		MYINT b = getIntFeature(k);
		//MYINT b = B[k * 1][0];
		b = b / shrB;

		MYINT idx = ((MYINT) pgm_read_word_near(&Aidx[ite_idx]));
		while (idx != 0) {
			MYINT a = ((MYINT) pgm_read_word_near(&Aval[ite_val]));
			a = a / shrA;

			MYINT c = a * b;
			c = c / shrC;

			C[idx - 1] += c;

			ite_idx++;
			ite_val++;

			idx = ((MYINT) pgm_read_word_near(&Aidx[ite_idx]));
		}
		ite_idx++;
	}

	return;
}

// C = A <*> B
inline __attribute__((always_inline)) void MulCir(MYINT *A, MYINT *B, MYINT *C, MYINT I, MYINT J, MYINT shrA, MYINT shrB) {
	for (MYINT i = 0; i < I; i++) {
		for (MYINT j = 0; j < J; j++) {
			MYINT a = A[i * J + j];
			MYINT b = B[i * J + j];

			a = a / shrA;
			b = b / shrB;

			C[i * J + j] = a * b;
		}
	}
	return;
}

// A = tanh(A)
inline __attribute__((always_inline)) void TanH(MYINT *A, MYINT I, MYINT J, MYINT tanh_limit) {
	for (MYINT i = 0; i < I; i++) {
		for (MYINT j = 0; j < J; j++) {
			MYINT x = A[i * J + j], y;

			if (x >= tanh_limit)
				y = tanh_limit;
			else if (x <= -tanh_limit)
				y = -tanh_limit;
			else
				y = x;

			A[i * J + j] = y;
		}
	}
	return;
}

// index = argmax(A)
inline __attribute__((always_inline)) void ArgMax(MYINT *A, MYINT I, MYINT J, MYINT *index) {
	MYINT max = A[0], maxIndex = 0;
	MYINT counter = 0;
	for (MYINT i = 0; i < I; i++) {
		for (MYINT j = 0; j < J; j++) {
			MYINT x = A[i * J + j];

			if (max < x) {
				maxIndex = counter;
				max = x;
			}

			counter++;
		}
	}

	*index = maxIndex;

	return;
}

// B = A^T
inline __attribute__((always_inline)) void Transpose(MYINT *A, MYINT *B, MYINT I, MYINT J) {
	for (MYINT i = 0; i < I; i++) {
		for (MYINT j = 0; j < J; j++) {
			B[i * J + j] = A[j * I + i];
		}
	}
	return;
}

// C = a * B
inline __attribute__((always_inline)) void ScalarMul(MYINT *A, MYINT *B, MYINT *C, MYINT I, MYINT J, MYINT shrA, MYINT shrB) {
	MYINT a = *A;
	a = a / shrA;

	for (MYINT i = 0; i < I; i++) {
		for (MYINT j = 0; j < J; j++) {
			MYINT b = B[i * J + j];
			b = b / shrB;

			C[i * J + j] = a * b;
		}
	}

	return;
}

// C = A # B
// A[N][H][W][CI], B[HF][WF][CI][CO], C[N][H][W][CO]
inline __attribute__((always_inline)) void Conv(MYINT *A, const MYINT *B, MYINT *C, MYINT *tmp, MYINT N, MYINT H, MYINT W, MYINT CI, MYINT HF, MYINT WF, MYINT CO, MYINT shrA, MYINT shrB, MYINT H1, MYINT H2) {
	MYINT padH = (HF - 1) / 2;
	MYINT padW = (WF - 1) / 2;

	for (MYINT n = 0; n < N; n++) {
		for (MYINT h = 0; h < H; h++) {
			for (MYINT w = 0; w < W; w++) {
				for (MYINT co = 0; co < CO; co++) {

					MYINT counter = 0;
					for (MYINT hf = 0; hf < HF; hf++) {
						for (MYINT wf = 0; wf < WF; wf++) {
							for (MYINT ci = 0; ci < CI; ci++) {
								MYINT a = (((((h + hf) < padH) || ((h + hf) >= (H + padH))) || (((w + wf) < padW) || ((w + wf) >= (W + padW)))) ? 0 : A[n * H * W * CI + ((h + hf) - padH) * W * CI + ((w + wf) - padW) * CI + ci]);
								a = a / shrA;

								MYINT b = ((MYINT) pgm_read_word_near(&B[hf * WF * CI * CO + wf * CI * CO + ci * CO + co]));
								b = b / shrB;

								tmp[counter] = a * b;
								counter++;
							}
						}
					}

					MYINT totalEle = HF * WF * CI;
					MYINT count = HF * WF * CI, depth = 0;
					bool shr = true;

					while (depth < (H1 + H2)) {
						if (depth >= H1)
							shr = false;

						for (MYINT p = 0; p < (totalEle / 2 + 1); p++) {
							MYINT sum;
							if (p < (count >> 1))
								sum = tmp[2 * p] + tmp[(2 * p) + 1];
							else if ((p == (count >> 1)) && ((count & 1) == 1))
								sum = tmp[2 * p];
							else
								sum = 0;

							if (shr)
								tmp[p] = sum / 2;
							else
								tmp[p] = sum;
						}
						count = (count + 1) >> 1;

						depth++;
					}

					C[n * H * W * CO + h * W * CO + w * CO + co] = tmp[0];
				}
			}
		}
	}

	return;
}

// A = A <+> B
// A[N][H][W][C], B[C]
inline __attribute__((always_inline)) void AddOrSubCir4D(MYINT *A, const MYINT *B, MYINT N, MYINT H, MYINT W, MYINT C, MYINT shrA, MYINT shrB, MYINT shrC, bool add) {
	for (MYINT n = 0; n < N; n++) {
		for (MYINT h = 0; h < H; h++) {
			for (MYINT w = 0; w < W; w++) {
				for (MYINT c = 0; c < C; c++) {
					MYINT a = A[n * H * W * C + h * W * C + w * C + c];
					a = a / shrA;

					MYINT b = ((MYINT) pgm_read_word_near(&B[c]));
					b = b / shrB;

					MYINT res;
					if (add)
						res = a + b;
					else
						res = a - b;

					res = res / shrC;

					A[n * H * W * C + h * W * C + w * C + c] = res;
				}
			}
		}
	}

	return;
}

// A = A <+> B
// A[H][W], B[W]
inline __attribute__((always_inline)) void AddOrSubCir2D(MYINT *A, const MYINT *B, MYINT H, MYINT W, MYINT shrA, MYINT shrB, MYINT shrC, bool add) {
	for (MYINT h = 0; h < H; h++) {
		for (MYINT w = 0; w < W; w++) {
			MYINT a = A[h * W + w];
			a = a / shrA;

			MYINT b = ((MYINT) pgm_read_word_near(&B[w]));
			b = b / shrB;

			MYINT res;
			if (add)
				res = a + b;
			else
				res = a - b;

			res = res / shrC;

			A[h * W + w] = res;
		}
	}

	return;
}

// A = relu(A)
// A[N][H][W][C]
inline __attribute__((always_inline)) void Relu4D(MYINT *A, MYINT N, MYINT H, MYINT W, MYINT C) {
	for (MYINT n = 0; n < N; n++) {
		for (MYINT h = 0; h < H; h++) {
			for (MYINT w = 0; w < W; w++) {
				for (MYINT c = 0; c < C; c++) {
					MYINT a = A[n * H * W * C + h * W * C + w * C + c];
					if (a < 0)
						a = 0;

					A[n * H * W * C + h * W * C + w * C + c] = a;
				}
			}
		}
	}

	return;
}

// A = relu(A)
// A[H][W]
inline __attribute__((always_inline)) void Relu2D(MYINT *A, MYINT H, MYINT W) {
	for (MYINT h = 0; h < H; h++) {
		for (MYINT w = 0; w < W; w++) {
			MYINT a = A[h * W + w];
			if (a < 0)
				a = 0;

			A[h * W + w] = a;
		}
	}

	return;
}

// B = maxpool(A)
// A[N][H][W][C], B[N][H/stride][W/stride][C]
inline __attribute__((always_inline)) void Maxpool(MYINT *A, MYINT *B, MYINT N, MYINT H, MYINT W, MYINT C, MYINT stride) {
	MYINT HO = H / stride;
	MYINT WO = W / stride;

	for (MYINT n = 0; n < N; n++) {
		for (MYINT ho = 0; ho < HO; ho++) {
			for (MYINT wo = 0; wo < WO; wo++) {
				for (MYINT c = 0; c < C; c++) {

					MYINT max = A[n * H * W * C + (stride * ho) * W * C + (stride * wo) * C + c];
					for (MYINT hs = 0; hs < stride; hs++) {
						for (MYINT ws = 0; ws < stride; ws++) {
							MYINT a = A[n * H * W * C + ((stride * ho) + hs) * W * C + ((stride * wo) + ws) * C + c];
							if (a > max)
								max = a;
						}
					}

					B[n * HO * WO * C + ho * WO * C + wo * C + c] = max;
				}
			}
		}
	}

	return;
}

File diff suppressed because it is too large
@@ -0,0 +1,688 @@
// Copyright (c) Microsoft Corporation. All rights reserved.
// Licensed under the MIT license.

#pragma once

#include <Arduino.h>

#include "config.h"
#include "predict.h"

// C = A + B
inline __attribute__((always_inline)) void MatAddNN(float* A, float* B, float* C, MYINT I, MYINT J, MYINT shrA, MYINT shrB, MYINT shrC) {
    for (MYITE i = 0; i < I; i++) {
        for (MYITE j = 0; j < J; j++) {
            float a = A[i * J + j];
            float b = B[i * J + j];

            float c = a + b;

            C[i * J + j] = c;
        }
    }
    return;
}

// C = A + B
inline __attribute__((always_inline)) void MatAddCN(const float* A, float* B, float* C, MYINT I, MYINT J, MYINT shrA, MYINT shrB, MYINT shrC) {
    for (MYITE i = 0; i < I; i++) {
        for (MYITE j = 0; j < J; j++) {
            float a = ((float) pgm_read_float_near(&A[i * J + j]));
            float b = B[i * J + j];

            float c = a + b;

            C[i * J + j] = c;
        }
    }
    return;
}

// C = A + B
inline __attribute__((always_inline)) void MatAddNC(float* A, const float* B, float* C, MYINT I, MYINT J, MYINT shrA, MYINT shrB, MYINT shrC) {
    for (MYITE i = 0; i < I; i++) {
        for (MYITE j = 0; j < J; j++) {
            float a = A[i * J + j];
            float b = ((float) pgm_read_float_near(&B[i * J + j]));

            float c = a + b;

            C[i * J + j] = c;
        }
    }
    return;
}

// C = A + B
inline __attribute__((always_inline)) void MatAddCC(const float* A, const float* B, float* C, MYINT I, MYINT J, MYINT shrA, MYINT shrB, MYINT shrC) {
    for (MYITE i = 0; i < I; i++) {
        for (MYITE j = 0; j < J; j++) {
            float a = ((float) pgm_read_float_near(&A[i * J + j]));
            float b = ((float) pgm_read_float_near(&B[i * J + j]));

            float c = a + b;

            C[i * J + j] = c;
        }
    }
    return;
}

// C = a + B
inline __attribute__((always_inline)) void MatAddBroadCastA(float* A, float* B, float* C, MYINT I, MYINT J, MYINT shrA, MYINT shrB, MYINT shrC) {
    for (MYITE i = 0; i < I; i++) {
        for (MYITE j = 0; j < J; j++) {
            float a = *A;
            float b = B[i * J + j];

            float c = a + b;

            C[i * J + j] = c;
        }
    }
    return;
}

// C = A + b
inline __attribute__((always_inline)) void MatAddBroadCastB(float* A, float* B, float* C, MYINT I, MYINT J, MYINT shrA, MYINT shrB, MYINT shrC) {
    for (MYITE i = 0; i < I; i++) {
        for (MYITE j = 0; j < J; j++) {
            float a = A[i * J + j];
            float b = *B;

            float c = a + b;

            C[i * J + j] = c;
        }
    }
    return;
}

// C = A - B
inline __attribute__((always_inline)) void MatSub(float* A, const float* B, float* C, MYINT I, MYINT J, MYINT shrA, MYINT shrB, MYINT shrC) {
    for (MYITE i = 0; i < I; i++) {
        for (MYITE j = 0; j < J; j++) {
            float a = A[i * J + j];
            float b = ((float) pgm_read_float_near(&B[i * J + j]));

            float c = a - b;

            C[i * J + j] = c;
        }
    }
    return;
}

// C = a - B
inline __attribute__((always_inline)) void MatSubBroadCastA(float* A, float* B, float* C, MYINT I, MYINT J, MYINT shrA, MYINT shrB, MYINT shrC) {
    for (MYITE i = 0; i < I; i++) {
        for (MYITE j = 0; j < J; j++) {
            float a = *A;
            float b = B[i * J + j];

            float c = a - b;

            C[i * J + j] = c;
        }
    }
    return;
}

// C = A - b
inline __attribute__((always_inline)) void MatSubBroadCastB(float* A, float* B, float* C, MYINT I, MYINT J, MYINT shrA, MYINT shrB, MYINT shrC) {
    for (MYITE i = 0; i < I; i++) {
        for (MYITE j = 0; j < J; j++) {
            float a = A[i * J + j];
            float b = *B;

            float c = a - b;

            C[i * J + j] = c;
        }
    }
    return;
}
// C = A * B
inline __attribute__((always_inline)) void MatMulNN(float* A, float* B, float* C, float* tmp, MYINT I, MYINT K, MYINT J, MYINT shrA, MYINT shrB, MYINT H1, MYINT H2) {
    for (MYITE i = 0; i < I; i++) {
        for (MYITE j = 0; j < J; j++) {
            for (MYITE k = 0; k < K; k++) {
                float a = A[i * K + k];
                float b = B[k * J + j];

                tmp[k] = a * b;
            }

            MYITE count = K, depth = 0;

            while (depth < (H1 + H2)) {
                for (MYITE p = 0; p < (K / 2 + 1); p++) {
                    float sum;
                    if (p < (count >> 1)) {
                        sum = tmp[2 * p] + tmp[(2 * p) + 1];
                    } else if ((p == (count >> 1)) && ((count & 1) == 1)) {
                        sum = tmp[2 * p];
                    } else {
                        sum = 0;
                    }

                    // Unlike the fixed-point version, no shift is applied in float.
                    tmp[p] = sum;
                }
                count = (count + 1) >> 1;

                depth++;
            }

            C[i * J + j] = tmp[0];
        }
    }
    return;
}

// C = A * B
inline __attribute__((always_inline)) void MatMulCN(const float* A, float* B, float* C, float* tmp, MYINT I, MYINT K, MYINT J, MYINT shrA, MYINT shrB, MYINT H1, MYINT H2) {
    for (MYITE i = 0; i < I; i++) {
        for (MYITE j = 0; j < J; j++) {
            for (MYITE k = 0; k < K; k++) {
                float a = ((float) pgm_read_float_near(&A[i * K + k]));
                float b = B[k * J + j];

                tmp[k] = a * b;
            }

            MYITE count = K, depth = 0;

            while (depth < (H1 + H2)) {
                for (MYITE p = 0; p < (K / 2 + 1); p++) {
                    float sum;
                    if (p < (count >> 1)) {
                        sum = tmp[2 * p] + tmp[(2 * p) + 1];
                    } else if ((p == (count >> 1)) && ((count & 1) == 1)) {
                        sum = tmp[2 * p];
                    } else {
                        sum = 0;
                    }

                    // Unlike the fixed-point version, no shift is applied in float.
                    tmp[p] = sum;
                }
                count = (count + 1) >> 1;

                depth++;
            }

            C[i * J + j] = tmp[0];
        }
    }
    return;
}

// C = A * B
inline __attribute__((always_inline)) void MatMulNC(float* A, const float* B, float* C, float* tmp, MYINT I, MYINT K, MYINT J, MYINT shrA, MYINT shrB, MYINT H1, MYINT H2) {
    for (MYITE i = 0; i < I; i++) {
        for (MYITE j = 0; j < J; j++) {
            for (MYITE k = 0; k < K; k++) {
                float a = A[i * K + k];
                float b = ((float) pgm_read_float_near(&B[k * J + j]));

                tmp[k] = a * b;
            }

            MYITE count = K, depth = 0;

            while (depth < (H1 + H2)) {
                for (MYITE p = 0; p < (K / 2 + 1); p++) {
                    float sum;
                    if (p < (count >> 1)) {
                        sum = tmp[2 * p] + tmp[(2 * p) + 1];
                    } else if ((p == (count >> 1)) && ((count & 1) == 1)) {
                        sum = tmp[2 * p];
                    } else {
                        sum = 0;
                    }

                    // Unlike the fixed-point version, no shift is applied in float.
                    tmp[p] = sum;
                }
                count = (count + 1) >> 1;

                depth++;
            }

            C[i * J + j] = tmp[0];
        }
    }
    return;
}

// C = A * B
inline __attribute__((always_inline)) void MatMulCC(const float* A, const float* B, float* C, float* tmp, MYINT I, MYINT K, MYINT J, MYINT shrA, MYINT shrB, MYINT H1, MYINT H2) {
    for (MYITE i = 0; i < I; i++) {
        for (MYITE j = 0; j < J; j++) {
            for (MYITE k = 0; k < K; k++) {
                float a = ((float) pgm_read_float_near(&A[i * K + k]));
                float b = ((float) pgm_read_float_near(&B[k * J + j]));

                tmp[k] = a * b;
            }

            MYITE count = K, depth = 0;

            while (depth < (H1 + H2)) {
                for (MYITE p = 0; p < (K / 2 + 1); p++) {
                    float sum;
                    if (p < (count >> 1)) {
                        sum = tmp[2 * p] + tmp[(2 * p) + 1];
                    } else if ((p == (count >> 1)) && ((count & 1) == 1)) {
                        sum = tmp[2 * p];
                    } else {
                        sum = 0;
                    }

                    // Unlike the fixed-point version, no shift is applied in float.
                    tmp[p] = sum;
                }
                count = (count + 1) >> 1;

                depth++;
            }

            C[i * J + j] = tmp[0];
        }
    }
    return;
}
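The MatMul variants above sum the K elementwise products with a pairwise tree reduction (the `count`/`depth` loop) rather than a running accumulator; in the fixed-point version this bounds intermediate growth, and the float version keeps the same control flow. A minimal Python sketch of the same reduction (helper name `tree_sum` is hypothetical, not part of the library):

```python
def tree_sum(tmp):
    # Pairwise-reduce the list in place, mirroring the count/depth loop:
    # each pass halves the number of live partial sums (rounding up).
    count = len(tmp)
    while count > 1:
        for p in range((count + 1) // 2):
            if p < count // 2:
                tmp[p] = tmp[2 * p] + tmp[2 * p + 1]
            else:  # odd element left over at the end of this pass
                tmp[p] = tmp[2 * p]
        count = (count + 1) // 2
    return tmp[0]  # same result as sum(original list)
```

In the C code the number of passes is fixed up front as `H1 + H2` (ceil(log2 K)), so the generated loop bound is data-independent.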
// C = A |*| B
inline __attribute__((always_inline)) void SparseMatMulX(const MYINT* Aidx, const float* Aval, float* C, MYINT K, MYINT shrA, MYINT shrB, MYINT shrC) {
    MYITE ite_idx = 0, ite_val = 0;
    for (MYITE k = 0; k < K; k++) {
        float b = getFloatFeature(k);

        #ifdef INT16
        MYINT idx = ((MYINT) pgm_read_word_near(&Aidx[ite_idx]));
        #else
        MYINT idx = ((MYINT) pgm_read_dword_near(&Aidx[ite_idx]));
        #endif

        while (idx != 0) {
            float a = ((float) pgm_read_float_near(&Aval[ite_val]));

            float c = a * b;

            C[idx - 1] += c;

            ite_idx++;
            ite_val++;

            #ifdef INT16
            idx = ((MYINT) pgm_read_word_near(&Aidx[ite_idx]));
            #else
            idx = ((MYINT) pgm_read_dword_near(&Aidx[ite_idx]));
            #endif
        }
        ite_idx++;
    }

    return;
}

// C = A |*| B
inline __attribute__((always_inline)) void SparseMatMul(const MYINT* Aidx, const float* Aval, float* B, float* C, MYINT K, MYINT shrA, MYINT shrB, MYINT shrC) {
    MYITE ite_idx = 0, ite_val = 0;
    for (MYITE k = 0; k < K; k++) {
        float b = B[k];

        #ifdef INT16
        MYINT idx = ((MYINT) pgm_read_word_near(&Aidx[ite_idx]));
        #else
        MYINT idx = ((MYINT) pgm_read_dword_near(&Aidx[ite_idx]));
        #endif

        while (idx != 0) {
            float a = ((float) pgm_read_float_near(&Aval[ite_val]));

            float c = a * b;

            C[idx - 1] += c;

            ite_idx++;
            ite_val++;

            #ifdef INT16
            idx = ((MYINT) pgm_read_word_near(&Aidx[ite_idx]));
            #else
            idx = ((MYINT) pgm_read_dword_near(&Aidx[ite_idx]));
            #endif
        }
        ite_idx++;
    }

    return;
}
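The two `SparseMatMul` routines walk a column-wise sparse encoding: for each input element k, `Aidx` lists the 1-based output (row) indices of the nonzero `Aval` entries in that column, and a 0 entry terminates the column's run. A small Python sketch of the same traversal (function name `sparse_mat_vec` is hypothetical):

```python
def sparse_mat_vec(Aidx, Aval, B, rows):
    # Aidx: for each k, 1-based row indices of nonzeros, each column 0-terminated.
    # Aval: the nonzero values, in the same order as Aidx (terminators excluded).
    C = [0.0] * rows
    ite_idx = ite_val = 0
    for k in range(len(B)):
        b = B[k]
        while Aidx[ite_idx] != 0:
            C[Aidx[ite_idx] - 1] += Aval[ite_val] * b  # 1-based -> 0-based
            ite_idx += 1
            ite_val += 1
        ite_idx += 1  # skip this column's 0 terminator
    return C

# A = [[2, 0], [0, 3]] encoded column by column:
#   column 0 has a nonzero in row 1, column 1 in row 2.
result = sparse_mat_vec([1, 0, 2, 0], [2.0, 3.0], [10.0, 1.0], 2)
```

The 0 terminator is why the generated model re-indexes rows to be 1-based: index 0 is reserved as the end-of-column sentinel.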
// C = A <*> B
inline __attribute__((always_inline)) void MulCir(float* A, float* B, float* C, MYINT I, MYINT J, MYINT shrA, MYINT shrB) {
    for (MYITE i = 0; i < I; i++) {
        for (MYITE j = 0; j < J; j++) {
            float a = A[i * J + j];
            float b = B[i * J + j];

            C[i * J + j] = a * b;
        }
    }
    return;
}

// A = tanh(A)
inline __attribute__((always_inline)) void TanH(float* A, MYINT I, MYINT J, float tanh_limit) {
    for (MYITE i = 0; i < I; i++) {
        for (MYITE j = 0; j < J; j++) {
            float x = A[i * J + j], y;

            y = tanh(x);

            A[i * J + j] = y;
        }
    }
    return;
}

// index = argmax(A)
inline __attribute__((always_inline)) void ArgMax(float* A, MYINT I, MYINT J, MYINT* index) {
    float max = A[0];
    MYITE maxIndex = 0, counter = 0;
    for (MYITE i = 0; i < I; i++) {
        for (MYITE j = 0; j < J; j++) {
            float x = A[i * J + j];

            if (max < x) {
                maxIndex = counter;
                max = x;
            }

            counter++;
        }
    }

    *index = maxIndex;
    return;
}

// B = A^T
inline __attribute__((always_inline)) void Transpose(float* A, float* B, MYINT I, MYINT J) {
    for (MYITE i = 0; i < I; i++) {
        for (MYITE j = 0; j < J; j++) {
            B[i * J + j] = A[j * I + i];
        }
    }
    return;
}

// C = a * B
inline __attribute__((always_inline)) void ScalarMul(float* A, float* B, float* C, MYINT I, MYINT J, MYINT shrA, MYINT shrB) {
    float a = *A;

    for (MYITE i = 0; i < I; i++) {
        for (MYITE j = 0; j < J; j++) {
            float b = B[i * J + j];

            C[i * J + j] = a * b;
        }
    }

    return;
}
// C = A # B
// A[N][H][W][CI], B[HF][WF][CI][CO], C[N][H][W][CO]
inline __attribute__((always_inline)) void Conv(float* A, const float* B, float* C, float* tmp, MYINT N, MYINT H, MYINT W, MYINT CI, MYINT HF, MYINT WF, MYINT CO, MYINT shrA, MYINT shrB, MYINT H1, MYINT H2) {
    MYITE padH = (HF - 1) / 2;
    MYITE padW = (WF - 1) / 2;

    for (MYITE n = 0; n < N; n++) {
        for (MYITE h = 0; h < H; h++) {
            for (MYITE w = 0; w < W; w++) {
                for (MYITE co = 0; co < CO; co++) {

                    MYITE counter = 0;
                    for (MYITE hf = 0; hf < HF; hf++) {
                        for (MYITE wf = 0; wf < WF; wf++) {
                            for (MYITE ci = 0; ci < CI; ci++) {
                                float a = (((((h + hf) < padH) || ((h + hf) >= (H + padH))) || (((w + wf) < padW) || ((w + wf) >= (W + padW)))) ? 0 : A[n * H * W * CI + ((h + hf) - padH) * W * CI + ((w + wf) - padW) * CI + ci]);

                                float b = ((float) pgm_read_float_near(&B[hf * WF * CI * CO + wf * CI * CO + ci * CO + co]));

                                tmp[counter] = a * b;
                                counter++;
                            }
                        }
                    }

                    MYITE totalEle = HF * WF * CI;
                    MYITE count = HF * WF * CI, depth = 0;

                    while (depth < (H1 + H2)) {
                        for (MYITE p = 0; p < (totalEle / 2 + 1); p++) {
                            float sum;
                            if (p < (count >> 1)) {
                                sum = tmp[2 * p] + tmp[(2 * p) + 1];
                            } else if ((p == (count >> 1)) && ((count & 1) == 1)) {
                                sum = tmp[2 * p];
                            } else {
                                sum = 0;
                            }

                            // Unlike the fixed-point version, no shift is applied in float.
                            tmp[p] = sum;
                        }
                        count = (count + 1) >> 1;

                        depth++;
                    }

                    C[n * H * W * CO + h * W * CO + w * CO + co] = tmp[0];
                }
            }
        }
    }

    return;
}
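`Conv` implements 'same' zero padding implicitly: instead of materialising a padded input, the inner ternary treats any filter tap whose padded coordinate falls outside [padH, H + padH) (and likewise for width) as 0. A Python sketch of just that bounds check (helper name `tap` is hypothetical):

```python
def tap(A, H, W, h, w, hf, wf, padH, padW):
    # Return the input value under filter tap (hf, wf) at output position
    # (h, w), or 0.0 if the tap lands in the zero-padding border.
    if (h + hf) < padH or (h + hf) >= H + padH:
        return 0.0
    if (w + wf) < padW or (w + wf) >= W + padW:
        return 0.0
    return A[(h + hf - padH) * W + (w + wf - padW)]  # row-major H x W input

A = [1.0, 2.0, 3.0, 4.0]  # 2x2 input, single channel
corner = tap(A, 2, 2, 0, 0, 0, 0, 1, 1)  # top-left tap of a 3x3 filter: padding
centre = tap(A, 2, 2, 0, 0, 1, 1, 1, 1)  # centre tap: reads A[0][0]
```

This keeps the output the same spatial size as the input for odd filter sizes, without any extra memory for a padded copy.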
// A = A <+> B
// A[N][H][W][C], B[C]
inline __attribute__((always_inline)) void AddOrSubCir4D(float* A, const float* B, MYINT N, MYINT H, MYINT W, MYINT C, MYINT shrA, MYINT shrB, MYINT shrC, bool add) {
    for (MYITE n = 0; n < N; n++) {
        for (MYITE h = 0; h < H; h++) {
            for (MYITE w = 0; w < W; w++) {
                for (MYITE c = 0; c < C; c++) {
                    float a = A[n * H * W * C + h * W * C + w * C + c];

                    float b = ((float) pgm_read_float_near(&B[c]));

                    float res;
                    if (add) {
                        res = a + b;
                    } else {
                        res = a - b;
                    }

                    A[n * H * W * C + h * W * C + w * C + c] = res;
                }
            }
        }
    }

    return;
}

// A = A <+> B
// A[H][W], B[W]
inline __attribute__((always_inline)) void AddOrSubCir2D(float* A, const float* B, MYINT H, MYINT W, MYINT shrA, MYINT shrB, MYINT shrC, bool add) {
    for (MYITE h = 0; h < H; h++) {
        for (MYITE w = 0; w < W; w++) {
            float a = A[h * W + w];

            float b = ((float) pgm_read_float_near(&B[w]));

            float res;
            if (add) {
                res = a + b;
            } else {
                res = a - b;
            }

            A[h * W + w] = res;
        }
    }

    return;
}
// A = relu(A)
// A[N][H][W][C]
inline __attribute__((always_inline)) void Relu4D(float* A, MYINT N, MYINT H, MYINT W, MYINT C) {
    for (MYITE n = 0; n < N; n++) {
        for (MYITE h = 0; h < H; h++) {
            for (MYITE w = 0; w < W; w++) {
                for (MYITE c = 0; c < C; c++) {
                    float a = A[n * H * W * C + h * W * C + w * C + c];
                    if (a < 0) {
                        a = 0;
                    }

                    A[n * H * W * C + h * W * C + w * C + c] = a;
                }
            }
        }
    }

    return;
}

// A = relu(A)
// A[H][W]
inline __attribute__((always_inline)) void Relu2D(float* A, MYINT H, MYINT W) {
    for (MYITE h = 0; h < H; h++) {
        for (MYITE w = 0; w < W; w++) {
            float a = A[h * W + w];
            if (a < 0) {
                a = 0;
            }

            A[h * W + w] = a;
        }
    }

    return;
}

// B = maxpool(A)
// A[N][H][W][C], B[N][H/stride][W/stride][C]
inline __attribute__((always_inline)) void Maxpool(float* A, float* B, MYINT N, MYINT H, MYINT W, MYINT C, MYINT stride) {
    MYITE HO = H / stride;
    MYITE WO = W / stride;

    for (MYITE n = 0; n < N; n++) {
        for (MYITE ho = 0; ho < HO; ho++) {
            for (MYITE wo = 0; wo < WO; wo++) {
                for (MYITE c = 0; c < C; c++) {

                    float max = A[n * H * W * C + (stride * ho) * W * C + (stride * wo) * C + c];
                    for (MYITE hs = 0; hs < stride; hs++) {
                        for (MYITE ws = 0; ws < stride; ws++) {
                            float a = A[n * H * W * C + ((stride * ho) + hs) * W * C + ((stride * wo) + ws) * C + c];
                            if (a > max) {
                                max = a;
                            }
                        }
                    }

                    B[n * HO * WO * C + ho * WO * C + wo * C + c] = max;
                }
            }
        }
    }

    return;
}
// B = exp(A)
inline __attribute__((always_inline)) void Exp(float* A, MYINT I, MYINT J, MYINT shrA, MYINT shrB, float* B) {
    for (MYITE i = 0; i < I; i++) {
        for (MYITE j = 0; j < J; j++) {
            float x = A[i * J + j];

            B[i * J + j] = exp(x);
        }
    }

    return;
}

// A = sigmoid(A)
inline __attribute__((always_inline)) void Sigmoid(float* A, MYINT I, MYINT J, float div, float add, float sigmoid_limit, MYINT scale) {
    for (MYITE i = 0; i < I; i++) {
        for (MYITE j = 0; j < J; j++) {
            float x = A[i * J + j], y;

            y = 1 / (1 + exp(-x));

            A[i * J + j] = y;
        }
    }
    return;
}

// A = AdjustScaleShr(A)
inline __attribute__((always_inline)) void AdjustScaleShr(float* A, MYINT I, MYINT J, MYINT scale) {
    return;
}

// A = AdjustScaleShl(A)
inline __attribute__((always_inline)) void AdjustScaleShl(float* A, MYINT I, MYINT J, MYINT scale) {
    return;
}
File diff suppressed because one or more lines are too long
@@ -1,91 +0,0 @@
#include <Arduino.h>

#include "config.h"
#include "predict.h"
#include "library.h"
#include "model.h"

using namespace model;

const PROGMEM MYINT EXP13A[64] = {
    8192, 7695, 7229, 6791, 6379, 5993, 5630, 5289, 4968, 4667, 4384, 4119, 3869, 3635, 3414, 3208, 3013, 2831, 2659, 2498, 2347, 2204, 2071, 1945, 1827, 1717, 1613, 1515, 1423, 1337, 1256, 1180, 1108, 1041, 978, 919, 863, 811, 761, 715, 672, 631, 593, 557, 523, 491, 462, 434, 407, 383, 359, 338, 317, 298, 280, 263, 247, 232, 218, 205, 192, 180, 170, 159,
};
const PROGMEM MYINT EXP13B[64] = {
    15967, 15952, 15936, 15921, 15905, 15890, 15874, 15859, 15843, 15828, 15812, 15797, 15781, 15766, 15751, 15735, 15720, 15704, 15689, 15674, 15659, 15643, 15628, 15613, 15598, 15582, 15567, 15552, 15537, 15522, 15506, 15491, 15476, 15461, 15446, 15431, 15416, 15401, 15386, 15371, 15356, 15341, 15326, 15311, 15296, 15281, 15266, 15251, 15236, 15221, 15206, 15192, 15177, 15162, 15147, 15132, 15118, 15103, 15088, 15073, 15059, 15044, 15029, 15015,
};

int predict() {
    MYINT tmp4;
    MYINT tmp5[25][1];
    MYINT i;
    MYINT tmp6[25][1];
    MYINT tmp7;
    MYINT tmp8[1][25];
    MYINT tmp10[1][1];
    MYINT tmp9[25];
    MYINT tmp11[1][1];
    MYINT tmp15[1][1];
    MYINT tmp12;
    MYINT tmp13;
    MYINT tmp14;
    MYINT tmp17[10][1];
    MYINT tmp16[1];
    MYINT tmp18[10][1];
    MYINT tmp19;

    tmp4 = 16106;


    // W |*| X
    memset(tmp5, 0, sizeof(MYINT) * 25);
    SparseMatMul(&Widx[0], &Wval[0], &tmp5[0][0], 256, 128, 128, 8);

    memset(tmp18, 0, sizeof(MYINT) * 10);
    i = 0;
    for (MYINT i0 = 0; (i0 < 55); i0++) {

        // WX - B
        MatSub(&tmp5[0][0], &B[i][0][0], &tmp6[0][0], 25, 1, 1, 4, 1);

        tmp7 = (-tmp4);

        // del^T
        Transpose(&tmp6[0][0], &tmp8[0][0], 1, 25);


        // tmp8 * del
        MatMulNN(&tmp8[0][0], &tmp6[0][0], &tmp10[0][0], &tmp9[0], 1, 25, 1, 128, 64, 0, 5);


        // tmp7 * tmp10
        ScalarMul(&tmp7, &tmp10[0][0], &tmp11[0][0], 1, 1, 128, 128);


        // exp(tmp11)
        if (((-tmp11[0][0]) < 210)) {
            tmp13 = 0;
            tmp14 = 0;
        } else {
            tmp12 = (((-tmp11[0][0]) - 210) << 1);
            tmp13 = ((tmp12 >> 10) & 63);
            tmp14 = ((tmp12 >> 4) & 63);
        }
        tmp15[0][0] = ((((MYINT) pgm_read_word_near(&EXP13A[tmp13])) >> 7) * (((MYINT) pgm_read_word_near(&EXP13B[tmp14])) >> 7));

        // Z * tmp15
        MatMulCN(&Z[i][0][0], &tmp15[0][0], &tmp17[0][0], &tmp16[0], 10, 1, 1, 128, 128, 0, 0);

        for (MYINT i1 = 0; (i1 < 10); i1++) {
            for (MYINT i2 = 0; (i2 < 1); i2++) {
                tmp18[i1][i2] = (tmp18[i1][i2] + (tmp17[i1][i2] / 32));
            }
        }
        i = (i + 1);
    }

    // argmax(res)
    ArgMax(&tmp18[0][0], 10, 1, &tmp19);


    return tmp19;
}
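The generated `predict()` above approximates `exp` with two 64-entry PROGMEM tables: the shifted fixed-point argument is split into a high 6-bit index into EXP13A and a low 6-bit index into EXP13B, and the two looked-up factors are multiplied, relying on exp(a + b) = exp(a) * exp(b). A hedged float sketch of the indexing idea only (table scaling simplified; `STEP` and the helper name are hypothetical, not the compiler's actual scale):

```python
import math

# One fine-table step covers STEP in argument space; one coarse-table step
# covers 64 * STEP, so exp(-(hi*64 + lo)*STEP) = EXP_HI[hi] * EXP_LO[lo].
STEP = 0.01  # hypothetical quantum, for illustration only
EXP_HI = [math.exp(-i * 64 * STEP) for i in range(64)]
EXP_LO = [math.exp(-i * STEP) for i in range(64)]

def exp_neg(n):
    # n: non-negative integer argument in fine-table units, n < 4096.
    hi = (n >> 6) & 63  # analogous to tmp13 = (tmp12 >> 10) & 63
    lo = n & 63         # analogous to tmp14 = (tmp12 >> 4) & 63
    return EXP_HI[hi] * EXP_LO[lo]
```

Two 64-entry tables cover a 12-bit argument range that a direct table would need 4096 entries for, which is what makes the approach fit in flash on an Arduino.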
@@ -4,5 +4,5 @@
#pragma once

int predict();
MYINT getIntFeature(MYINT);
int32_t getIntFeature(MYITE);
float getFloatFeature(MYINT);
@@ -1,48 +0,0 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT license.

# Target word length. Currently set to match the word length of Arduino (2
# bytes)
wordLength = 16

inputFileType = "npy"

# Range of max scale factor used for exploration
maxScaleRange = 0, -wordLength

# tanh approximation limit
tanh_limit = 1.0

# MSBuild location
# Edit the path if not present at the following location
msbuildPathOptions = [r"C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\MSBuild\15.0\Bin\MSBuild.exe",
                      r"C:\Program Files (x86)\Microsoft Visual Studio\2017\Enterprise\MSBuild\15.0\Bin\MSBuild.exe",
                      r"C:\Program Files (x86)\Microsoft Visual Studio\2017\Professional\MSBuild\15.0\Bin\MSBuild.exe"
                      ]


class Algo:
    Bonsai = "bonsai"
    Protonn = "protonn"
    Default = [Bonsai, Protonn]
    All = [Bonsai, Protonn]


class Version:
    Fixed = "fixed"
    Float = "float"
    All = [Fixed, Float]


class DatasetType:
    Training = "training"
    Testing = "testing"
    Default = Testing
    All = [Training, Testing]


class Target:
    Arduino = "arduino"
    X86 = "x86"
    Default = Arduino
    All = [Arduino, X86]
@@ -0,0 +1 @@
output/
@@ -0,0 +1,6 @@
models/
debug/
*.ezpc
*.cpp
*.h
*.npy
@@ -0,0 +1,916 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT license.

import sys
from numbers import Number

import seedot.compiler.ast.ast as AST
from onnx import mapping
from onnx import TensorProto
import numpy as np

# We use operations from this.
from seedot.compiler.antlr.seedotParser import seedotParser as SeeDotParser

DEBUG = False
# Referenced below when decoding string attributes; not defined elsewhere in this file.
IS_PYTHON3 = sys.version_info >= (3, 0)
out_var_prefix = 'J'


class OnnxNode(object):
    """
    Re-implementation of NodeProto from ONNX, but in a form
    more convenient to work with from Python.
    """

    def __init__(self, node):
        self.name = str(node.name)
        self.op_type = str(node.op_type)
        self.domain = str(node.domain)
        self.attrs = dict([(attr.name,
                            translate_onnx(attr.name, convert_onnx(attr)))
                           for attr in node.attribute])
        self.inputs = list(node.input)
        self.outputs = list(node.output)
        self.node_proto = node


__onnx_attr_translator = {
    "axis": lambda x: int(x),
    "axes": lambda x: [int(a) for a in x],
    "dtype": lambda x: onnx2seedot(x),
    "keepdims": lambda x: bool(x),
    "to": lambda x: onnx2seedot(x),
}


def convert_onnx(attr):
    return __convert_onnx_attribute_proto(attr)


def __convert_onnx_attribute_proto(attr_proto):
    """
    Convert an ONNX AttributeProto into an appropriate Python object
    for the type.
    NB: Tensor attribute gets returned as the straight proto.
    """
    if attr_proto.HasField('f'):
        return attr_proto.f
    elif attr_proto.HasField('i'):
        return attr_proto.i
    elif attr_proto.HasField('s'):
        return str(attr_proto.s, 'utf-8')
    elif attr_proto.HasField('t'):
        return attr_proto.t  # This is a proto!
    elif attr_proto.HasField('g'):
        return attr_proto.g
    elif attr_proto.floats:
        return list(attr_proto.floats)
    elif attr_proto.ints:
        return list(attr_proto.ints)
    elif attr_proto.strings:
        str_list = list(attr_proto.strings)
        if IS_PYTHON3:
            str_list = list(map(lambda x: str(x, 'utf-8'), str_list))
        return str_list
    elif attr_proto.HasField('sparse_tensor'):
        return attr_proto.sparse_tensor
    else:
        raise ValueError("Unsupported ONNX attribute: {}".format(attr_proto))


def translate_onnx(key, val):
    return __onnx_attr_translator.get(key, lambda x: x)(val)


def onnx2seedot(dtype):
    return TENSOR_TYPE_TO_SEEDOT_TYPE[_onnx_dtype(dtype)]


def _onnx_dtype(dtype):
    if isinstance(dtype, Number):
        onnx_dtype = dtype
    elif isinstance(dtype, str):
        onnx_dtype = TensorProto.DataType.Value(dtype)
    else:
        raise RuntimeError("dtype should be number or str.")
    return onnx_dtype


TENSOR_TYPE_TO_SEEDOT_TYPE = {
    int(TensorProto.FLOAT): 'float32',
    int(TensorProto.UINT8): 'uint8',
    int(TensorProto.INT8): 'int8',
    int(TensorProto.UINT16): 'uint16',
    int(TensorProto.INT16): 'int16',
    int(TensorProto.INT32): 'int32',
    int(TensorProto.INT64): 'int64',
    int(TensorProto.BOOL): 'bool',
    int(TensorProto.FLOAT16): 'float16',
    int(TensorProto.DOUBLE): 'float64',
    int(TensorProto.COMPLEX64): 'complex64',
    int(TensorProto.COMPLEX128): 'complex128',
    int(TensorProto.UINT32): 'uint32',
    int(TensorProto.UINT64): 'uint64',
    int(TensorProto.STRING): 'string'
}


def getOperatorsIdx(token):
    # TODO: Remove usage of this.
    return AST.Operators.convSymbolToEnumValue(token)


def get_seedot_shape_order(old_shape):
    if (len(old_shape) == 4):
        # Case when spatial dimension is 2.
        # Inverse of [1, 3, 4, 2] is [1, 4, 2, 3].
        # return ([old_shape[0], old_shape[2], old_shape[3], old_shape[1]], [1, 4, 2, 3])
        return ([old_shape[0], old_shape[2], old_shape[3], old_shape[1]], [1, 3, 4, 2])
    else:
        # Case when spatial dimension is 3.
        # Inverse of [1, 3, 4, 5, 2] is [1, 5, 2, 3, 4].
        return ([old_shape[0], old_shape[2], old_shape[3], old_shape[4], old_shape[1]], [1, 3, 4, 5, 2])


def get_seedot_filter_shape_order(filter_shape):
    if (len(filter_shape) == 4):
        # Case when spatial dimension is 2.
        # Inverse of [3, 4, 2, 1] is [4, 3, 1, 2].
        # return ([filter_shape[2], filter_shape[3], filter_shape[1], filter_shape[0]], [4, 3, 1, 2])
        return ([1, filter_shape[2], filter_shape[3], filter_shape[1], filter_shape[0]], [3, 4, 2, 1])
    else:
        # Case when spatial dimension is 3.
        # Inverse of [3, 4, 5, 2, 1] is [5, 4, 1, 2, 3].
        return ([1, filter_shape[2], filter_shape[3], filter_shape[4], filter_shape[1], filter_shape[0]], [3, 4, 5, 2, 1])


def get_onnx_order(onnx_shape):
    if (len(onnx_shape) == 4):
        # Inverse of [1, 4, 2, 3] is [1, 3, 4, 2].
        # return [1, 3, 4, 2]
        return [1, 4, 2, 3]
    else:
        # Inverse of [1, 5, 2, 3, 4] is [1, 3, 4, 5, 2].
        return [1, 5, 2, 3, 4]
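The shape-order helpers above convert between ONNX's NCHW layout and SeeDot's NHWC layout; the returned orders are 1-based axis permutations. A quick self-contained check of the 2D-spatial case (helper name `permute_shape` is hypothetical):

```python
def permute_shape(shape, order_1based):
    # Apply a SeeDot-style 1-based axis order to a shape tuple.
    return tuple(shape[i - 1] for i in order_1based)

nchw = (1, 3, 8, 8)                       # ONNX: N, C, H, W
nhwc = permute_shape(nchw, [1, 3, 4, 2])  # order from get_seedot_shape_order
# nhwc == (1, 8, 8, 3): SeeDot's N, H, W, C

# get_onnx_order returns the inverse permutation to map back:
restored = permute_shape(nhwc, [1, 4, 2, 3])
```

Keeping the permutation and its inverse paired this way is what lets the converter wrap each operator's input and output in matching `AST.Reshape` nodes below.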
def get_reshaped_input_ast(input_name, value_info, node_name_to_out_var_dict):
|
||||
onnx_input_shape = list(value_info[input_name][1])
|
||||
(seedot_input_shape, seedot_input_order) = get_seedot_shape_order(onnx_input_shape)
|
||||
return AST.Reshape(AST.ID(node_name_to_out_var_dict[input_name]), seedot_input_shape, seedot_input_order)
|
||||
|
||||
def get_reshaped_bias_ast(bias_name, value_info, node_name_to_out_var_dict, dim):
|
||||
if(dim == 2):
|
||||
return AST.Reshape(AST.ID(node_name_to_out_var_dict[bias_name]), [1, 1, 1, value_info[bias_name][1][0]], None)
|
||||
else:
|
||||
return AST.Reshape(AST.ID(node_name_to_out_var_dict[bias_name]), [1, 1, 1, 1, value_info[bias_name][1][0]], None)
|
||||
|
||||
def get_reshaped_filter_ast(filter_name, value_info, node_name_to_out_var_dict):
|
||||
onnx_filter_shape = list(value_info[filter_name][1])
|
||||
(seedot_filter_shape, seedot_filter_order) = get_seedot_filter_shape_order(onnx_filter_shape)
|
||||
return AST.Reshape(AST.ID(node_name_to_out_var_dict[filter_name]), seedot_filter_shape, seedot_filter_order)
|
||||
|
||||
def get_reshaped_output_ast(onnx_output_name, value_info, output_name):
|
||||
onnx_output_shape = list(value_info[onnx_output_name][1])
|
||||
onnx_output_order = get_onnx_order(onnx_output_shape)
|
||||
return AST.Reshape(AST.ID(output_name), onnx_output_shape, onnx_output_order)
|
||||
|
||||
def get_new_var_name(out_var_count):
|
||||
return out_var_prefix + str(out_var_count)
|
||||
|
||||
def update_program_with_new_node(innermost_let_ast_node, new_node, new_node_name, mtdAST):
|
||||
cur_out_var_ast_node = AST.ID(new_node_name)
|
||||
new_let_node = AST.Let(new_node_name, new_node, cur_out_var_ast_node)
|
||||
# mtdAST.visit(new_let_node, {AST.ASTNode.mtdKeyTFOpName : 'no', AST.ASTNode.mtdKeyTFNodeName : 'no'})
|
||||
# Updating the innermost Let AST node and the expression for previous Let Node.
|
||||
innermost_let_ast_node.expr = new_let_node
|
||||
innermost_let_ast_node = new_let_node
|
||||
|
||||
# node_name_to_out_var_dict[node.outputs[0]] = new_node_name
|
||||
return innermost_let_ast_node
|
||||
|
||||
|
||||
class ONNXNodesAST:
|
||||
|
||||
# value_info: dictionary of name -> (type, dimension tuple)
|
||||
def Input(node, value_info, node_name_to_out_var_dict, init_val=None):
|
||||
if(DEBUG):
|
||||
print(node.outputs[0])
|
||||
# There are two types of inputs.
|
||||
dims = list(node.dims if hasattr(node, 'dims') else ([val.dim_value for val in node.type.tensor_type.shape.dim]))
|
||||
data_type = node.data_type if hasattr (node, 'data_type') else node.type.tensor_type.elem_type
|
||||
# return AST.Input(dims, onnx2seedot(data_type))
|
||||
|
||||
from onnx import numpy_helper
|
||||
range = (-3,3)
|
||||
|
||||
if init_val is not None:
|
||||
arr = numpy_helper.to_array(init_val)
|
||||
range = (np.min(arr),np.max(arr))
|
||||
|
||||
return AST.Decl(dims, range)
|
||||
|
||||
    def Cast(node, value_info, node_name_to_out_var_dict, innermost_let_ast_node, out_var_count, mtdAST):
        node = OnnxNode(node)
        if DEBUG:
            print(node)
        inputsRef = node.inputs
        assert len(inputsRef) == 1
        # Cast is treated as a no-op: the output is aliased to the input.
        # destType = node.attrs['to']

        # seedot_output_ast = AST.UninterpFuncCall(value_info[node.outputs[0]][1],
        #                                          'Cast',
        #                                          [AST.ID(inputsRef[0]),
        #                                           AST.ID(destType),
        #                                           AST.ID(destType)
        #                                          ])
        # output_name = get_new_var_name(out_var_count)
        # innermost_let_ast_node = update_program_with_new_node(innermost_let_ast_node, seedot_output_ast, output_name, mtdAST)
        # out_var_count += 1
        node_name_to_out_var_dict[node.outputs[0]] = inputsRef[0]

        return (innermost_let_ast_node, out_var_count)

    def Squeeze(node, value_info, node_name_to_out_var_dict, innermost_let_ast_node, out_var_count, mtdAST):
        node = OnnxNode(node)
        if DEBUG:
            print(node)
        inputsRef = node.inputs
        assert len(inputsRef) == 1
        node_name_to_out_var_dict[node.outputs[0]] = inputsRef[0]

        return (innermost_let_ast_node, out_var_count)

    def Unsqueeze(node, value_info, node_name_to_out_var_dict, innermost_let_ast_node, out_var_count, mtdAST):
        node = OnnxNode(node)
        if DEBUG:
            print(node)
        inputsRef = node.inputs
        assert len(inputsRef) == 1
        node_name_to_out_var_dict[node.outputs[0]] = inputsRef[0]

        return (innermost_let_ast_node, out_var_count)

    def Shape(node, value_info, node_name_to_out_var_dict, innermost_let_ast_node, out_var_count, mtdAST):
        node = OnnxNode(node)
        if DEBUG:
            print(node)
        inputsRef = node.inputs
        assert len(inputsRef) == 1
        node_name_to_out_var_dict[node.outputs[0]] = inputsRef[0]

        return (innermost_let_ast_node, out_var_count)

    def Gather(node, value_info, node_name_to_out_var_dict, innermost_let_ast_node, out_var_count, mtdAST):
        node = OnnxNode(node)
        if DEBUG:
            print(node)
        inputsRef = node.inputs
        assert len(inputsRef) == 2
        node_name_to_out_var_dict[node.outputs[0]] = inputsRef[0]

        return (innermost_let_ast_node, out_var_count)

    def Slice(node, value_info, node_name_to_out_var_dict, innermost_let_ast_node, out_var_count, mtdAST):
        node = OnnxNode(node)
        if DEBUG:
            print(node)
        inputsRef = node.inputs
        assert len(inputsRef) == 4
        node_name_to_out_var_dict[node.outputs[0]] = inputsRef[0]

        return (innermost_let_ast_node, out_var_count)

    def ConstantOfShape(node, value_info, node_name_to_out_var_dict, innermost_let_ast_node, out_var_count, mtdAST):
        node = OnnxNode(node)
        if DEBUG:
            print(node)
        inputsRef = node.inputs
        assert len(inputsRef) == 1
        node_name_to_out_var_dict[node.outputs[0]] = inputsRef[0]

        return (innermost_let_ast_node, out_var_count)

    def Pad(node, value_info, node_name_to_out_var_dict, innermost_let_ast_node, out_var_count, mtdAST):
        node = OnnxNode(node)
        if DEBUG:
            print(node)
        inputsRef = node.inputs
        # Skip the constant_value input (the last input).
        inpLen = len(inputsRef) - 1
        assert inpLen == 2
        inputs = [AST.ID(node_name_to_out_var_dict[inputsRef[x]]) for x in range(0, inpLen)]
        mode = node.attrs['mode']
        assert mode == 'constant'
        seedot_output_ast = AST.UninterpFuncCall(list(value_info[node.outputs[0]][1]),
                                                 'PadONNX', inputs)

        output_name = get_new_var_name(out_var_count)
        innermost_let_ast_node = update_program_with_new_node(innermost_let_ast_node, seedot_output_ast, output_name, mtdAST)
        out_var_count += 1

        node_name_to_out_var_dict[node.outputs[0]] = output_name

        return (innermost_let_ast_node, out_var_count)

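Only constant-mode padding is lowered above. The ONNX `pads` attribute lists all leading pads followed by all trailing pads: for an N-dimensional tensor, `pads = [begin_0, ..., begin_{N-1}, end_0, ..., end_{N-1}]`. A sketch of that layout using numpy's per-axis `(begin, end)` pairs (`onnx_constant_pad` is an illustrative helper, not part of the real module):

```python
import numpy as np

def onnx_constant_pad(x, pads, value=0):
    # ONNX flattens pads as [begin_0, ..., begin_{n-1}, end_0, ..., end_{n-1}];
    # numpy wants [(begin_0, end_0), ..., (begin_{n-1}, end_{n-1})].
    n = x.ndim
    assert len(pads) == 2 * n
    pad_width = [(pads[i], pads[n + i]) for i in range(n)]
    return np.pad(x, pad_width, mode='constant', constant_values=value)
```

For example, a (2, 3) tensor with `pads = [1, 0, 1, 2]` pads one row before, one row after, and two columns after, giving shape (4, 5).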
    def Concat(node, value_info, node_name_to_out_var_dict, innermost_let_ast_node, out_var_count, mtdAST):
        node = OnnxNode(node)
        if DEBUG:
            print(node)
        inputsRef = node.inputs
        N = len(inputsRef)

        inputs = [AST.ID(node_name_to_out_var_dict[inputsRef[x]]) for x in range(0, N)]
        axis = node.attrs['axis']

        seedot_output_ast = AST.UninterpFuncCall(list(value_info[node.outputs[0]][1]),
                                                 'Concat' + str(N) + 'T',
                                                 inputs + [AST.Int(axis, 32, False)],
                                                 outputDiffInpDims=1)

        output_name = get_new_var_name(out_var_count)
        innermost_let_ast_node = update_program_with_new_node(innermost_let_ast_node, seedot_output_ast, output_name, mtdAST)
        out_var_count += 1

        node_name_to_out_var_dict[node.outputs[0]] = output_name

        return (innermost_let_ast_node, out_var_count)

    def Relu(node, value_info, node_name_to_out_var_dict, innermost_let_ast_node, out_var_count, mtdAST):
        node = OnnxNode(node)

        inputsRef = node.inputs
        assert len(inputsRef) == 1

        spatial_size = len(value_info[inputsRef[0]][1])

        relu_input_name = node_name_to_out_var_dict[inputsRef[0]]
        if spatial_size >= 4:
            relu_input_name = get_new_var_name(out_var_count)
            reshaped_input = get_reshaped_input_ast(inputsRef[0], value_info, node_name_to_out_var_dict)
            innermost_let_ast_node = update_program_with_new_node(innermost_let_ast_node, reshaped_input, relu_input_name, mtdAST)
            out_var_count += 1

        seedot_output_ast = AST.Func(SeeDotParser.RELU, AST.ID(relu_input_name))
        output_name = get_new_var_name(out_var_count)
        innermost_let_ast_node = update_program_with_new_node(innermost_let_ast_node, seedot_output_ast, output_name, mtdAST)
        out_var_count += 1

        final_output_name = output_name
        if spatial_size >= 4:
            final_output_name = get_new_var_name(out_var_count)
            onnx_output_ast = get_reshaped_output_ast(node.outputs[0], value_info, output_name)
            innermost_let_ast_node = update_program_with_new_node(innermost_let_ast_node, onnx_output_ast, final_output_name, mtdAST)
            out_var_count += 1

        node_name_to_out_var_dict[node.outputs[0]] = final_output_name

        if DEBUG:
            print(node.outputs[0])
            print(value_info[inputsRef[0]][1], '->', value_info[node.outputs[0]][1])

        return (innermost_let_ast_node, out_var_count)

    def Add(node, value_info, node_name_to_out_var_dict, innermost_let_ast_node, out_var_count, mtdAST):
        node = OnnxNode(node)
        if DEBUG:
            print(node)
        inputsRef = node.inputs
        assert len(inputsRef) == 2

        reshaped_input_name = get_new_var_name(out_var_count)
        reshaped_input = get_reshaped_input_ast(inputsRef[0], value_info, node_name_to_out_var_dict)
        innermost_let_ast_node = update_program_with_new_node(innermost_let_ast_node, reshaped_input, reshaped_input_name, mtdAST)
        out_var_count += 1

        reshaped_input_name1 = get_new_var_name(out_var_count)
        reshaped_input1 = get_reshaped_input_ast(inputsRef[1], value_info, node_name_to_out_var_dict)
        innermost_let_ast_node = update_program_with_new_node(innermost_let_ast_node, reshaped_input1, reshaped_input_name1, mtdAST)
        out_var_count += 1

        seedot_output_ast = AST.BOp(AST.ID(reshaped_input_name),
                                    getOperatorsIdx('+'),
                                    AST.ID(reshaped_input_name1))
        output_name = get_new_var_name(out_var_count)
        innermost_let_ast_node = update_program_with_new_node(innermost_let_ast_node, seedot_output_ast, output_name, mtdAST)
        out_var_count += 1

        reshaped_output_name = get_new_var_name(out_var_count)
        onnx_output_ast = get_reshaped_output_ast(node.outputs[0], value_info, output_name)
        innermost_let_ast_node = update_program_with_new_node(innermost_let_ast_node, onnx_output_ast, reshaped_output_name, mtdAST)
        out_var_count += 1
        node_name_to_out_var_dict[node.outputs[0]] = reshaped_output_name

        if DEBUG:
            print(node.outputs[0])
            print(value_info[inputsRef[0]][1], value_info[inputsRef[1]][1], '->', value_info[node.outputs[0]][1])

        return (innermost_let_ast_node, out_var_count)

    def Gemm(node, value_info, node_name_to_out_var_dict, innermost_let_ast_node, out_var_count, mtdAST):
        node = OnnxNode(node)
        if DEBUG:
            print(node)
        inputsRef = node.inputs
        assert len(inputsRef) == 3
        input1AST = AST.ID(node_name_to_out_var_dict[inputsRef[0]])
        input2AST = AST.ID(node_name_to_out_var_dict[inputsRef[1]])

        if 'transA' in node.attrs and node.attrs['transA']:
            input1AST = AST.Transp(input1AST)
        if 'transB' in node.attrs and node.attrs['transB']:
            input2AST = AST.Transp(input2AST)

        # W * x + b
        seedot_output_ast = AST.Bop1(AST.Bop1(input1AST, SeeDotParser.MUL, input2AST), SeeDotParser.ADDCIR, AST.ID(node_name_to_out_var_dict[inputsRef[2]]))
        output_name = get_new_var_name(out_var_count)
        innermost_let_ast_node = update_program_with_new_node(innermost_let_ast_node, seedot_output_ast, output_name, mtdAST)
        out_var_count += 1

        node_name_to_out_var_dict[node.outputs[0]] = output_name

        return (innermost_let_ast_node, out_var_count)

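`Gemm` above lowers to a multiply followed by a broadcast bias add, honoring `transA`/`transB`. A numpy sketch of the semantics being translated, assuming `alpha = beta = 1` as the lowering above does (`gemm_reference` is an illustrative helper, not part of the real module):

```python
import numpy as np

def gemm_reference(A, B, C, transA=0, transB=0):
    # Matches the lowering above: (A' * B') + C with optional transposes
    # and alpha = beta = 1.
    if transA:
        A = A.T
    if transB:
        B = B.T
    return A @ B + C
```

The bias `C` broadcasts along the rows of the product, which is what the `ADDCIR` operator expresses in SeeDot.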
    def ArgMax(node, value_info, node_name_to_out_var_dict, innermost_let_ast_node, out_var_count, mtdAST):
        node = OnnxNode(node)
        if DEBUG:
            print(node)
        inputsRef = node.inputs

        seedot_output_ast = AST.Func(SeeDotParser.ARGMAX, AST.ID(node_name_to_out_var_dict[inputsRef[0]]))
        output_name = get_new_var_name(out_var_count)
        innermost_let_ast_node = update_program_with_new_node(innermost_let_ast_node, seedot_output_ast, output_name, mtdAST)
        out_var_count += 1

        node_name_to_out_var_dict[node.outputs[0]] = output_name

        return (innermost_let_ast_node, out_var_count)

    def Constant(node, value_info, node_name_to_out_var_dict, innermost_let_ast_node, out_var_count, mtdAST):
        node = OnnxNode(node)
        if DEBUG:
            print(node)
        # TODO: Use AST.Decl to define the tensor. Constants used only as Reshape
        # parameters do not need a declaration, so this is a no-op for now.
        return (innermost_let_ast_node, out_var_count)

    def Transpose(node, value_info, node_name_to_out_var_dict, innermost_let_ast_node, out_var_count, mtdAST):
        node = OnnxNode(node)
        if DEBUG:
            print(node)

        inputsRef = node.inputs
        assert len(inputsRef) == 1

        seedot_output_ast = AST.Transpose(AST.ID(node_name_to_out_var_dict[inputsRef[0]]), node.attrs['perm'])
        output_name = get_new_var_name(out_var_count)
        innermost_let_ast_node = update_program_with_new_node(innermost_let_ast_node, seedot_output_ast, output_name, mtdAST)
        out_var_count += 1
        node_name_to_out_var_dict[node.outputs[0]] = output_name

        return (innermost_let_ast_node, out_var_count)

    # Only supports splitting into equal parts.
    def Split(node, value_info, node_name_to_out_var_dict, innermost_let_ast_node, out_var_count, mtdAST):
        node = OnnxNode(node)
        inputsRef = node.inputs
        output_count = len(node.outputs)

        for cur_count in range(output_count):
            seedot_output_ast = AST.UninterpFuncCall(list(value_info[node.outputs[cur_count]][1]), 'Split',
                                                     [AST.ID(node_name_to_out_var_dict[inputsRef[0]]), AST.Int(node.attrs['axis'], 32, False), AST.Int(cur_count, 32, False), AST.Int(output_count, 32, False)])
            output_name = get_new_var_name(out_var_count)
            innermost_let_ast_node = update_program_with_new_node(innermost_let_ast_node, seedot_output_ast, output_name, mtdAST)
            out_var_count += 1
            node_name_to_out_var_dict[node.outputs[cur_count]] = output_name

        return (innermost_let_ast_node, out_var_count)

    def ReduceMean(node, value_info, node_name_to_out_var_dict, innermost_let_ast_node, out_var_count, mtdAST):
        node = OnnxNode(node)
        inputsRef = node.inputs

        keepdims = node.attrs['keepdims']
        axes = node.attrs['axes']

        # Currently only keepdims == 0 with exactly two reduction axes is handled.
        assert keepdims == 0
        assert len(axes) == 2

        seedot_output_ast = AST.UninterpFuncCall(value_info[node.outputs[0]][1], 'ReduceMeanO',
                                                 [AST.ID(node_name_to_out_var_dict[inputsRef[0]]), AST.Int(axes[0], 32, False), AST.Int(axes[1], 32, False)])
        output_name = get_new_var_name(out_var_count)
        innermost_let_ast_node = update_program_with_new_node(innermost_let_ast_node, seedot_output_ast, output_name, mtdAST)
        out_var_count += 1
        node_name_to_out_var_dict[node.outputs[0]] = output_name
        return (innermost_let_ast_node, out_var_count)

    def BatchNormalization(node, value_info, node_name_to_out_var_dict, innermost_let_ast_node, out_var_count, mtdAST):
        node = OnnxNode(node)

        inputsRef = node.inputs
        # Inputs 3 and 4 (running mean and variance) are currently unused.
        assert len(inputsRef) == 5

        reshaped_input_name = get_new_var_name(out_var_count)
        reshaped_input = get_reshaped_input_ast(inputsRef[0], value_info, node_name_to_out_var_dict)
        innermost_let_ast_node = update_program_with_new_node(innermost_let_ast_node, reshaped_input, reshaped_input_name, mtdAST)
        out_var_count += 1

        seedot_output_ast = AST.FusedBatchNorm(AST.ID(reshaped_input_name),
                                               AST.ID(node_name_to_out_var_dict[inputsRef[1]]),
                                               AST.ID(node_name_to_out_var_dict[inputsRef[2]]))
        output_name = get_new_var_name(out_var_count)
        innermost_let_ast_node = update_program_with_new_node(innermost_let_ast_node, seedot_output_ast, output_name, mtdAST)
        out_var_count += 1

        reshaped_output_name = get_new_var_name(out_var_count)
        onnx_output_ast = get_reshaped_output_ast(node.outputs[0], value_info, output_name)
        innermost_let_ast_node = update_program_with_new_node(innermost_let_ast_node, onnx_output_ast, reshaped_output_name, mtdAST)
        out_var_count += 1
        node_name_to_out_var_dict[node.outputs[0]] = reshaped_output_name

        if DEBUG:
            print(node.outputs[0])
            print(value_info[inputsRef[0]][1], '->', value_info[node.outputs[0]][1])

        return (innermost_let_ast_node, out_var_count)

    def Reshape(node, value_info, node_name_to_out_var_dict, innermost_let_ast_node, out_var_count, mtdAST):
        node = OnnxNode(node)
        if DEBUG:
            print(node)

        inputsRef = node.inputs
        assert len(inputsRef) == 2
        # print(list(value_info[node.outputs[0]][1]))

        seedot_output_ast = AST.Reshape(AST.ID(node_name_to_out_var_dict[inputsRef[0]]), list(value_info[node.outputs[0]][1]), None)
        output_name = get_new_var_name(out_var_count)
        innermost_let_ast_node = update_program_with_new_node(innermost_let_ast_node, seedot_output_ast, output_name, mtdAST)
        out_var_count += 1
        node_name_to_out_var_dict[node.outputs[0]] = output_name

        return (innermost_let_ast_node, out_var_count)

    def Flatten(node, value_info, node_name_to_out_var_dict, innermost_let_ast_node, out_var_count, mtdAST):
        node = OnnxNode(node)
        if DEBUG:
            print(node)

        inputsRef = node.inputs
        assert len(inputsRef) == 1

        seedot_output_ast = AST.Reshape(AST.ID(node_name_to_out_var_dict[inputsRef[0]]), list(value_info[node.outputs[0]][1]), None)
        output_name = get_new_var_name(out_var_count)
        innermost_let_ast_node = update_program_with_new_node(innermost_let_ast_node, seedot_output_ast, output_name, mtdAST)
        out_var_count += 1
        node_name_to_out_var_dict[node.outputs[0]] = output_name

        return (innermost_let_ast_node, out_var_count)

    def Conv(node, value_info, node_name_to_out_var_dict, innermost_let_ast_node, out_var_count, mtdAST):
        node = OnnxNode(node)
        if DEBUG:
            print(node)

        inputsRef = node.inputs
        # The first two dimensions are N (batch size) and CI (input channels).
        inputShape = value_info[inputsRef[0]][1]
        spatial_size = len(inputShape) - 2

        if spatial_size == 2:
            (innermost_let_ast_node, out_var_count, output_name) = ONNXNodesAST.conv2d(node, value_info, node_name_to_out_var_dict, innermost_let_ast_node, out_var_count, mtdAST)
        elif spatial_size == 3:
            (innermost_let_ast_node, out_var_count, output_name) = ONNXNodesAST.conv3d(node, value_info, node_name_to_out_var_dict, innermost_let_ast_node, out_var_count, mtdAST)
        else:
            assert False, "Only 2D and 3D convolutions are supported."

        reshaped_output_name = get_new_var_name(out_var_count)
        onnx_output_ast = get_reshaped_output_ast(node.outputs[0], value_info, output_name)
        innermost_let_ast_node = update_program_with_new_node(innermost_let_ast_node, onnx_output_ast, reshaped_output_name, mtdAST)
        out_var_count += 1
        node_name_to_out_var_dict[node.outputs[0]] = reshaped_output_name

        return (innermost_let_ast_node, out_var_count)

    def conv2d(node, value_info, node_name_to_out_var_dict, innermost_let_ast_node, out_var_count, mtdAST):
        inputsRef = node.inputs
        inputShape = value_info[inputsRef[0]][1]
        filterShape = value_info[inputsRef[1]][1]

        stridesUsed = node.attrs['strides'] if 'strides' in node.attrs else [1, 1]
        group = node.attrs['group'] if 'group' in node.attrs else 1
        padding = node.attrs['pads'] if 'pads' in node.attrs else [0, 0, 0, 0]
        dilation = node.attrs['dilation'] if 'dilation' in node.attrs else [1, 1]

        assert len(inputsRef) == 2 or len(inputsRef) == 3
        assert len(stridesUsed) == 2

        # We assume the VALID case when the padding is in string format.

        # print(inputShape, filterShape)
        assert inputShape[1] == filterShape[1] * group
        # For the input: the ONNX order [N, CI, H, W] must be changed to
        # [N, H, W, CI].
        reshaped_input_name = get_new_var_name(out_var_count)
        reshaped_input = get_reshaped_input_ast(inputsRef[0], value_info, node_name_to_out_var_dict)
        innermost_let_ast_node = update_program_with_new_node(innermost_let_ast_node, reshaped_input, reshaped_input_name, mtdAST)
        out_var_count += 1

        # For the filter: the ONNX order [CO, CI1, FH, FW] must be changed to
        # [FH, FW, CI1, CO].
        reshaped_filter_name = get_new_var_name(out_var_count)
        reshaped_filter = get_reshaped_filter_ast(inputsRef[1], value_info, node_name_to_out_var_dict)
        innermost_let_ast_node = update_program_with_new_node(innermost_let_ast_node, reshaped_filter, reshaped_filter_name, mtdAST)
        out_var_count += 1

        seedot_output_ast = AST.Convolution(AST.ID(reshaped_input_name), AST.ID(reshaped_filter_name), stridesUsed, padding, dilation, group)
        output_name = get_new_var_name(out_var_count)
        innermost_let_ast_node = update_program_with_new_node(innermost_let_ast_node, seedot_output_ast, output_name, mtdAST)
        out_var_count += 1

        # If there is a bias, add it.
        if len(inputsRef) == 3:
            seedot_output_ast = AST.Bop1(AST.ID(output_name), SeeDotParser.ADDCIR, AST.ID(inputsRef[2]))
            output_name = get_new_var_name(out_var_count)
            innermost_let_ast_node = update_program_with_new_node(innermost_let_ast_node, seedot_output_ast, output_name, mtdAST)
            out_var_count += 1

        return (innermost_let_ast_node, out_var_count, output_name)

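`conv2d` above reshapes the image from ONNX's [N, CI, H, W] layout to [N, H, W, CI] and the filter from [CO, CI1, FH, FW] to [FH, FW, CI1, CO] before emitting the convolution. A sketch of those shape permutations (plain-Python helpers for illustration, not the real `get_reshaped_*` functions, which emit Reshape AST nodes):

```python
def to_nhwc_shape(onnx_shape):
    # Image: [N, CI, H, W] -> [N, H, W, CI]
    n, ci, h, w = onnx_shape
    return [n, h, w, ci]

def to_seedot_filter_shape(onnx_filter_shape):
    # Filter: [CO, CI1, FH, FW] -> [FH, FW, CI1, CO]
    co, ci1, fh, fw = onnx_filter_shape
    return [fh, fw, ci1, co]
```

The final `get_reshaped_output_ast` call in `Conv` applies the inverse image permutation so downstream ONNX nodes keep seeing NCHW shapes.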
    def conv3d(node, value_info, node_name_to_out_var_dict, innermost_let_ast_node, out_var_count, mtdAST):
        inputsRef = node.inputs
        inputShape = value_info[inputsRef[0]][1]
        filterShape = value_info[inputsRef[1]][1]
        stridesUsed = node.attrs['strides']

        assert len(inputsRef) == 2 or len(inputsRef) == 3
        assert len(stridesUsed) == 3
        assert value_info[node.inputs[1]][1][2:] == tuple(node.attrs['kernel_shape'])
        # TODO: verify this padding order.
        [zPadDLeft, zPadDRight, zPadHLeft, zPadHRight, zPadWLeft, zPadWRight] = node.attrs['pads']

        assert inputShape[1] == filterShape[1]
        # For the input: the ONNX order [N, CI, D, H, W] must be changed to
        # [N, D, H, W, CI].
        reshaped_input_name = get_new_var_name(out_var_count)
        reshaped_input = get_reshaped_input_ast(inputsRef[0], value_info, node_name_to_out_var_dict)
        innermost_let_ast_node = update_program_with_new_node(innermost_let_ast_node, reshaped_input, reshaped_input_name, mtdAST)
        out_var_count += 1

        # For the filter: the ONNX order [CO, CI1, FD, FH, FW] must be changed to
        # [FD, FH, FW, CI1, CO].
        reshaped_filter_name = get_new_var_name(out_var_count)
        reshaped_filter = get_reshaped_filter_ast(inputsRef[1], value_info, node_name_to_out_var_dict)
        innermost_let_ast_node = update_program_with_new_node(innermost_let_ast_node, reshaped_filter, reshaped_filter_name, mtdAST)
        out_var_count += 1

        seedot_output_ast = AST.Bop1(AST.ID(reshaped_input_name), getOperatorsIdx('#'), AST.ID(reshaped_filter_name))
        output_name = get_new_var_name(out_var_count)
        innermost_let_ast_node = update_program_with_new_node(innermost_let_ast_node, seedot_output_ast, output_name, mtdAST)
        out_var_count += 1

        # If there is a bias, reshape and add it.
        if len(inputsRef) == 3:
            reshaped_bias_name = get_new_var_name(out_var_count)
            reshaped_bias = get_reshaped_bias_ast(inputsRef[2], value_info, node_name_to_out_var_dict, 3)
            innermost_let_ast_node = update_program_with_new_node(innermost_let_ast_node, reshaped_bias, reshaped_bias_name, mtdAST)
            out_var_count += 1

            seedot_output_ast = AST.Bop1(AST.ID(output_name), getOperatorsIdx('+'), AST.ID(reshaped_bias_name))
            output_name = get_new_var_name(out_var_count)
            innermost_let_ast_node = update_program_with_new_node(innermost_let_ast_node, seedot_output_ast, output_name, mtdAST)
            out_var_count += 1

        return (innermost_let_ast_node, out_var_count, output_name)

    def MaxPool(node, value_info, node_name_to_out_var_dict, innermost_let_ast_node, out_var_count, mtdAST):
        return ONNXNodesAST.helper_processPool(node, value_info, node_name_to_out_var_dict, innermost_let_ast_node, out_var_count, mtdAST, 'MAXPOOL')

    def AvgPool(node, value_info, node_name_to_out_var_dict, innermost_let_ast_node, out_var_count, mtdAST):
        return ONNXNodesAST.helper_processPool(node, value_info, node_name_to_out_var_dict, innermost_let_ast_node, out_var_count, mtdAST, 'AVGPOOL')

    def GlobalAveragePool(node, value_info, node_name_to_out_var_dict, innermost_let_ast_node, out_var_count, mtdAST):
        node = OnnxNode(node)
        if DEBUG:
            print(node)
        inputsRef = node.inputs
        assert len(inputsRef) == 1

        reshaped_input_name = get_new_var_name(out_var_count)
        reshaped_input = get_reshaped_input_ast(inputsRef[0], value_info, node_name_to_out_var_dict)
        innermost_let_ast_node = update_program_with_new_node(innermost_let_ast_node, reshaped_input, reshaped_input_name, mtdAST)
        out_var_count += 1

        # Global average pooling is a single average-pool window covering the
        # whole spatial extent (FH = H, FW = W, stride 1, no padding).
        seedot_output_ast = AST.Pool(AST.Pool.PoolType.AvgPool,
                                     AST.ID(reshaped_input_name),
                                     {
                                         AST.PaddingKeysDict.FH: value_info[inputsRef[0]][1][2],
                                         AST.PaddingKeysDict.FW: value_info[inputsRef[0]][1][3],
                                         AST.PaddingKeysDict.zPadHLeft: 0,
                                         AST.PaddingKeysDict.zPadHRight: 0,
                                         AST.PaddingKeysDict.zPadWLeft: 0,
                                         AST.PaddingKeysDict.zPadWRight: 0,
                                         AST.PaddingKeysDict.strideH: 1,
                                         AST.PaddingKeysDict.strideW: 1
                                     })
        output_name = get_new_var_name(out_var_count)
        innermost_let_ast_node = update_program_with_new_node(innermost_let_ast_node, seedot_output_ast, output_name, mtdAST)
        out_var_count += 1

        reshaped_output_name = get_new_var_name(out_var_count)
        onnx_output_ast = get_reshaped_output_ast(node.outputs[0], value_info, output_name)
        innermost_let_ast_node = update_program_with_new_node(innermost_let_ast_node, onnx_output_ast, reshaped_output_name, mtdAST)
        out_var_count += 1
        node_name_to_out_var_dict[node.outputs[0]] = reshaped_output_name

        return (innermost_let_ast_node, out_var_count)

    def helper_processPool(node, value_info, node_name_to_out_var_dict, innermost_let_ast_node, out_var_count, mtdAST, typeOfPool):
        node = OnnxNode(node)
        if DEBUG:
            print(node)
        inputsRef = node.inputs
        assert len(inputsRef) == 1

        stridesUsed = node.attrs['strides'] if 'strides' in node.attrs else [1, 1]
        kernelSize = node.attrs['kernel_shape']
        padding = node.attrs['pads'] if 'pads' in node.attrs else [0, 0, 0, 0]

        inputShape = value_info[inputsRef[0]][1]

        reshaped_input_name = get_new_var_name(out_var_count)
        reshaped_input = get_reshaped_input_ast(inputsRef[0], value_info, node_name_to_out_var_dict)
        innermost_let_ast_node = update_program_with_new_node(innermost_let_ast_node, reshaped_input, reshaped_input_name, mtdAST)
        out_var_count += 1

        poolType = None
        if typeOfPool == 'MAXPOOL':
            poolType = AST.Maxpool
        elif typeOfPool == 'AVGPOOL':
            poolType = AST.Avgpool
        else:
            print("Unknown type of pooling layer.", file=sys.stderr)
            assert False
        seedot_output_ast = poolType(
            AST.ID(reshaped_input_name),
            kernelSize, padding, stridesUsed
        )
        output_name = get_new_var_name(out_var_count)
        innermost_let_ast_node = update_program_with_new_node(innermost_let_ast_node, seedot_output_ast, output_name, mtdAST)
        out_var_count += 1

        reshaped_output_name = get_new_var_name(out_var_count)
        onnx_output_ast = get_reshaped_output_ast(node.outputs[0], value_info, output_name)
        innermost_let_ast_node = update_program_with_new_node(innermost_let_ast_node, onnx_output_ast, reshaped_output_name, mtdAST)
        out_var_count += 1
        node_name_to_out_var_dict[node.outputs[0]] = reshaped_output_name

        return (innermost_let_ast_node, out_var_count)

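For the pooling layers above, the output spatial size follows the usual floor-mode formula from kernel size, padding, and stride. A sketch of that arithmetic (`pool_output_dim` is an illustrative helper, not part of the real module):

```python
def pool_output_dim(in_dim, kernel, pad_begin, pad_end, stride):
    # Standard floor-mode pooling arithmetic:
    # out = floor((in + pad_begin + pad_end - kernel) / stride) + 1
    return (in_dim + pad_begin + pad_end - kernel) // stride + 1
```

GlobalAveragePool is the degenerate case: kernel equal to the full spatial extent, stride 1, no padding, so the output spatial size is 1.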
    def ConvTranspose(node, value_info, node_name_to_out_var_dict, innermost_let_ast_node, out_var_count, mtdAST):
        node = OnnxNode(node)
        if DEBUG:
            print(node)

        inputsRef = node.inputs
        # The first two dimensions are N (batch size) and CI (input channels).
        inputShape = value_info[inputsRef[0]][1]
        spatial_size = len(inputShape) - 2
        if spatial_size == 2:
            (innermost_let_ast_node, out_var_count, output_name) = ONNXNodesAST.conv2dtranspose(node, value_info, node_name_to_out_var_dict, innermost_let_ast_node, out_var_count, mtdAST)
        elif spatial_size == 3:
            (innermost_let_ast_node, out_var_count, output_name) = ONNXNodesAST.conv3dtranspose(node, value_info, node_name_to_out_var_dict, innermost_let_ast_node, out_var_count, mtdAST)
        else:
            assert False, "Only 2D and 3D transposed convolutions are supported."

        reshaped_output_name = get_new_var_name(out_var_count)
        onnx_output_ast = get_reshaped_output_ast(node.outputs[0], value_info, output_name)
        innermost_let_ast_node = update_program_with_new_node(innermost_let_ast_node, onnx_output_ast, reshaped_output_name, mtdAST)
        out_var_count += 1
        node_name_to_out_var_dict[node.outputs[0]] = reshaped_output_name

        return (innermost_let_ast_node, out_var_count)

    def conv2dtranspose(node, value_info, node_name_to_out_var_dict, innermost_let_ast_node, out_var_count, mtdAST):
        inputsRef = node.inputs
        inputShape = value_info[inputsRef[0]][1]
        filterShape = value_info[inputsRef[1]][1]
        stridesUsed = node.attrs['strides']
        outputShape = value_info[node.outputs[0]][1]

        # Sometimes there is also a bias to be added.
        assert len(inputsRef) == 2 or len(inputsRef) == 3
        assert len(stridesUsed) == 2
        assert value_info[node.inputs[1]][1][2:] == tuple(node.attrs['kernel_shape'])
        # TODO: verify this padding order.
        [zPadHLeft, zPadHRight, zPadWLeft, zPadWRight] = node.attrs['pads']

        options = {}
        options[AST.PaddingKeysDict.FH] = filterShape[2]
        options[AST.PaddingKeysDict.FW] = filterShape[3]
        options[AST.PaddingKeysDict.zPadHLeft] = zPadHLeft
        options[AST.PaddingKeysDict.zPadHRight] = zPadHRight
        options[AST.PaddingKeysDict.zPadWLeft] = zPadWLeft
        options[AST.PaddingKeysDict.zPadWRight] = zPadWRight
        options[AST.PaddingKeysDict.strideH] = stridesUsed[0]
        options[AST.PaddingKeysDict.strideW] = stridesUsed[1]
        options[AST.PaddingKeysDict.ConvDim] = 2
        options[AST.PaddingKeysDict.outputImgH] = outputShape[2]
        options[AST.PaddingKeysDict.outputImgW] = outputShape[3]

        assert inputShape[1] == filterShape[0]
        # For the input: the ONNX order [N, CI, H, W] must be changed to
        # [N, H, W, CI].
        reshaped_input_name = get_new_var_name(out_var_count)
        reshaped_input = get_reshaped_input_ast(inputsRef[0], value_info, node_name_to_out_var_dict)
        innermost_let_ast_node = update_program_with_new_node(innermost_let_ast_node, reshaped_input, reshaped_input_name, mtdAST)
        out_var_count += 1

        # For the filter: the ONNX order [CI, CO, FH, FW] must be changed to
        # [FH, FW, CI1, CO].
        reshaped_filter_name = get_new_var_name(out_var_count)
        reshaped_filter = get_reshaped_filter_ast(inputsRef[1], value_info, node_name_to_out_var_dict)
        innermost_let_ast_node = update_program_with_new_node(innermost_let_ast_node, reshaped_filter, reshaped_filter_name, mtdAST)
        out_var_count += 1

        seedot_output_ast = AST.BOp(AST.ID(reshaped_input_name), getOperatorsIdx('#T'), AST.ID(reshaped_filter_name), options)
        output_name = get_new_var_name(out_var_count)
        innermost_let_ast_node = update_program_with_new_node(innermost_let_ast_node, seedot_output_ast, output_name, mtdAST)
        out_var_count += 1

        # If there is a bias, reshape and add it.
        if len(inputsRef) == 3:
            biasShape = value_info[inputsRef[2]][1]
            reshaped_bias_name = get_new_var_name(out_var_count)
            reshaped_bias = get_reshaped_bias_ast(inputsRef[2], value_info, node_name_to_out_var_dict, 2)
            innermost_let_ast_node = update_program_with_new_node(innermost_let_ast_node, reshaped_bias, reshaped_bias_name, mtdAST)
            out_var_count += 1

            seedot_output_ast = AST.BOp(AST.ID(output_name), getOperatorsIdx('+'), AST.ID(reshaped_bias_name), options)
            output_name = get_new_var_name(out_var_count)
            innermost_let_ast_node = update_program_with_new_node(innermost_let_ast_node, seedot_output_ast, output_name, mtdAST)
            out_var_count += 1

        return (innermost_let_ast_node, out_var_count, output_name)

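`conv2dtranspose` above passes the expected output H/W through `options` because transposed convolution inverts the usual convolution shape arithmetic. A sketch of the standard relation (`conv_transpose_output_dim` is an illustrative helper, not part of the real module; `output_padding` is the optional ONNX attribute, assumed 0 elsewhere in this file):

```python
def conv_transpose_output_dim(in_dim, kernel, pad_begin, pad_end, stride, output_padding=0):
    # Inverse of the convolution shape formula:
    # out = stride * (in - 1) + kernel - pad_begin - pad_end + output_padding
    return stride * (in_dim - 1) + kernel - pad_begin - pad_end + output_padding
```

Since the inverse is not unique (floor division loses information), carrying `outputImgH`/`outputImgW` explicitly, as the code does, is more robust than recomputing them.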
def conv3dtranspose(node, value_info, node_name_to_out_var_dict, innermost_let_ast_node, out_var_count, mtdAST):
    inputsRef = node.inputs
    inputShape = value_info[inputsRef[0]][1]
    filterShape = value_info[inputsRef[1]][1]
    stridesUsed = node.attrs['strides']
    outputShape = value_info[node.outputs[0]][1]

    # Sometimes there is a bias to be added as well.
    assert(len(inputsRef) == 2 or len(inputsRef) == 3)
    assert(len(stridesUsed) == 3)
    assert(value_info[node.inputs[1]][1][2:] == tuple(node.attrs['kernel_shape']))
    # Verify this order.
    [zPadDLeft, zPadDRight, zPadHLeft, zPadHRight, zPadWLeft, zPadWRight] = node.attrs['pads']

    options = {}
    options[AST.PaddingKeysDict.FD] = filterShape[2]
    options[AST.PaddingKeysDict.FH] = filterShape[3]
    options[AST.PaddingKeysDict.FW] = filterShape[4]
    options[AST.PaddingKeysDict.zPadDLeft] = zPadDLeft
    options[AST.PaddingKeysDict.zPadDRight] = zPadDRight
    options[AST.PaddingKeysDict.zPadHLeft] = zPadHLeft
    options[AST.PaddingKeysDict.zPadHRight] = zPadHRight
    options[AST.PaddingKeysDict.zPadWLeft] = zPadWLeft
    options[AST.PaddingKeysDict.zPadWRight] = zPadWRight
    options[AST.PaddingKeysDict.strideD] = stridesUsed[0]
    options[AST.PaddingKeysDict.strideH] = stridesUsed[1]
    options[AST.PaddingKeysDict.strideW] = stridesUsed[2]
    options[AST.PaddingKeysDict.ConvDim] = 3
    options[AST.PaddingKeysDict.outputImgD] = outputShape[2]
    options[AST.PaddingKeysDict.outputImgH] = outputShape[3]
    options[AST.PaddingKeysDict.outputImgW] = outputShape[4]

    assert(inputShape[1] == filterShape[0])
    # For Input:
    # [N, CI, D, H, W] is the ONNX order; it should be changed to
    # [N, D, H, W, CI] order.
    reshaped_input_name = get_new_var_name(out_var_count)
    reshaped_input = get_reshaped_input_ast(inputsRef[0], value_info, node_name_to_out_var_dict)
    innermost_let_ast_node = update_program_with_new_node(innermost_let_ast_node, reshaped_input, reshaped_input_name, mtdAST)
    out_var_count += 1

    # For filter:
    # [CI, CO, FD, FH, FW] is the ONNX order; it should be changed to
    # [FD, FH, FW, CI1, CO] order.
    reshaped_filter_name = get_new_var_name(out_var_count)
    reshaped_filter = get_reshaped_filter_ast(inputsRef[1], value_info, node_name_to_out_var_dict)
    innermost_let_ast_node = update_program_with_new_node(innermost_let_ast_node, reshaped_filter, reshaped_filter_name, mtdAST)
    out_var_count += 1

    seedot_output_ast = AST.BOp(AST.ID(reshaped_input_name), getOperatorsIdx('#T'), AST.ID(reshaped_filter_name), options)
    output_name = get_new_var_name(out_var_count)
    innermost_let_ast_node = update_program_with_new_node(innermost_let_ast_node, seedot_output_ast, output_name, mtdAST)
    out_var_count += 1

    # If there is a bias to be added, then reshape and add it.
    if (len(inputsRef) == 3):
        biasShape = value_info[inputsRef[2]][1]
        reshaped_bias_name = get_new_var_name(out_var_count)
        reshaped_bias = get_reshaped_bias_ast(inputsRef[2], value_info, node_name_to_out_var_dict, 3)
        innermost_let_ast_node = update_program_with_new_node(innermost_let_ast_node, reshaped_bias, reshaped_bias_name, mtdAST)
        out_var_count += 1

        seedot_output_ast = AST.BOp(AST.ID(output_name), getOperatorsIdx('+'), AST.ID(reshaped_bias_name), options)
        output_name = get_new_var_name(out_var_count)
        innermost_let_ast_node = update_program_with_new_node(innermost_let_ast_node, seedot_output_ast, output_name, mtdAST)
        out_var_count += 1

    return (innermost_let_ast_node, out_var_count, output_name)
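The layout changes described in the comments above are plain axis permutations. A minimal numpy sketch of the same reordering (the shapes here are illustrative, not taken from any particular model):

```python
import numpy as np

# Input: ONNX stores activations as [N, CI, D, H, W]; the target layout is [N, D, H, W, CI].
x = np.zeros((1, 3, 8, 8, 8))               # [N, CI, D, H, W]
x_ndhwc = np.transpose(x, (0, 2, 3, 4, 1))  # -> [N, D, H, W, CI]

# Filter: ONNX stores ConvTranspose weights as [CI, CO, FD, FH, FW];
# the target layout is [FD, FH, FW, CI, CO].
w = np.zeros((3, 16, 5, 5, 5))              # [CI, CO, FD, FH, FW]
w_dhwio = np.transpose(w, (2, 3, 4, 0, 1))  # -> [FD, FH, FW, CI, CO]

print(x_ndhwc.shape)  # (1, 8, 8, 8, 3)
print(w_dhwio.shape)  # (5, 5, 5, 3, 16)
```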
@ -0,0 +1,46 @@
# Introduction
This part of the code compiles an ONNX model to a SeeDot AST.

A model name must be provided to the `compile.sh` script, and the model must be placed in the `./models` directory.
The script can be run with the `./compile.sh model_name.onnx` command on the command line.

1) The script calls `onnx_run.py` to generate a random input whose size matches the input size of the model. `onnx_run.py` then runs the model using `onnxruntime` and stores the output as a `numpy` array. The input is stored as `model_name_input.npy` and the output is stored as `model_name_output.npy`.

2) It then runs `process_onnx.py`. This Python code combines `model_name_input.npy` and the values of the other variables stored in the model to generate a `model_name_input.h` file, which is later fed to the final code as input. `model_name_input.h` has all the values stored as fixed-point integers, using the value of the scale in the script.

3) It then runs `shape_inference` to calculate the input and output size of each ONNX node, parses the ONNX model using `OnnxNodesAST.py`, and creates a SeeDot AST, which is stored as `model_name.pkl` (using pickle).

4) The `compile.sh` script then converts the SeeDot AST to EzPC code, and the EzPC code is finally converted to a `CPP` program. This `CPP` program is compiled and run with the given input. The output is stored in `debug/cpp_output_raw.txt`. Using the same scale, this raw output is converted to floating point and stored in `debug/cpp_output.txt` for easier manual comparison with the original ONNX output.
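Steps 2 and 4 use the same scale in both directions: a float is encoded as a fixed-point integer by multiplying by 2^scale and truncating, and decoded by dividing. A minimal sketch of the conversion (the scale value 24 mirrors the `SCALINGFACTOR` default in `compile.sh`; the helper names are illustrative):

```python
SCALE = 24  # matches SCALINGFACTOR in compile.sh

def to_fixed(x, scale=SCALE):
    # Encode a float as a fixed-point integer (truncating, as the compiler does).
    return int(x * (2 ** scale))

def to_float(n, scale=SCALE):
    # Decode a fixed-point integer back to a float.
    return n / (2 ** scale)

v = 0.15625  # exactly representable in binary, so the round trip is lossless
n = to_fixed(v)
print(n)            # 2621440
print(to_float(n))  # 0.15625
```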
# Debugging and Logging
Since debugging the code is an arduous task, several things are logged in the following files.

To log the values of specific variables, the script can be run in debug mode using `./compile.sh model_name.onnx name_of_onnx_node`.

`onnx_seedot_name_map.txt` It stores a map from ONNX names to SeeDot names of variables.

`seedot_ezpc_name_map.txt` It stores a map from SeeDot names to EzPC names of variables.

`onnx_ezpc_name_map.txt` The above two maps are combined to create a map from ONNX names to EzPC/CPP names.

`cpp_output_raw.txt` It contains the raw output of running the final code. If the script is run in `debug` mode with a debug name specified, the output contains the values of the selected debug variable instead of the final variable.

`cpp_output.txt` The above file is parsed, and all fixed-point integer values are converted to an easily readable floating-point format. As before, in `debug` mode the output contains the value of the debug variable.

`onnx_debug.txt` In debug mode, this file contains the value of the selected ONNX node computed using the ONNX runtime.

`onnx_output.txt` This file contains the value of the output computed using the ONNX runtime.

`seedot_ast.txt` The output of `process_onnx.py`, including the generated SeeDot AST, is logged here.

`seedot_to_ezpc_output.txt` The output of the SeeDot-to-EzPC compilation is logged here.

# Dependencies
In addition to the EzPC dependencies:

`onnx`

`onnxruntime`

# Testing
`python3 -m unittest`
@ -0,0 +1,86 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT license.

import numpy
import os
import _pickle as pickle
import re

def proto_val_to_dimension_tuple(proto_val):
    return tuple([dim.dim_value for dim in proto_val.type.tensor_type.shape.dim])

def numpy_float_array_to_fixed_point_val_str(input_array, scale):
    cnt = 0
    chunk = ''
    for val in numpy.nditer(input_array):
        val = int(val * (2 ** scale))
        chunk += str(val) + '\n'
        cnt += 1
    return (chunk, cnt)

def numpy_float_array_to_float_val_str(input_array):
    chunk = ''
    for val in numpy.nditer(input_array):
        chunk += str(val) + '\n'
    return chunk

def write_debug_info(node_name_to_out_var_dict):
    if not os.path.exists('debug'):
        os.makedirs('debug')

    with open('debug/onnx_seedot_name_map.pkl', 'wb') as f:
        pickle.dump(node_name_to_out_var_dict, f)

    with open('debug/onnx_seedot_name_map.txt', 'w') as f:
        for val in node_name_to_out_var_dict:
            f.write(val + ' ' + node_name_to_out_var_dict[val] + '\n')

def merge_name_map():
    onnx_seedot_name_map = pickle.load(open('debug/onnx_seedot_name_map.pkl', 'rb'))
    seedot_ezpc_name_map = pickle.load(open('debug/seedot_ezpc_name_map.pkl', 'rb'))

    with open('debug/onnx_ezpc_name_map.txt', 'w') as f:
        for val in onnx_seedot_name_map:
            f.write(val + ' ' + seedot_ezpc_name_map[onnx_seedot_name_map[val]])

def get_seedot_name_from_onnx_name(onnx_name):
    onnx_seedot_name_map = pickle.load(open('debug/onnx_seedot_name_map.pkl', 'rb'))
    print(onnx_seedot_name_map[onnx_name])

def parse_output(scale):
    f = open('debug/cpp_output_raw.txt', 'r')
    g = open('debug/cpp_output.txt', 'w')
    chunk = ''
    for line in f:
        # Keep only integer-valued lines (possibly negative); skip everything else.
        if line.rstrip().replace('-', '0').isdigit():
            val = float(line.rstrip())
            val = val / (2 ** scale)
            chunk += str(val) + '\n'
    g.write(chunk)
    g.close()
    f.close()

def extract_txt_to_numpy_array(file):
    f = open(file, 'r')
    op = [float(line.rstrip()) for line in f]
    f.close()
    return numpy.array(op, dtype=numpy.float32)

def match_debug(decimal=4):
    a = extract_txt_to_numpy_array('debug/onnx_debug.txt')
    b = extract_txt_to_numpy_array('debug/cpp_output.txt')
    numpy.testing.assert_almost_equal(a, b, decimal)

def match_output(decimal=4):
    a = extract_txt_to_numpy_array('debug/onnx_output.txt')
    b = extract_txt_to_numpy_array('debug/cpp_output.txt')
    numpy.testing.assert_almost_equal(a, b, decimal)

def add_openmp_threading_to_convolution(file):
    with open(file, 'r+') as f:
        newfilename = file[:-5] + '1.cpp'
        g = open(newfilename, 'w')
        content = f.read()
        # \g<0> re-emits the matched function signature before the pragma.
        content1 = re.sub(r'void Conv3D\(.*', '\\g<0> \n #pragma omp parallel for collapse(5) ', content)
        content2 = re.sub(r'void ConvTranspose3D\(.*', '\\g<0> \n #pragma omp parallel for collapse(5) ', content1)
        g.write(content2)
        g.close()
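`add_openmp_threading_to_convolution` relies on `re.sub`'s `\g<0>` backreference, which expands to the entire match, so the matched function signature is kept and the pragma lands on the following line. The idiom in isolation (the C++ signature here is invented for the example):

```python
import re

src = "void Conv3D(int N, int D, int H, int W) {"
# \g<0> expands to the whole match; \n in the replacement becomes a newline.
out = re.sub(r'void Conv3D\(.*', r'\g<0>' + '\n#pragma omp parallel for collapse(5)', src)
print(out)
```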
@ -0,0 +1,110 @@
#!/bin/bash

# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT license.

# This script will
# 1) Compile the ONNX model to a SeeDot AST.
# 2) Compile the SeeDot AST to EzPC.
# 3) Convert the EzPC code to CPP and then run it on the given dataset.

# Any subsequent commands that fail will cause the shell script to exit immediately.
set -e

modelName=$1
debugOnnxNode=$2

EzPCDir="../../EzPC"
ONNX_dir="../../Athos/ONNXCompiler"
data_dir="debug/"${modelName}
BITLEN="64"
SCALINGFACTOR="24"
COMPILATIONTARGET="CPP"
ezpcOutputFullFileName=${modelName}'.ezpc'
compilationTargetLower=$(echo "$COMPILATIONTARGET" | awk '{print tolower($0)}')
compilationTargetHigher=$(echo "$COMPILATIONTARGET" | awk '{print toupper($0)}')
finalCodeOutputFileName=${modelName}'0.cpp'
finalCodeOutputFileName1=${modelName}'1.cpp'
inputFileName=${modelName}'_input.h'
seedotASTName=${modelName}'.pkl'

# modelname_input.npy and modelname_output.npy
onnxInputFileName=${modelName}'_input.npy'
onnxOutputFileName=${modelName}'_output.npy'

GREEN='\033[0;32m'
NC='\033[0m' # No Color

mkdir -p debug
mkdir -p ${data_dir}

# Generating input may take time, hence skip if already generated.
if [ -f ${data_dir}"/"${inputFileName} ]; then
    echo -e "${GREEN}$inputFileName already exists, skipping process_onnx${NC}"
else
    echo "Starting to generate random input"
    python3 "create_input.py" ${modelName}'.onnx' $SCALINGFACTOR
    echo -e "${GREEN}Finished generating input${NC}"
fi

echo "Starting onnx run"
# Either 'onnx_run_tf' or 'onnx_run' can be used.
# onnx_run is faster and has fewer dependencies,
# but may not support all operations.
python3 "onnx_run.py" ${modelName}'.onnx' ${debugOnnxNode} > "debug/log_onnx_run.txt"
echo -e "${GREEN}Finished onnx run${NC}"

echo "Starting process_onnx"
echo "Output of process_onnx and the resultant SeeDot AST are logged in debug/seedot_ast.txt"
python3 "process_onnx.py" ${modelName}'.onnx' > "debug/seedot_ast.txt"
echo -e "${GREEN}Finished process_onnx${NC}"

echo "Starting SeeDot to EzPC compilation"
echo "Output is logged in debug/seedot_to_ezpc_output.txt"

if [ -z "$debugOnnxNode" ]; then
    python3 ../SeeDot/SeeDot.py -p $seedotASTName --astFile ${data_dir}"/"$seedotASTName --outputFileName ${data_dir}"/"${ezpcOutputFullFileName} --consSF ${SCALINGFACTOR} --bitlen "$BITLEN" > "debug/seedot_to_ezpc_output.txt"
else
    debugSeedotNode=$(python3 -c "import common; common.get_seedot_name_from_onnx_name(\"${debugOnnxNode}\")")
    echo "${debugSeedotNode} is the corresponding SeeDot name"
    python3 ../SeeDot/SeeDot.py -p $seedotASTName --astFile ${data_dir}"/"$seedotASTName --outputFileName ${data_dir}"/"${ezpcOutputFullFileName} --consSF ${SCALINGFACTOR} --debugVar ${debugSeedotNode} --bitlen "$BITLEN" > "debug/seedot_to_ezpc_output.txt"
fi
echo -e "${GREEN}Finished SeeDot to EzPC compilation${NC}"

python3 -c 'import common; common.merge_name_map()'

cat "../TFEzPCLibrary/Library${BITLEN}_cpp.ezpc" "../TFEzPCLibrary/Library${BITLEN}_common.ezpc" ${data_dir}"/"${ezpcOutputFullFileName} > temp
mv temp "$ezpcOutputFullFileName"

mv "$ezpcOutputFullFileName" "$EzPCDir/EzPC"
cd "$EzPCDir/EzPC"
eval `opam config env`

echo "Starting with EzPC to CPP compilation"
./ezpc.sh "$ezpcOutputFullFileName" --bitlen "$BITLEN" --codegen "$compilationTargetHigher" --disable-tac
echo -e "${GREEN}Finished EzPC to CPP compilation${NC}"

# Move the generated CPP file out, then delete the remaining generated files.
mv "$finalCodeOutputFileName" "$ONNX_dir"
DIREZPC="${EzPCDir}/EzPC/${modelName}"
for file in "$DIREZPC"*
do
    rm "${file}"
done

if [ "$compilationTargetLower" == "cpp" ]; then
    cd "$ONNX_dir"
    mv "$finalCodeOutputFileName" "$data_dir"

    echo "Adding OpenMP threading instructions to the 3D convolutions"
    python3 -c "import common; common.add_openmp_threading_to_convolution('${data_dir}/${finalCodeOutputFileName}')"

    echo "Compiling generated CPP code"
    g++ -O3 -g -w -fopenmp ${data_dir}"/"${finalCodeOutputFileName1} -o ${data_dir}"/"${modelName}".out"
    echo -e "${GREEN}Compiling done${NC}"
    rm -f "debug/cpp_output_raw.txt" || true
    echo "Running the final code"
    eval './'${data_dir}'/'${modelName}'.out' < ${data_dir}'/'${inputFileName} > "debug/cpp_output_raw.txt"
    python3 -c "import common; common.parse_output(${SCALINGFACTOR})"
    echo -e "${GREEN}All operations done.${NC}"
fi
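The target-name handling above derives both the lowercase and uppercase spelling from a single variable via awk, which is portable to shells without bash 4's `${var,,}` expansion. The idiom on its own (values mirror the script's defaults):

```shell
#!/bin/bash
COMPILATIONTARGET="CPP"
# awk's tolower/toupper provide portable case conversion.
compilationTargetLower=$(echo "$COMPILATIONTARGET" | awk '{print tolower($0)}')
compilationTargetHigher=$(echo "$COMPILATIONTARGET" | awk '{print toupper($0)}')
echo "$compilationTargetLower $compilationTargetHigher"  # cpp CPP
```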
@ -0,0 +1,30 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT license.

import numpy as np
import onnxruntime
import common
import os, sys
import onnx
from onnx import helper

# Runs the ONNX model (loaded earlier from file_path, a module-level variable)
# and returns the prediction for the given input.
def get_onnx_output(model, input, intermediate_node=None):
    sess = onnxruntime.InferenceSession(file_path)

    x = input
    x = x.astype(np.float32)

    input_name = model.graph.input[0].name

    if (intermediate_node != None):
        # Expose the requested intermediate node as a graph output
        # and re-run the model to extract its value.
        intermediate_layer_value_info = helper.ValueInfoProto()
        intermediate_layer_value_info.name = intermediate_node
        model.graph.output.extend([intermediate_layer_value_info])
        onnx.save(model, file_path + '_1')
        sess = onnxruntime.InferenceSession(file_path + '_1')
        pred = sess.run([intermediate_layer_value_info.name], {input_name: x})
        return pred

    pred = sess.run(None, {input_name: x})
    return pred
@ -0,0 +1,75 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT license.

'''
onnx_run is faster but may not support all operations.
onnx_run_tf uses the tensorflow backend to run the inference.
'''

import numpy as np
import common
import os, sys
import onnx
from onnx import helper
from onnx_tf.backend import prepare
from onnx import TensorProto

def main():
    # First read the ONNX file.
    if (len(sys.argv) < 2):
        print("ONNX model file unspecified.", file=sys.stderr)
        exit(1)

    file_name = sys.argv[1]
    file_path = 'models/' + file_name
    model_name = file_name[:-5]  # name without the '.onnx' extension
    model = onnx.load(file_path)
    model = preprocess_for_tf(model)

    x = np.load('debug/' + model_name + '/' + model_name + '_input.npy')
    x = x.astype(np.float32)

    input_name = model.graph.input[0].name
    output_name = model.graph.output[0].name

    if (len(sys.argv) > 2):
        intermediate_layer_value_info_name = 'tf_' + sys.argv[2]
        intermediate_layer_value_info = helper.make_tensor_value_info(intermediate_layer_value_info_name, TensorProto.FLOAT, [])
        model.graph.output.extend([intermediate_layer_value_info])
        output = prepare(model).run(x)
        pred = getattr(output, intermediate_layer_value_info_name)
        np.save('debug/' + model_name + '/' + model_name + '_debug', pred)
        with open('debug/onnx_debug.txt', 'w') as f:
            f.write(common.numpy_float_array_to_float_val_str(pred))
        print("Saving the ONNX runtime intermediate output for " + intermediate_layer_value_info.name)
        exit()

    output = prepare(model).run(x)
    pred = getattr(output, output_name)
    np.save('debug/' + model_name + '/' + model_name + '_output', pred)
    with open('debug/onnx_output.txt', 'w') as f:
        f.write(common.numpy_float_array_to_float_val_str(pred))
    output_dims = common.proto_val_to_dimension_tuple(model.graph.output[0])
    print("Saving the ONNX runtime output of dimension " + str(output_dims))

def preprocess_for_tf(model):
    # Prefix every name with 'tf_' so that all names are valid tensorflow identifiers.
    for init_vals in model.graph.initializer:
        init_vals.name = 'tf_' + init_vals.name

    for inp in model.graph.input:
        inp.name = 'tf_' + inp.name

    for op in model.graph.output:
        op.name = 'tf_' + op.name

    for node in model.graph.node:
        node.name = 'tf_' + node.name
        for i in range(len(node.input)):
            node.input[i] = 'tf_' + node.input[i]
        for i in range(len(node.output)):
            node.output[i] = 'tf_' + node.output[i]
    return model

if __name__ == "__main__":
    main()
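The renaming in `preprocess_for_tf` is a uniform prefix over every name field of the graph. The same logic, applied to a plain stand-in structure (dicts instead of protobuf nodes, purely illustrative):

```python
# Hypothetical stand-in for an ONNX graph: each node has a name,
# an input list, and an output list, like onnx NodeProto.
graph = [
    {"name": "conv1", "input": ["x", "w1"], "output": ["y1"]},
    {"name": "relu1", "input": ["y1"], "output": ["y2"]},
]

def prefix_names(nodes, prefix="tf_"):
    # Rename nodes and every tensor they reference consistently,
    # so producer/consumer links stay intact.
    for node in nodes:
        node["name"] = prefix + node["name"]
        node["input"] = [prefix + i for i in node["input"]]
        node["output"] = [prefix + o for o in node["output"]]
    return nodes

prefix_names(graph)
print(graph[0]["name"])   # tf_conv1
print(graph[1]["input"])  # ['tf_y1']
```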
@ -0,0 +1,65 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT license.

import numpy as np
import onnxruntime
import common
import os, sys
import onnx
from onnx import helper

file_path = '../../../model/lenet/cifar-multiclass/input.onnx'
model = onnx.load(file_path)
sess = onnxruntime.InferenceSession(file_path)

dataset_path = '../../../datasets/lenet/cifar-multiclass/test_onnx.npy'
test = np.load(dataset_path)

run_all = True
intermediate = None

correct = 0
total = 0

for i in range(test.shape[0] if run_all else 1):
    # Each row stores the label followed by the flattened input.
    x = test[i, 1:].reshape(-1, 1)
    # x = test[i,1:].reshape(1,32,32,3).transpose(0,3,1,2).reshape(-1,1)
    output = test[i, 0]

    # print(x.shape)
    # print(output)

    input_name = model.graph.input[0].name
    x = x.astype(np.float32)

    if (intermediate is not None):
        intermediate_layer_value_info = helper.ValueInfoProto()
        intermediate_layer_value_info.name = intermediate
        model.graph.output.extend([intermediate_layer_value_info])
        onnx.save(model, file_path + '_1')
        sess = onnxruntime.InferenceSession(file_path + '_1')
        pred = sess.run([intermediate_layer_value_info.name], {input_name: x})
        # np.save('debug/' + model_name + '/' + model_name + '_debug', pred)
        # with open('debug/onnx_debug.txt', 'w') as f:
        #     f.write(common.numpy_float_array_to_float_val_str(pred))
        # print("Saving the onnx runtime intermediate output for " + intermediate_layer_value_info.name)
        print(len(pred))
        print(pred[0])
        exit()

    pred = sess.run(None, {input_name: x})

    predicted_class = pred[0][0] + 1
    print(predicted_class)
    print(int(output))

    correct += (predicted_class == int(output))
    total += 1

    # np.save('debug/' + model_name + '/' + model_name + '_output', pred)
    # with open('debug/onnx_output.txt', 'w') as f:
    #     f.write(common.numpy_float_array_to_float_val_str(pred))
    # output_dims = common.proto_val_to_dimension_tuple(model.graph.output[0])
    # print("Saving the onnx runtime output of dimension " + str(output_dims))

print(str((float(correct) * 100) / float(total)) + '% is the accuracy')
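The accuracy bookkeeping at the end of the loop is just a running count, exploiting the fact that a Python boolean adds as 0 or 1. In isolation (predictions and labels invented for the example):

```python
predictions = [1, 2, 3, 3, 2]
labels = [1, 2, 3, 1, 2]

correct = 0
total = 0
for pred, label in zip(predictions, labels):
    correct += (pred == label)  # a bool counts as 0/1
    total += 1

print(str((float(correct) * 100) / float(total)) + '% is the accuracy')  # 80.0% is the accuracy
```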
@ -0,0 +1,80 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT license.

import numpy.random
import numpy as np
import os, sys
import onnx
from onnx import helper
import math
from onnx import numpy_helper

import seedot.compiler.ONNX.common as common

# TODO: Refactor this later. Also part of the paramBuilder file.
class Param:

    def __init__(self, name, shape, range):
        self.name = name
        self.shape = shape
        self.range = range
        self.sparse = False

# TODO: Shift to common.py.
def get_range(np_array):
    return (np.min(np_array), np.max(np_array))

def getParams(file_path):
    model = onnx.load(file_path)
    graph_def = model.graph

    model_name_to_val_dict = {init_vals.name: numpy_helper.to_array(init_vals).tolist() for init_vals in model.graph.initializer}

    paramList = []

    for init_vals in model.graph.initializer:
        name = init_vals.name
        shape = numpy_helper.to_array(init_vals).shape
        range = get_range(numpy_helper.to_array(init_vals))
        param = Param(name, shape, range)
        param.data = numpy_helper.to_array(init_vals).reshape((1, -1)).tolist()
        paramList.append(param)

    return paramList

def preprocess_batch_normalization(graph_def, model_name_to_val_dict):
    # Set names of graph nodes if not present.
    for node in graph_def.node:
        node.name = node.output[0]
        # Update the batch normalization scale and B
        # so that mean and var are not required.
        if (node.op_type == 'BatchNormalization'):
            # scale
            gamma = model_name_to_val_dict[node.input[1]]
            # B
            beta = model_name_to_val_dict[node.input[2]]
            mean = model_name_to_val_dict[node.input[3]]
            var = model_name_to_val_dict[node.input[4]]
            for i in range(len(gamma)):
                rsigma = 1 / math.sqrt(var[i] + 1e-5)
                gamma[i] = gamma[i] * rsigma
                beta[i] = beta[i] - gamma[i] * mean[i]
                mean[i] = 0
                var[i] = 1 - 1e-5

    # Just testing if the correct values are set.
    model_name_to_val_dict2 = {}
    for init_vals in graph_def.initializer:
        # TODO: Remove float_data.
        model_name_to_val_dict2[init_vals.name] = init_vals.float_data
    for node in graph_def.node:
        node.name = node.output[0]
        if (node.op_type == 'BatchNormalization'):
            mean = model_name_to_val_dict[node.input[3]]
            for val in mean:
                assert(val == 0)

if __name__ == "__main__":
    main()
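The batch-norm folding above preserves the layer's output: with gamma' = gamma/sqrt(var + eps) and beta' = beta - gamma'*mean, and with mean and var rewritten to 0 and 1 - eps (so the divisor becomes 1), the normalized expression collapses to gamma'*x + beta'. A quick numeric check of that identity (all values invented):

```python
import math

eps = 1e-5
gamma, beta, mean, var = 1.5, 0.2, 0.7, 4.0
x = 3.0

# Original batch normalization.
original = gamma * (x - mean) / math.sqrt(var + eps) + beta

# Folded parameters, as computed in preprocess_batch_normalization.
rsigma = 1 / math.sqrt(var + eps)
gamma_f = gamma * rsigma
beta_f = beta - gamma_f * mean
# With mean = 0 and var = 1 - eps, sqrt(var + eps) = 1, so BN reduces to:
folded = gamma_f * x + beta_f

print(abs(original - folded) < 1e-9)  # True
```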
@ -0,0 +1,145 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT license.

import os, sys

# Add the SeeDot directory to the path.
sys.path.append(os.path.join(os.path.dirname(__file__), '..', 'SeeDot'))

# For this warning: https://stackoverflow.com/questions/47068709/your-cpu-supports-instructions-that-this-tensorflow-binary-was-not-compiled-to-u
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'

import _pickle as pickle
import onnx
import onnx.shape_inference
from onnx.helper import make_tensor_value_info
from onnx import TensorProto

import seedot.compiler.ast.ast as AST
from seedot.compiler.ast.printAST import PrintAST
from seedot.compiler.ast.mtdAST import MtdAST
from seedot.compiler.ONNX.ONNXNodesAST import ONNXNodesAST

import numpy
import seedot.compiler.ONNX.common as common

import numpy as np
np.set_printoptions(threshold=np.inf)

DEBUG = False
out_var_prefix = "J"

def process_input_variables(program, innermost_let_ast_node, node_name_to_out_var_dict, out_var_count, mtdAST, graph_def, value_info):
    node = graph_def.input[0]
    curAst = ONNXNodesAST.Input(node, value_info, node_name_to_out_var_dict)
    mtdForCurAST = {AST.ASTNode.mtdKeyTFOpName: 'Input',
                    AST.ASTNode.mtdKeyTFNodeName: node.name}
    cur_out_var_ast_node = AST.ID(node.name)

    if program:
        assert(type(innermost_let_ast_node) is AST.Let)
        newNode = AST.Let(node.name, curAst, cur_out_var_ast_node)
        # mtdAST.visit(newNode, mtdForCurAST)
        # Update the innermost Let AST node and the expression of the previous Let node.
        innermost_let_ast_node.expr = newNode
        innermost_let_ast_node = newNode
    else:
        innermost_let_ast_node = AST.Let(node.name, curAst, cur_out_var_ast_node)
        # mtdAST.visit(innermost_let_ast_node, mtdForCurAST)
        innermost_let_ast_node.depth = 0
        program = innermost_let_ast_node

    node_name_to_out_var_dict[node.name] = node.name

    for node in graph_def.initializer:
        if (DEBUG):
            print("Node information")
            print(node)

        curAst = ONNXNodesAST.Input(node, value_info, node_name_to_out_var_dict, node)
        mtdForCurAST = {AST.ASTNode.mtdKeyTFOpName: 'Input',
                        AST.ASTNode.mtdKeyTFNodeName: node.name}
        if (curAst is None):
            continue

        cur_out_var_ast_node = AST.ID(node.name)

        if program:
            assert(type(innermost_let_ast_node) is AST.Let)
            newNode = AST.Let(node.name, curAst, cur_out_var_ast_node)
            # mtdAST.visit(newNode, mtdForCurAST)
            # Update the innermost Let AST node and the expression of the previous Let node.
            innermost_let_ast_node.expr = newNode
            innermost_let_ast_node = newNode
        else:
            innermost_let_ast_node = AST.Let(node.name, curAst, cur_out_var_ast_node)
            # mtdAST.visit(innermost_let_ast_node, mtdForCurAST)
            innermost_let_ast_node.depth = 0
            program = innermost_let_ast_node

        node_name_to_out_var_dict[node.name] = node.name
    return (program, innermost_let_ast_node, out_var_count)

def process_onnx_nodes(innermost_let_ast_node, node_name_to_out_var_dict, out_var_count, mtdAST, graph_def, value_info):
    for node in graph_def.node:
        if (DEBUG):
            print("Node information")
            print(node)

        print("Processing " + node.op_type + "\n")

        # Dispatch to the ONNXNodesAST handler named after the ONNX op type.
        func = getattr(ONNXNodesAST, node.op_type)
        (innermost_let_ast_node, out_var_count) = func(node, value_info, node_name_to_out_var_dict, innermost_let_ast_node, out_var_count, mtdAST)

        assert(type(innermost_let_ast_node) is AST.Let)

def get_seedot_ast(file_path):
    sys.setrecursionlimit(10000)
    print(os.getcwd())

    # Load the model and extract the graph.
    model = onnx.load(file_path)
    graph_def = model.graph

    # Before shape inference, model.graph.value_info should already have the shapes
    # of all the variables and constants, so append the input and output shapes.
    model.graph.value_info.append(make_tensor_value_info(model.graph.input[0].name, TensorProto.FLOAT, common.proto_val_to_dimension_tuple(model.graph.input[0])))
    model.graph.value_info.append(make_tensor_value_info(model.graph.output[0].name, TensorProto.FLOAT, common.proto_val_to_dimension_tuple(model.graph.output[0])))

    for init_vals in model.graph.initializer:
        model.graph.value_info.append(make_tensor_value_info(init_vals.name, TensorProto.FLOAT, tuple(init_vals.dims)))

    if (DEBUG):
        print("Shape inference *****************")
        print(model.graph.value_info)

    inferred_model = onnx.shape_inference.infer_shapes(model)

    if (DEBUG):
        print("Printing shape ******************")
        print(inferred_model.graph.value_info)
        print("Done ******************")

    # value_info: dictionary of name -> (type, dimension tuple)
    value_info = {}
    for val in inferred_model.graph.value_info:
        value_info[val.name] = (val.type.tensor_type.elem_type, common.proto_val_to_dimension_tuple(val))

    # Iterate through the ONNX graph nodes and translate them to SeeDot AST nodes.
    program = None
    innermost_let_ast_node = None
    node_name_to_out_var_dict = {}
    out_var_count = 0
    mtdAST = MtdAST()

    (program, innermost_let_ast_node, out_var_count) = process_input_variables(program, innermost_let_ast_node, node_name_to_out_var_dict, out_var_count, mtdAST, graph_def, value_info)

    process_onnx_nodes(innermost_let_ast_node, node_name_to_out_var_dict, out_var_count, mtdAST, graph_def, value_info)

    PrintAST().visit(program)

    common.write_debug_info(node_name_to_out_var_dict)
    return program
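The translation above threads a growing chain of Let bindings: each new binding becomes the `expr` of the current innermost Let, and the first binding becomes the program root. With a hypothetical minimal Let class (the real one lives in `seedot.compiler.ast.ast`), the pattern is:

```python
class Let:
    # Hypothetical stand-in for AST.Let: binds `name` to `decl`
    # inside the body expression `expr`.
    def __init__(self, name, decl, expr):
        self.name = name
        self.decl = decl
        self.expr = expr

program = None
innermost = None
for name in ["X", "W", "J1"]:
    new_node = Let(name, "decl_" + name, name)
    if program is None:
        # The first binding becomes the program root.
        program = innermost = new_node
    else:
        # Splice the new binding in as the body of the current innermost Let.
        innermost.expr = new_node
        innermost = new_node

# The resulting chain is X -> W -> J1.
print(program.name, program.expr.name, program.expr.expr.name)  # X W J1
```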
@ -0,0 +1,667 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT license.

import sys
import enum

# Util functions.


def errIfTokensNotMinLen(tokens, minlen, lineNum, entity):
    errCond = (len(tokens) < minlen)
    if (errCond):
        print("Less than expected number of tokens found while parsing",
              entity, "at line =", lineNum, file=sys.stderr)
    return errCond


class DataTypeEnum(enum.Enum):
    DT_INVALID = 0
    DT_FLOAT = 1
    DT_BOOL = 2
    DT_INT32 = 3
    DT_INT64 = 4

    def Parse(s):
        if (s == "DT_FLOAT"):
            return DataTypeEnum.DT_FLOAT
        elif (s == "DT_BOOL"):
            return DataTypeEnum.DT_BOOL
        elif (s == "DT_INT32"):
            return DataTypeEnum.DT_INT32
        elif (s == "DT_INT64"):
            return DataTypeEnum.DT_INT64
        else:
            return DataTypeEnum.DT_INVALID

    # TODO: The sizes given below are for C++. For Python the sizes are different.
    def Size(dt):
        if (dt == DataTypeEnum.DT_INVALID):
            return 0
        elif (dt == DataTypeEnum.DT_FLOAT):
            return 4
        elif (dt == DataTypeEnum.DT_BOOL):
            return 1
        elif (dt == DataTypeEnum.DT_INT32):
            return 4
        elif (dt == DataTypeEnum.DT_INT64):
            return 8
        else:
            raise Exception('Invalid DataTypeEnum value.')


class Shape:

    def __init__(self, dimList=None):
        if (dimList is None):
            self.__dimList = []
        else:
            self.__dimList = dimList
        self.__unknownRank = False

    def getNumElements(self):
        if self.__dimList:
            # List is non-empty.
            ans = 1
            for curDim in self.__dimList:
                ans *= curDim
            return ans
        else:
            # List is empty.
            return 0

    def getRank(self):
        return len(self.__dimList)

    def getDimRef(self):
        return self.__dimList

    def shapeUnknown(self):
        for curDim in self.__dimList:
            if (curDim == -1):
                return True
        return False

    def equalButOneDim(self, other, ignore_dim):
        if (self.__unknownRank != other.__unknownRank):
            return False
        if (self.__unknownRank and other.__unknownRank):
            return True
        if (len(self.__dimList) != len(other.__dimList)):
            return False
        for i, curDim in enumerate(self.__dimList):
            if (i != ignore_dim and curDim != other.__dimList[i]):
                return False
        return True

    def __eq__(self, other):
        if (self.__unknownRank != other.__unknownRank):
            return False
        if (self.__unknownRank and other.__unknownRank):
            return True
        if (len(self.__dimList) != len(other.__dimList)):
            return False
        for i, dimVal in enumerate(self.__dimList):
            if (other.__dimList[i] != dimVal):
                return False
        return True

    def __getitem__(self, index):
        return self.__dimList[index]

    def readFromFilePointer(self, fileP, cnt):
        line = fileP.readline()
        cnt += 1
        while line:
            tokens = line.split()
            if (errIfTokensNotMinLen(tokens, 1, cnt, "Shape")):
                return (False, cnt)
            curToken = tokens[0]
            if (curToken == "}"):
                return (True, cnt)
            elif (curToken == "dim"):
                nline = fileP.readline()
                cnt += 1
                nlineTokens = nline.split()
                if (nlineTokens[0] == "}"):
                    return (True, cnt)
                elif (len(nlineTokens) != 2 or nlineTokens[0] != "size:"):
                    print("Error while parsing dim in shape at line =",
                          cnt, file=sys.stderr)
                    return (False, cnt)

                nnline = fileP.readline()
                cnt += 1
                nnlineTokens = nnline.split()
                if (len(nnlineTokens) != 1 or nnlineTokens[0] != "}"):
                    print("Error while parsing dim in shape at line =",
                          cnt, file=sys.stderr)
                    return (False, cnt)

                self.__dimList.append(int(nlineTokens[1]))
            elif (curToken == "unknown_rank:"):
                if (errIfTokensNotMinLen(tokens, 2, cnt, "Shape")):
                    return (False, cnt)
                # NOTE: bool(tokens[1]) is True for any non-empty string
                # (including "false"), so compare against the literal instead.
                self.__unknownRank = (tokens[1] == "true")
            else:
                print("Unknown token found while parsing shape at line =",
                      cnt, ", token =", curToken, file=sys.stderr)
                return (False, cnt)
            line = fileP.readline()
            cnt += 1
        return (False, cnt)

    def print(self):
        if (self.__unknownRank):
            print("Unknown rank")
        else:
            print("Size:", ",".join(list(map(str, self.__dimList))))


class Tensor:
    # TODO : The operator Shape() from the C++ implementation is not implemented here.

    def __init__(self):
        # In the input, either the tensor content is provided as a binary string,
        # or an int/float/bool value is provided together with the tensor shape.
        # The corresponding C++ code converts the int/float/bool array into a byte array.
        # Right now there is no need to do this in the Python implementation,
        # so either __valArr or __tensorBytes will be non-null.
        # TODO: If the need arises, change everything to byte arrays.
        self.__totalSize = None
        self.__dtype = None
        self.__tensorShape = None
        self.__tensorContentInput = None
        self.__tensorBytes = None
        self.__valInput = None
        self.__valArr = None

    def getShapeRef(self):
        return self.__tensorShape

    def __convToBytes(self):
        numElements = self.__tensorShape.getNumElements()

        # TODO: The totalSize calculated below seems inaccurate: an empty list in
        # Python is itself 64 bytes, so the actual size on a Python system will be
        # much larger.
        self.__totalSize = DataTypeEnum.Size(self.__dtype) * numElements

        if ((self.__dtype in (DataTypeEnum.DT_BOOL, DataTypeEnum.DT_FLOAT,
                              DataTypeEnum.DT_INT32, DataTypeEnum.DT_INT64))
                and (self.__valInput is not None)):
            self.__valArr = [self.__valInput] * numElements
        else:
            # By virtue of how this function is called from below.
            assert(self.__tensorContentInput)
            # Parse the tensor content and fill the tensor bytes.
            self.__tensorBytes = bytearray(self.__totalSize)

            byteArrIdx = 0
            tsCtnIdx = 0
            # Skip everything up to and including the opening quote.
            while (self.__tensorContentInput[tsCtnIdx] != '\"'):
                tsCtnIdx += 1
            tsCtnIdx += 1
            while (tsCtnIdx + 1 < len(self.__tensorContentInput)):
                if (self.__tensorContentInput[tsCtnIdx] != '\\'):
                    self.__tensorBytes[byteArrIdx] = ord(
                        self.__tensorContentInput[tsCtnIdx])
                else:
                    if (self.__tensorContentInput[tsCtnIdx + 1] == 'n'):
                        self.__tensorBytes[byteArrIdx] = 10
                        tsCtnIdx += 1
                    else:
                        # Three-digit octal escape, e.g. "\\001".
                        self.__tensorBytes[byteArrIdx] = (
                            (ord(self.__tensorContentInput[tsCtnIdx + 1]) - ord('0')) * 64
                            + (ord(self.__tensorContentInput[tsCtnIdx + 2]) - ord('0')) * 8
                            + (ord(self.__tensorContentInput[tsCtnIdx + 3]) - ord('0')))
                        tsCtnIdx += 3
                byteArrIdx += 1
                tsCtnIdx += 1
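The escape-decoding loop in `__convToBytes` can be illustrated in isolation. The helper below is a hypothetical standalone sketch (not part of the compiler) of the same idea: bytes in the quoted dump string appear either as literal characters, as `\n`, or as three-digit octal escapes.

```python
def decode_escaped_bytes(quoted: str) -> bytes:
    # `quoted` includes the surrounding double quotes, e.g. "\001\002A".
    s = quoted[quoted.index('"') + 1:quoted.rindex('"')]
    out = bytearray()
    i = 0
    while i < len(s):
        if s[i] != '\\':
            # Literal character: its code point is the byte value.
            out.append(ord(s[i]))
            i += 1
        elif s[i + 1] == 'n':
            # Newline escape.
            out.append(10)
            i += 2
        else:
            # Three-digit octal escape, e.g. \001.
            out.append(int(s[i + 1:i + 4], 8))
            i += 4
    return bytes(out)

print(decode_escaped_bytes('"\\001\\002A"'))  # b'\x01\x02A'
```

Unlike the method above, this sketch locates the closing quote with `rindex` instead of stopping one character early, but the byte-level decoding is the same.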
    def readFromFilePointer(self, fileP, cnt):
        line = fileP.readline()
        cnt += 1
        while line:
            tokens = line.split()
            if (errIfTokensNotMinLen(tokens, 1, cnt, "Tensor")):
                return (False, cnt)
            curToken = tokens[0]
            if (curToken == "}"):
                return (True, cnt)
            elif (curToken == "tensor_shape"):
                sh = Shape()
                (noParseErr, cnt) = sh.readFromFilePointer(fileP, cnt)
                if not (noParseErr):
                    print("Error in reading shape while parsing tensor at line =",
                          cnt, file=sys.stderr)
                    return (False, cnt)
                self.__tensorShape = sh
                if (len(self.__tensorShape.getDimRef()) == 0):
                    self.__totalSize = 0
            elif (curToken == "dtype:"):
                if (errIfTokensNotMinLen(tokens, 2, cnt, "Tensor")):
                    return (False, cnt)
                dtype = DataTypeEnum.Parse(tokens[1])
                if (dtype == DataTypeEnum.DT_INVALID):
                    print("Unknown dtype found while parsing Tensor at line =",
                          cnt, file=sys.stderr)
                    return (False, cnt)
                else:
                    self.__dtype = dtype
            elif (curToken == "tensor_content:"):
                if (errIfTokensNotMinLen(tokens, 2, cnt, "Tensor")):
                    return (False, cnt)
                self.__tensorContentInput = tokens[1]
                self.__convToBytes()
            elif (curToken == "float_val:"):
                if (errIfTokensNotMinLen(tokens, 2, cnt, "Tensor")):
                    return (False, cnt)
                self.__valInput = float(tokens[1])
                self.__convToBytes()
            elif (curToken == "bool_val:"):
                if (errIfTokensNotMinLen(tokens, 2, cnt, "Tensor")):
                    return (False, cnt)
                # NOTE: bool(tokens[1]) is True for any non-empty string
                # (including "false"), so compare against the literal instead.
                self.__valInput = (tokens[1] == "true")
                self.__convToBytes()
            elif (curToken == "int_val:"):
                if (errIfTokensNotMinLen(tokens, 2, cnt, "Tensor")):
                    return (False, cnt)
                self.__valInput = int(tokens[1])
                self.__convToBytes()
            else:
                print("Unknown token found while parsing Tensor at line =",
                      cnt, ", token =", curToken, file=sys.stderr)
                return (False, cnt)
            line = fileP.readline()
            cnt += 1
        return (False, cnt)

    def print(self):
        print("DType:", self.__dtype)
        print("Shape: ", end="")
        self.__tensorShape.print()
        print("Content:", self.__tensorContentInput)
        print("ShapeRank:", self.__tensorShape.getRank())
        print("TotalSizeBytes:", self.__totalSize)
        if (self.__tensorBytes):
            print("ActualContentBytes:", self.__tensorBytes)
        else:
            print("ValArr:", self.__valArr)

    def getConstantVal(self):
        return self.__valInput

    def getContentAsValArr(self):
        # Tries to return an array of values (even when tensorContent is given as
        # an array of bytes).
        if not self.__valArr:
            # Convert tensorBytes into an array of values.
            numOfBytesPerVal = None
            if self.__dtype == DataTypeEnum.DT_INT32:
                numOfBytesPerVal = 4
            else:
                # Right now the CNN TensorFlow benchmarks only hit the int32 case
                # when tensor contents are given as bytes. If this case comes up
                # for float/bool in the future, handle it here accordingly.
                # Also, from empirical observation, the byte order is little-endian
                # and ints are signed values; figure out the others when the time comes.
                print(self.__dtype)
                assert False
            it = 0
            returnArr = []
            while (it <= len(self.__tensorBytes) - 1):
                curInt = int.from_bytes(
                    self.__tensorBytes[it:it + numOfBytesPerVal], byteorder='little', signed=True)
                returnArr.append(curInt)
                it += numOfBytesPerVal
            self.__valArr = returnArr
        return self.__valArr
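The byte-to-int conversion in `getContentAsValArr` boils down to chunking a byte string and calling `int.from_bytes` on each 4-byte slice. A minimal standalone sketch, which also checks the result against the equivalent one-shot `struct.unpack` call:

```python
import struct

def bytes_to_int32s(data: bytes) -> list:
    # Interpret `data` as consecutive little-endian signed 32-bit integers.
    assert len(data) % 4 == 0
    vals = [int.from_bytes(data[i:i + 4], byteorder='little', signed=True)
            for i in range(0, len(data), 4)]
    # struct.unpack produces the same values in a single call.
    assert vals == list(struct.unpack('<%di' % (len(data) // 4), data))
    return vals

print(bytes_to_int32s(b'\x01\x00\x00\x00\xff\xff\xff\xff'))  # [1, -1]
```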
class MultiValue:

    def __init__(self):
        self.__valStrLi = []
        self.__valIntLi = []
        self.__valFloatLi = []
        self.__valBoolLi = []

    def readFromFilePointer(self, fileP, cnt):
        line = fileP.readline()
        cnt += 1
        while line:
            tokens = line.split()
            if (errIfTokensNotMinLen(tokens, 1, cnt, "Multivalue")):
                return (False, cnt)
            curToken = tokens[0]
            if (curToken == "}"):
                return (True, cnt)
            elif (curToken == "s:"):
                if (errIfTokensNotMinLen(tokens, 2, cnt, "Multivalue")):
                    return (False, cnt)
                self.__valStrLi.append(tokens[1])
            elif (curToken == "f:"):
                if (errIfTokensNotMinLen(tokens, 2, cnt, "Multivalue")):
                    return (False, cnt)
                self.__valFloatLi.append(float(tokens[1]))
            elif (curToken == "i:"):
                if (errIfTokensNotMinLen(tokens, 2, cnt, "Multivalue")):
                    return (False, cnt)
                self.__valIntLi.append(int(tokens[1]))
            elif (curToken == "b:"):
                if (errIfTokensNotMinLen(tokens, 2, cnt, "Multivalue")):
                    return (False, cnt)
                # NOTE: bool(tokens[1]) is True for any non-empty string
                # (including "false"), so compare against the literal instead.
                self.__valBoolLi.append(tokens[1] == "true")
            else:
                print("Unknown token found while parsing Multivalue at line =",
                      cnt, ", token =", curToken, file=sys.stderr)
                return (False, cnt)
            line = fileP.readline()
            cnt += 1
        return (False, cnt)

    def print(self):
        print("ss:", ",".join(self.__valStrLi))
        print("is:", ",".join(list(map(str, self.__valIntLi))))
        print("fs:", ",".join(list(map(str, self.__valFloatLi))))
        print("bs:", ",".join(list(map(str, self.__valBoolLi))))

    def getILi(self):
        if len(self.__valIntLi) > 0:
            assert(all((type(x) is int) for x in self.__valIntLi))
        return self.__valIntLi


class Value:

    def __init__(self):
        self.__val = None

    def readFromFilePointer(self, fileP, cnt):
        line = fileP.readline()
        cnt += 1
        while line:
            tokens = line.split()
            if (errIfTokensNotMinLen(tokens, 1, cnt, "Value")):
                return (False, cnt)
            curToken = tokens[0]
            if (curToken == "}"):
                return (True, cnt)
            elif (curToken == "s:"):
                if (errIfTokensNotMinLen(tokens, 2, cnt, "Value")):
                    return (False, cnt)
                self.__val = tokens[1]
            elif (curToken == "i:"):
                if (errIfTokensNotMinLen(tokens, 2, cnt, "Value")):
                    return (False, cnt)
                self.__val = int(tokens[1])
            elif (curToken == "f:"):
                if (errIfTokensNotMinLen(tokens, 2, cnt, "Value")):
                    return (False, cnt)
                self.__val = float(tokens[1])
            elif (curToken == "b:"):
                if (errIfTokensNotMinLen(tokens, 2, cnt, "Value")):
                    return (False, cnt)
                self.__val = bool(tokens[1] == "true")
            elif (curToken == "type:"):
                if (errIfTokensNotMinLen(tokens, 2, cnt, "Value")):
                    return (False, cnt)
                dtype = DataTypeEnum.Parse(tokens[1])
                if (dtype == DataTypeEnum.DT_INVALID):
                    print("Invalid dtype found while parsing Value at line =",
                          cnt, file=sys.stderr)
                    return (False, cnt)
                else:
                    self.__val = dtype
            elif (curToken == "shape"):
                sh = Shape()
                (noParseError, cnt) = sh.readFromFilePointer(fileP, cnt)
                if (not noParseError):
                    print("Error in parsing Value at line =",
                          cnt, file=sys.stderr)
                    return (False, cnt)
                self.__val = sh
            elif (curToken == "list"):
                mv = MultiValue()
                (noParseError, cnt) = mv.readFromFilePointer(fileP, cnt)
                if (not noParseError):
                    print("Error in parsing Value at line =",
                          cnt, file=sys.stderr)
                    return (False, cnt)
                self.__val = mv
            elif (curToken == "tensor"):
                ts = Tensor()
                (noParseError, cnt) = ts.readFromFilePointer(fileP, cnt)
                if (not noParseError):
                    print("Error in parsing Value at line =",
                          cnt, file=sys.stderr)
                    return (False, cnt)
                self.__val = ts
            else:
                print("Unknown token while parsing Value at line =",
                      cnt, ", token =", curToken, file=sys.stderr)
                return (False, cnt)
            line = fileP.readline()
            cnt += 1
        return (False, cnt)

    def print(self):
        if (type(self.__val) is str):
            print("s:", self.__val)
        elif (type(self.__val) is int):
            print("i:", self.__val)
        elif (type(self.__val) is float):
            print("f:", self.__val)
        elif (type(self.__val) is bool):
            print("b:", self.__val)
        elif (type(self.__val) is DataTypeEnum):
            print("Type:", self.__val)
        elif (type(self.__val) is Shape):
            print("Shape: ", end="")
            self.__val.print()
        elif (type(self.__val) is Tensor):
            print("Tensor: ", end="")
            self.__val.print()
        elif (type(self.__val) is MultiValue):
            print("List: ", end="")
            self.__val.print()
        else:
            assert(False)

    def getS(self):
        assert(type(self.__val) is str)
        return self.__val

    def getI(self):
        assert(type(self.__val) is int)
        return self.__val

    def getF(self):
        assert(type(self.__val) is float)
        return self.__val

    def getB(self):
        assert(type(self.__val) is bool)
        return self.__val

    def getDataType(self):
        assert(type(self.__val) is DataTypeEnum)
        return self.__val

    def getShape(self):
        assert(type(self.__val) is Shape)
        return self.__val

    def getTensor(self):
        assert(type(self.__val) is Tensor)
        return self.__val

    def getList(self):
        assert(type(self.__val) is MultiValue)
        return self.__val


class Node:
    def __init__(self):
        self.__name = ""  # Name of the node.
        self.__op = ""  # Name of the operation carried out by the node.
        self.__inputs = []  # List of all inputs to the current node.
        # Map of (attrName, Value) of all attributes for the current node.
        self.__attr = {}

    def getName(self):
        return self.__name

    def getOp(self):
        return self.__op

    def getInputsRef(self):
        return self.__inputs

    def getAttrMapRef(self):
        return self.__attr

    def readAttrFromFilePointer(self, fileP, cnt):
        line = fileP.readline()
        cnt += 1
        keyStr = None
        while line:
            tokens = line.split()
            if (errIfTokensNotMinLen(tokens, 1, cnt, "attr from node")):
                return (False, cnt)
            curToken = tokens[0]
            if (curToken == "}"):
                return (True, cnt)
            elif (curToken == "key:"):
                if (errIfTokensNotMinLen(tokens, 2, cnt, "attr from node")):
                    return (False, cnt)
                if (keyStr):
                    # keyStr is already non-None; there is probably an error.
                    print("Too many keys found while parsing attr for node at line =",
                          cnt, file=sys.stderr)
                    return (False, cnt)
                keyStr = tokens[1]
            elif (curToken == "value"):
                curVal = Value()
                (noParseError, cnt) = curVal.readFromFilePointer(fileP, cnt)
                if not (noParseError):
                    print("Error while parsing value of attr for node at line =",
                          cnt, file=sys.stderr)
                    return (False, cnt)
                if not (keyStr):
                    print("Value found, but no key found for attr in node at line =",
                          cnt, file=sys.stderr)
                    return (False, cnt)
                self.__attr[keyStr] = curVal
            else:
                print("Unrecognized token found while parsing attribute for node at line =",
                      cnt, ", token =", curToken, file=sys.stderr)
                return (False, cnt)
            line = fileP.readline()
            cnt += 1
        return (False, cnt)

    def readFromFilePointer(self, fileP, cnt):
        line = fileP.readline()
        cnt += 1
        while line:
            tokens = line.split()
            if (errIfTokensNotMinLen(tokens, 1, cnt, "node")):
                return (False, cnt)
            curToken = tokens[0]
            if (curToken == "}"):
                return (True, cnt)
            elif (curToken == "name:"):
                if (errIfTokensNotMinLen(tokens, 2, cnt, "node")):
                    return (False, cnt)
                self.__name = tokens[1]
            elif (curToken == "op:"):
                if (errIfTokensNotMinLen(tokens, 2, cnt, "node")):
                    return (False, cnt)
                self.__op = tokens[1]
            elif (curToken == "input:"):
                if (errIfTokensNotMinLen(tokens, 2, cnt, "node")):
                    return (False, cnt)
                self.__inputs.append(tokens[1])
            elif (curToken == "attr"):
                (noParseError, cnt) = self.readAttrFromFilePointer(fileP, cnt)
                if (not noParseError):
                    print("Error parsing node data at line =",
                          cnt, file=sys.stderr)
                    return (False, cnt)
            else:
                print("Unrecognized token found while parsing node data at line =",
                      cnt, ", token =", curToken, file=sys.stderr)
                return (False, cnt)
            line = fileP.readline()
            cnt += 1
        return (False, cnt)

    def print(self):
        print("NODE::")
        print(self.__name, ",", self.__op)
        print("Inputs:")
        for inp in self.__inputs:
            print(inp)
        for attrKey, attrVal in self.__attr.items():
            print("Attr:", attrKey)
            attrVal.print()


class Graph:
    def __init__(self):
        self.__Nodes = {}  # Map of (op, Node).
        # Sequential list of nodes in the order in which they are specified in graph_def.
        self.__NodesLi = []

    def getAllNodesRef(self):
        return self.__NodesLi

    def setNodesList(self, nodesLi):
        self.__NodesLi = nodesLi

    def readFromFilePointer(self, fileP):
        line = fileP.readline()
        cnt = 1
        while line:
            tokens = line.split()
            if (errIfTokensNotMinLen(tokens, 1, cnt, "graph")):
                return False
            curToken = tokens[0]
            if (curToken == "node"):
                curNode = Node()
                (noParseError, cnt) = curNode.readFromFilePointer(fileP, cnt)
                if (noParseError):
                    self.__Nodes[curNode.getOp()] = curNode
                    self.__NodesLi.append(curNode)
                else:
                    print("Error parsing graph dump for node at line =",
                          cnt, file=sys.stderr)
                    return False
            elif (curToken == "}"):
                # Current node ended.
                pass
            elif (curToken == "versions"):
                print("Versions node found. Ignoring remainder of the graph. Line =",
                      cnt, file=sys.stderr)
                return True
            else:
                print("Unrecognized token in graph dump at line =",
                      cnt, ", token =", curToken, file=sys.stderr)
                return False
            line = fileP.readline()
            cnt += 1
        print("Graph parsing successful.")
        return True

    def __getitem__(self, opName):
        return self.__Nodes[opName]

    def print(self):
        for _, curNode in self.__Nodes.items():
            curNode.print()
@@ -0,0 +1,179 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT license.

import os
import pickle
import sys

import seedot.compiler.ast.ast as AST
from seedot.compiler.ast.printAST import PrintAST
from seedot.compiler.ast.mtdAST import MtdAST

import seedot.compiler.TF.Graph as Graph
from seedot.compiler.TF.TFNodesAST import TFNodesAST


def checkTFNodeNameForEq(curNodeOp: str, givenOp: str):
    return (curNodeOp == "\"" + givenOp + "\"")


def generateASTForNode(graph, curNode, dictNodeNameToOutVarStr, extraNodeInfoDict):
    curNodeOp = curNode.getOp()
    # Strip the quotes at the beginning and the end.
    func = getattr(TFNodesAST, curNodeOp[1:-1])
    (assignedVarAST, curAST) = func(graph, curNode,
                                    dictNodeNameToOutVarStr, extraNodeInfoDict)
    return (assignedVarAST, curAST)


# Takes the graph data structure and outputs the corresponding SeeDot IR.
def generateIRCode(graph, extraInfoDict):
    program = None
    innerMostLetASTNode = None
    dictNodeNameToOutVarStr = {}
    outVarCt = 0
    outVarPrefix = "J"
    mtdAST = MtdAST()
    for curNode in graph.getAllNodesRef():
        for curInp in curNode.getInputsRef():
            # Consequence of topological sorting of the TF graph.
            assert(curInp in dictNodeNameToOutVarStr)
        (assignedVarAST, curAst) = generateASTForNode(
            graph, curNode, dictNodeNameToOutVarStr, extraInfoDict)

        mtdForCurAST = {AST.ASTNode.mtdKeyTFOpName: curNode.getOp()[1:-1],
                        AST.ASTNode.mtdKeyTFNodeName: curNode.getName()[1:-1]}

        if (curAst is None):
            dictNodeNameToOutVarStr[curNode.getName()] = None
            continue
        curOutVarStr = outVarPrefix + str(outVarCt)
        curOutVarAstNode = (
            assignedVarAST if assignedVarAST else AST.ID(curOutVarStr))
        if program:
            assert(type(innerMostLetASTNode) is AST.Let)
            newNode = AST.Let(curOutVarAstNode, curAst, curOutVarAstNode)
            mtdAST.visit(newNode, mtdForCurAST)
            innerMostLetASTNode.expr = newNode
            innerMostLetASTNode = newNode
        else:
            innerMostLetASTNode = AST.Let(
                AST.ID(curOutVarStr), curAst, curOutVarAstNode)
            mtdAST.visit(innerMostLetASTNode, mtdForCurAST)
            innerMostLetASTNode.depth = 0
            program = innerMostLetASTNode
        dictNodeNameToOutVarStr[curNode.getName()] = curOutVarStr
        outVarCt += 1
    return (program, dictNodeNameToOutVarStr)
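`generateIRCode` threads each node's expression into a growing chain of let-bindings by rewiring the innermost let. The toy sketch below shows just that rewiring pattern, using a minimal stand-in `Let` class (not SeeDot's actual AST):

```python
class Let:
    # Minimal stand-in: binds `name` to `expr` inside `body`.
    def __init__(self, name, expr, body):
        self.name, self.expr, self.body = name, expr, body

def build_let_chain(bindings, result):
    # bindings: list of (name, expr) pairs in topological order.
    program = innermost = None
    for name, expr in bindings:
        new = Let(name, expr, result)
        if program is None:
            program = innermost = new
        else:
            # Splice the new binding in as the body of the previous innermost let.
            innermost.body = new
            innermost = new
    return program

p = build_let_chain([("J0", "input"), ("J1", "J0 * W")], "J1")
print(p.name, p.body.name)  # J0 J1
```

The names `J0`, `J1` mirror the `outVarPrefix = "J"` convention above; the rest of the class and function names here are illustrative only.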
def countUniqueOps(graph):
    allOps = []
    for curNode in graph.getAllNodesRef():
        if (curNode.getOp() not in allOps):
            allOps.append(curNode.getOp())
    print("allOps.ct =", len(allOps))
    gradientDesOps = []
    for curNode in graph.getAllNodesRef():
        if (curNode.getName().startswith("\"gradient_descent_optimizer")) and (curNode.getOp() not in gradientDesOps):
            gradientDesOps.append(curNode.getOp())
    print("allOps ct for gradient descent optimizer =", len(gradientDesOps))
    return allOps


def readSizeInfo(fileName):
    allLines = None
    with open(fileName) as f:
        allLines = f.readlines()
    sizeInfo = {}
    for line in allLines:
        tokens = line.split()
        # assert(len(tokens) > 1)  # Nodes with no size info are not output right now.
        nodeName = tokens[0]
        tokens = tokens[1:]
        nodeOPSize = []
        if (not tokens):
            nodeOPSize = [1]
        else:
            for curDimStr in tokens:
                if (curDimStr == ''):
                    continue
                nodeOPSize.append(int(curDimStr))
        sizeInfo[nodeName] = nodeOPSize
    return sizeInfo


# Later in the pipeline, the placeholder nodes (which come up as cin statements)
# are excluded from the timing calculation, so output all such Placeholder nodes
# together first. This does not violate the topological ordering, because all
# such Placeholder nodes are leaf nodes in the graph.
def prefixAllPlaceHolderNodes(graph):
    allNodes = graph.getAllNodesRef()
    placeHolderNodes = []
    remNodes = []
    for curNode in allNodes:
        if (curNode.getOp() == "\"Placeholder\"" or curNode.getOp() == "\"VariableV2\""):
            # Assert this is indeed a leaf node.
            assert(len(curNode.getInputsRef()) == 0)
            placeHolderNodes.append(curNode)
        else:
            remNodes.append(curNode)
    graph.setNodesList(placeHolderNodes + remNodes)
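`prefixAllPlaceHolderNodes` is a stable partition: it moves the leaf (input) nodes to the front while preserving the relative order within each group, which is why the list stays topologically sorted. A generic sketch of the same operation:

```python
def stable_partition(items, predicate):
    # Items satisfying `predicate` come first; relative order is
    # preserved inside both groups.
    matching = [x for x in items if predicate(x)]
    rest = [x for x in items if not predicate(x)]
    return matching + rest

print(stable_partition([3, 8, 1, 6, 2], lambda x: x % 2 == 0))  # [8, 6, 2, 3, 1]
```

Moving only nodes with no inputs to the front can never place a node before one of its inputs, which is the invariant the comment above relies on.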
def main():
    sys.setrecursionlimit(5000)

    # First read the graph file.
    graphFileName = os.path.join('graphDef.txt')
    graph = Graph.Graph()
    with open(graphFileName) as file:
        graph.readFromFilePointer(file)

    # Read the size info as well.
    sizeInfoFileName = os.path.join('sizeInfo.txt')
    sizeInfo = readSizeInfo(sizeInfoFileName)

    # Place all Placeholder nodes together at the beginning.
    prefixAllPlaceHolderNodes(graph)

    # Re-format the input names of nodes.
    for curNode in graph.getAllNodesRef():
        inputsRef = curNode.getInputsRef()
        for i, curInput in enumerate(inputsRef):
            # TODO for training: the following is not correct.
            # if (curInput.endswith(':1"')):
            #     inputsRef[i] = curInput.split(':1')[0] + '"'
            if (curInput.startswith('"^')):
                # Hypothesis from empirical observation: inputs whose node name is
                # prefixed with '^' denote a control-flow dependency rather than a
                # data dependency. For the purposes of this compilation, control
                # and data dependencies are treated the same.
                inputsRef[i] = '"' + curInput.split('^')[-1]

    # Create the extra info dict.
    # Format: (sizeInfo,)
    extraInfoDict = {}
    for k, v in sizeInfo.items():
        extraInfoDict["\"" + k + "\""] = (v,)
    for curNode in graph.getAllNodesRef():
        if (curNode.getName() not in extraInfoDict):
            extraInfoDict[curNode.getName()] = (None,)

    print("Generating code from TF graph def :", graphFileName, "...")
    (program, dictNodeNameToOutVarStr) = generateIRCode(graph, extraInfoDict)

    return program


if __name__ == "__main__":
    main()
|
@ -0,0 +1,681 @@
|
|||
# Copyright (c) Microsoft Corporation. All rights reserved.
|
||||
# Licensed under the MIT license.
|
||||
|
||||
from enum import Enum, auto
|
||||
|
||||
import seedot.compiler.ast.ast as AST
|
||||
|
||||
import seedot.compiler.TF.Graph as Graph
|
||||
|
||||
# Contains code for each of the TF nodes encountered in the benchmarks.
|
||||
# For each such TF node, outputs the corresponding SeeDot AST.
|
||||
|
||||
|
||||
class TFNodesAST:
|
||||
class UninterpFuncCallNames(Enum):
|
||||
'''
|
||||
NOTE : SeeDot when compiling uninterpreted function calls, adds a new declaration for each uninterpreted function call.
|
||||
'''
|
||||
Input = auto()
|
||||
CreateCopy = auto()
|
||||
CreateIdentity = auto()
|
||||
CreateTensor = auto()
|
||||
# TODO : For assign node :: fix this after discussing with Aseem; hack right now.
|
||||
CopyTensor = auto()
|
||||
Const = auto()
|
||||
Cast = auto()
|
||||
TruncatedNormal = auto()
|
||||
RandomUniform = auto()
|
||||
Tile = auto()
|
||||
MaxPool = auto()
|
||||
Pack = auto()
|
||||
Concat = auto()
|
||||
ExpandDims = auto()
|
||||
MaxPoolGrad = auto()
|
||||
Conv2DBackpropInput = auto()
|
||||
Conv2DBackpropFilter = auto()
|
||||
AvgPool = auto()
|
||||
Pad = auto()
|
||||
Squeeze = auto()
|
||||
TempFusedBatchNorm = auto()
|
||||
|
||||
def getOperatorsIdx(token):
|
||||
# TODO : Remove usage of this.
|
||||
return AST.Operators.convSymbolToEnumValue(token)
|
||||
|
||||
def MatMul(graph: Graph.Graph, curNode: Graph.Node, dictNodeNameToOutVarStr: dict, extraNodeInfoDict: dict):
|
||||
inputsRef = curNode.getInputsRef()
|
||||
assert(len(inputsRef) == 2)
|
||||
inp1Str = dictNodeNameToOutVarStr[inputsRef[0]]
|
||||
inp2Str = dictNodeNameToOutVarStr[inputsRef[1]]
|
||||
inp1AST = AST.ID(inp1Str)
|
||||
inp2AST = AST.ID(inp2Str)
|
||||
|
||||
attrMapRef = curNode.getAttrMapRef()
|
||||
transposeABool = transposeBBool = False
|
||||
if ("\"transpose_a\"" in attrMapRef):
|
||||
transposeABool = attrMapRef["\"transpose_a\""].getB()
|
||||
if ("\"transpose_b\"" in attrMapRef):
|
||||
transposeBBool = attrMapRef["\"transpose_b\""].getB()
|
||||
if (transposeABool):
|
||||
inp1AST = AST.Transp(inp1AST)
|
||||
if (transposeBBool):
|
||||
inp2AST = AST.Transp(inp2AST)
|
||||
return (None, AST.BOp(inp1AST, TFNodesAST.getOperatorsIdx('*'), inp2AST))
|
||||
|
||||
def Placeholder(graph: Graph.Graph, curNode: Graph.Node, dictNodeNameToOutVarStr: dict, extraNodeInfoDict: dict):
    # curNodeShapeLi = curNode.getAttrMapRef()["\"shape\""].getShape().getDimRef()
    curNodeShapeLi = extraNodeInfoDict[curNode.getName()][0]
    curNodeInputType = curNode.getAttrMapRef()["\"dtype\""].getDataType()
    assert(curNodeInputType is not Graph.DataTypeEnum.DT_INVALID)
    # TODO : There has to be some way to take the range and understand the dimensions for SeeDot.
    # CHANGESRI
    # return (None, AST.Input(curNodeShapeLi, curNodeInputType.name))
    return (None, AST.Decl(curNodeShapeLi, (-0.1, 0.1)))

def Equal(graph: Graph.Graph, curNode: Graph.Node, dictNodeNameToOutVarStr: dict, extraNodeInfoDict: dict):
    inputsRef = curNode.getInputsRef()
    assert(len(inputsRef) == 2)
    return (None, AST.BOp(AST.ID(dictNodeNameToOutVarStr[inputsRef[0]]),
                          TFNodesAST.getOperatorsIdx('=='),
                          AST.ID(dictNodeNameToOutVarStr[inputsRef[1]])))

def Identity(graph: Graph.Graph, curNode: Graph.Node, dictNodeNameToOutVarStr: dict, extraNodeInfoDict: dict):
    # In SeeDot, J2 = J1 creates a new reference for J1 -- so the
    # corresponding code in SeeDot cannot simply be J2 = J1.
    # Instead, create a new tensor first and then assign the old one to it.
    inputsRef = curNode.getInputsRef()
    assert(len(inputsRef) == 1)

    curNodeDataType = curNode.getAttrMapRef()["\"T\""].getDataType()
    assert(curNodeDataType is not Graph.DataTypeEnum.DT_INVALID)

    curNodeShape = extraNodeInfoDict[curNode.getName()][0]
    retAST = AST.UninterpFuncCall(curNodeShape,
                                  TFNodesAST.UninterpFuncCallNames.CreateIdentity.name,
                                  [AST.ID(dictNodeNameToOutVarStr[inputsRef[0]])])
    return (None, retAST)

def Add(graph: Graph.Graph, curNode: Graph.Node, dictNodeNameToOutVarStr: dict, extraNodeInfoDict: dict):
    inputsRef = curNode.getInputsRef()
    assert(len(inputsRef) == 2)
    return (None, AST.BOp(AST.ID(dictNodeNameToOutVarStr[inputsRef[0]]),
                          TFNodesAST.getOperatorsIdx('+'),
                          AST.ID(dictNodeNameToOutVarStr[inputsRef[1]])))

def Mul(graph: Graph.Graph, curNode: Graph.Node, dictNodeNameToOutVarStr: dict, extraNodeInfoDict: dict):
    inputsRef = curNode.getInputsRef()
    assert(len(inputsRef) == 2)
    return (None, AST.BOp(AST.ID(dictNodeNameToOutVarStr[inputsRef[0]]),
                          TFNodesAST.getOperatorsIdx('*'),
                          AST.ID(dictNodeNameToOutVarStr[inputsRef[1]])))

def Neg(graph: Graph.Graph, curNode: Graph.Node, dictNodeNameToOutVarStr: dict, extraNodeInfoDict: dict):
    inputsRef = curNode.getInputsRef()
    assert(len(inputsRef) == 1)
    return (None, AST.UOp(TFNodesAST.getOperatorsIdx('-'),
                          AST.ID(dictNodeNameToOutVarStr[inputsRef[0]])))

def Sub(graph: Graph.Graph, curNode: Graph.Node, dictNodeNameToOutVarStr: dict, extraNodeInfoDict: dict):
    inputsRef = curNode.getInputsRef()
    assert(len(inputsRef) == 2)
    # a - b is expressed as a + (-b).
    return (None, AST.BOp(AST.ID(dictNodeNameToOutVarStr[inputsRef[0]]),
                          TFNodesAST.getOperatorsIdx('+'),
                          AST.UOp(TFNodesAST.getOperatorsIdx('-'),
                                  AST.ID(dictNodeNameToOutVarStr[inputsRef[1]]))))

def Floor(graph: Graph.Graph, curNode: Graph.Node, dictNodeNameToOutVarStr: dict, extraNodeInfoDict: dict):
    inputsRef = curNode.getInputsRef()
    assert(len(inputsRef) == 1)
    return (None, AST.Func(TFNodesAST.getOperatorsIdx('floor'), AST.ID(dictNodeNameToOutVarStr[inputsRef[0]])))

def RealDiv(graph: Graph.Graph, curNode: Graph.Node, dictNodeNameToOutVarStr: dict, extraNodeInfoDict: dict):
    inputsRef = curNode.getInputsRef()
    assert(len(inputsRef) == 2)
    return (None, AST.BOp(AST.ID(dictNodeNameToOutVarStr[inputsRef[0]]),
                          TFNodesAST.getOperatorsIdx('./'),
                          AST.ID(dictNodeNameToOutVarStr[inputsRef[1]])))

def FloorDiv(graph: Graph.Graph, curNode: Graph.Node, dictNodeNameToOutVarStr: dict, extraNodeInfoDict: dict):
    inputsRef = curNode.getInputsRef()
    assert(len(inputsRef) == 2)
    realDivAST = AST.BOp(AST.ID(dictNodeNameToOutVarStr[inputsRef[0]]),
                         TFNodesAST.getOperatorsIdx('./'),
                         AST.ID(dictNodeNameToOutVarStr[inputsRef[1]]))
    return (None, AST.Func(TFNodesAST.getOperatorsIdx('floor'), realDivAST))

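FloorDiv above is lowered as `floor(a / b)` rather than as a truncating division. A minimal plain-Python sketch of why that distinction matters (the helper name `floor_div` is illustrative, not part of this codebase):

```python
import math

# FloorDiv is expressed as floor(a / b): round toward negative infinity,
# not toward zero, which differs from truncation for negative operands.
def floor_div(a, b):
    return math.floor(a / b)

print(floor_div(7, 2), floor_div(-7, 2))  # 3 -4
```

Note that `floor_div(-7, 2)` is `-4`, matching Python's `-7 // 2`, whereas a truncating division would give `-3`.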
def VariableV2(graph: Graph.Graph, curNode: Graph.Node, dictNodeNameToOutVarStr: dict, extraNodeInfoDict: dict):
    curNodeShapeLi = curNode.getAttrMapRef()["\"shape\""].getShape().getDimRef()[:]
    curNodeInputType = curNode.getAttrMapRef()["\"dtype\""].getDataType()

    # TODO_TAB : for inference, have commented out decl and inserted input nodes.
    # TODO : Right now the dataType being passed to the node is ignored by SeeDot.
    # Fix this later.
    # return (None, AST.Decl(curNodeShapeLi, curNodeInputType.name, None, None))
    # NOTE : Since this becomes an input node right now, it is also prefixed at the
    # top in ProcessTFGraph::prefixAllPlaceHolderNodes().
    # CHANGESRI
    # return (None, AST.Input(curNodeShapeLi, curNodeInputType.name))
    # Use the same default value range as Placeholder.
    return (None, AST.Decl(curNodeShapeLi, (-0.1, 0.1)))

def Assign(graph: Graph.Graph, curNode: Graph.Node, dictNodeNameToOutVarStr: dict, extraNodeInfoDict: dict):
    inputsRef = curNode.getInputsRef()
    assert(len(inputsRef) == 2)
    curNodeShape = extraNodeInfoDict[curNode.getName()][0]

    # TODO_TAB : for inference, have commented out the copyTensor function calls.
    # TODO : Hack -- fix this later after discussing with Aseem.
    # return (None, AST.UninterpFuncCall(curNodeShape,
    #                                    TFNodesAST.UninterpFuncCallNames.CopyTensor.name,
    #                                    [AST.ID(dictNodeNameToOutVarStr[inputsRef[0]]),
    #                                     AST.ID(dictNodeNameToOutVarStr[inputsRef[1]])]))

    return (None, None)

def Const(graph: Graph.Graph, curNode: Graph.Node, dictNodeNameToOutVarStr: dict, extraNodeInfoDict: dict):
    assert(len(curNode.getInputsRef()) == 0)
    tensor = curNode.getAttrMapRef()["\"value\""].getTensor()
    curNodeDataType = curNode.getAttrMapRef()["\"dtype\""].getDataType()
    # Create a different copy so as to not change the original.
    curNodeShape = tensor.getShapeRef()[:]

    tensorConstantVal = tensor.getConstantVal()
    if tensorConstantVal is not None:
        # Use an uninterpreted call of CreateTensor to create the tensor and
        # fill it with a constant value.
        dataPassed = None
        if curNodeDataType == Graph.DataTypeEnum.DT_INT32:
            dataPassed = AST.Int(tensorConstantVal, 32)
        elif curNodeDataType == Graph.DataTypeEnum.DT_FLOAT:
            dataPassed = AST.Float(tensorConstantVal)
        else:
            assert False

        if (len(curNodeShape) == 0):
            # This is a scalar constant.
            retAST = dataPassed
        else:
            retAST = AST.UninterpFuncCall(curNodeShape,
                                          TFNodesAST.UninterpFuncCallNames.CreateTensor.name,
                                          [dataPassed],
                                          isSecret=False)
    else:
        # The tensor content is given as a byte array. Extract the value array
        # from the byte array and create the AST.
        if curNodeDataType == Graph.DataTypeEnum.DT_INT32:
            dataPassed = list(map(lambda x: AST.Int(x, 32), tensor.getContentAsValArr()[:]))
        elif curNodeDataType == Graph.DataTypeEnum.DT_FLOAT:
            dataPassed = list(map(lambda x: AST.Float(x), tensor.getContentAsValArr()[:]))
        else:
            assert False
        retAST = AST.Decl(curNodeShape, None, None, dataPassed, isSecret=False)
    return (None, retAST)

def Relu(graph: Graph.Graph, curNode: Graph.Node, dictNodeNameToOutVarStr: dict, extraNodeInfoDict: dict):
    inputsRef = curNode.getInputsRef()
    assert(len(inputsRef) == 1)
    return (None, AST.Func(TFNodesAST.getOperatorsIdx('relu'), AST.ID(dictNodeNameToOutVarStr[inputsRef[0]])))

def ApplyGradientDescent(graph: Graph.Graph, curNode: Graph.Node, dictNodeNameToOutVarStr: dict, extraNodeInfoDict: dict):
    inputsRef = curNode.getInputsRef()
    assert(len(inputsRef) == 3)
    inputTensor = AST.ID(dictNodeNameToOutVarStr[inputsRef[0]])
    learningRate = AST.ID(dictNodeNameToOutVarStr[inputsRef[1]])
    deltaTensor = AST.ID(dictNodeNameToOutVarStr[inputsRef[2]])
    return (inputTensor, AST.BOp(inputTensor,
                                 TFNodesAST.getOperatorsIdx('+'),
                                 AST.UOp(TFNodesAST.getOperatorsIdx('-'),
                                         AST.BOp(learningRate,
                                                 TFNodesAST.getOperatorsIdx('.*'),
                                                 deltaTensor))))

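ApplyGradientDescent above builds the AST for `var + (-(learningRate .* delta))`, i.e. the standard SGD update. A minimal plain-Python sketch of the arithmetic it emits (the helper name `apply_gradient_descent` is illustrative, not part of this codebase):

```python
# Elementwise SGD update: new_var[i] = var[i] - learning_rate * delta[i],
# written as var + (-(learning_rate * delta)) to mirror the AST above.
def apply_gradient_descent(var, learning_rate, delta):
    return [v + (-(learning_rate * d)) for v, d in zip(var, delta)]

print(apply_gradient_descent([1.0, 2.0], 0.5, [2.0, -4.0]))  # [0.0, 4.0]
```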
def Shape(graph: Graph.Graph, curNode: Graph.Node, dictNodeNameToOutVarStr: dict, extraNodeInfoDict: dict):
    inputsRef = curNode.getInputsRef()
    assert(len(inputsRef) == 1)
    return (None, AST.Func(TFNodesAST.getOperatorsIdx('shape'), AST.ID(dictNodeNameToOutVarStr[inputsRef[0]])))

def Cast(graph: Graph.Graph, curNode: Graph.Node, dictNodeNameToOutVarStr: dict, extraNodeInfoDict: dict):
    inputsRef = curNode.getInputsRef()
    assert(len(inputsRef) == 1)
    sourceType = curNode.getAttrMapRef()["\"SrcT\""].getDataType()
    destType = curNode.getAttrMapRef()["\"DstT\""].getDataType()
    return (None, AST.UninterpFuncCall(extraNodeInfoDict[curNode.getName()][0],
                                       TFNodesAST.UninterpFuncCallNames.Cast.name,
                                       [AST.ID(dictNodeNameToOutVarStr[inputsRef[0]]),
                                        AST.ID(sourceType.name),
                                        AST.ID(destType.name)]))

def ZerosLike(graph: Graph.Graph, curNode: Graph.Node, dictNodeNameToOutVarStr: dict, extraNodeInfoDict: dict):
    inputsRef = curNode.getInputsRef()
    assert(len(inputsRef) == 1)
    curNodeOutputType = curNode.getAttrMapRef()["\"T\""].getDataType()
    assert(curNodeOutputType is not Graph.DataTypeEnum.DT_INVALID)
    retAST = AST.UninterpFuncCall(extraNodeInfoDict[curNode.getName()][0],
                                  TFNodesAST.UninterpFuncCallNames.CreateTensor.name,
                                  [AST.Int(0)],
                                  isSecret=False)
    return (None, retAST)

def Fill(graph: Graph.Graph, curNode: Graph.Node, dictNodeNameToOutVarStr: dict, extraNodeInfoDict: dict):
    inputsRef = curNode.getInputsRef()
    assert(len(inputsRef) == 2)
    curNodeOutputShape = extraNodeInfoDict[inputsRef[0]][0]
    # inputsRef[0] denotes a shape and should have a rank of 1.
    assert(len(curNodeOutputShape) == 1)

    curNodeOutputType = curNode.getAttrMapRef()["\"T\""].getDataType()
    assert(curNodeOutputType is not Graph.DataTypeEnum.DT_INVALID)

    retAST = AST.UninterpFuncCall(extraNodeInfoDict[curNode.getName()][0],
                                  TFNodesAST.UninterpFuncCallNames.CreateTensor.name,
                                  [AST.ID(dictNodeNameToOutVarStr[inputsRef[1]])],
                                  isSecret=False)
    return (None, retAST)

def TruncatedNormal(graph: Graph.Graph, curNode: Graph.Node, dictNodeNameToOutVarStr: dict, extraNodeInfoDict: dict):
    curNodeDataType = curNode.getAttrMapRef()["\"dtype\""].getDataType()
    assert(curNodeDataType is not Graph.DataTypeEnum.DT_INVALID)
    inputsRef = curNode.getInputsRef()
    assert(len(inputsRef) == 1)
    curNodeOutputShape = extraNodeInfoDict[curNode.getName()][0]
    return (None, AST.UninterpFuncCall(curNodeOutputShape,
                                       TFNodesAST.UninterpFuncCallNames.TruncatedNormal.name,
                                       [AST.ID(curNodeDataType.name)]
                                       + list(map(lambda x: AST.Int(x), curNodeOutputShape))
                                       ))  # TODO

def RandomUniform(graph: Graph.Graph, curNode: Graph.Node, dictNodeNameToOutVarStr: dict, extraNodeInfoDict: dict):
    curNodeDataType = curNode.getAttrMapRef()["\"dtype\""].getDataType()
    assert(curNodeDataType is not Graph.DataTypeEnum.DT_INVALID)
    inputsRef = curNode.getInputsRef()
    assert(len(inputsRef) == 1)
    curNodeOutputShape = extraNodeInfoDict[curNode.getName()][0]
    return (None, AST.UninterpFuncCall(curNodeOutputShape,
                                       TFNodesAST.UninterpFuncCallNames.RandomUniform.name,
                                       [AST.ID(curNodeDataType.name)]))

def Maximum(graph: Graph.Graph, curNode: Graph.Node, dictNodeNameToOutVarStr: dict, extraNodeInfoDict: dict):
    inputsRef = curNode.getInputsRef()
    assert(len(inputsRef) == 2)
    return (None, AST.BOp(AST.ID(dictNodeNameToOutVarStr[inputsRef[0]]),
                          TFNodesAST.getOperatorsIdx('max'),
                          AST.ID(dictNodeNameToOutVarStr[inputsRef[1]])))

def Reshape(graph: Graph.Graph, curNode: Graph.Node, dictNodeNameToOutVarStr: dict, extraNodeInfoDict: dict):
    inputsRef = curNode.getInputsRef()
    assert(len(inputsRef) == 2)
    return (None, AST.Reshape(AST.ID(dictNodeNameToOutVarStr[inputsRef[0]]), extraNodeInfoDict[curNode.getName()][0], None))

def Conv2D(graph: Graph.Graph, curNode: Graph.Node, dictNodeNameToOutVarStr: dict, extraNodeInfoDict: dict):
    inputsRef = curNode.getInputsRef()
    assert(len(inputsRef) == 2)

    options = {}
    # TODO : Parse the other options and make sure the backend consumes them.
    # Other options left to parse include T, data_format and dilations.

    paddingUsed = curNode.getAttrMapRef()["\"padding\""].getS()
    if (paddingUsed == "\"SAME\""):
        options["padding"] = 0
    elif (paddingUsed == "\"VALID\""):
        options["padding"] = 1
    else:
        options["padding"] = -1

    stridesUsed = curNode.getAttrMapRef()["\"strides\""].getList().getILi()
    options["strides"] = stridesUsed

    return (None, AST.BOp(AST.ID(dictNodeNameToOutVarStr[inputsRef[0]]),
                          TFNodesAST.getOperatorsIdx('#'),
                          AST.ID(dictNodeNameToOutVarStr[inputsRef[1]]),
                          options))

def MaxPool(graph: Graph.Graph, curNode: Graph.Node, dictNodeNameToOutVarStr: dict, extraNodeInfoDict: dict):
    inputsRef = curNode.getInputsRef()
    assert(len(inputsRef) == 1)

    options = {}

    stridesUsed = curNode.getAttrMapRef()["\"strides\""].getList().getILi()
    assert((stridesUsed[0] == 1) and (stridesUsed[3] == 1))
    strideH = stridesUsed[1]
    strideW = stridesUsed[2]

    kSizeUsed = curNode.getAttrMapRef()["\"ksize\""].getList().getILi()
    assert((kSizeUsed[0] == 1) and (kSizeUsed[3] == 1))
    kSizeH = kSizeUsed[1]
    kSizeW = kSizeUsed[2]

    paddingUsedStr = curNode.getAttrMapRef()["\"padding\""].getS()
    zPadH = zPadW = -1
    if (paddingUsedStr == "\"SAME\""):
        zPadH = int((kSizeH - 1)/2)
        zPadW = int((kSizeW - 1)/2)
    elif (paddingUsedStr == "\"VALID\""):
        zPadH = zPadW = 0
    else:
        zPadH = zPadW = -1

    inputShape = extraNodeInfoDict[inputsRef[0]][0]
    imgH = inputShape[1]
    imgW = inputShape[2]
    return (None, AST.UninterpFuncCall(extraNodeInfoDict[curNode.getName()][0],
                                       TFNodesAST.UninterpFuncCallNames.MaxPool.name,
                                       [AST.Int(kSizeH, 32), AST.Int(kSizeW, 32),
                                        AST.Int(zPadH, 32), AST.Int(zPadW, 32),
                                        AST.Int(strideH, 32), AST.Int(strideW, 32),
                                        AST.ID(dictNodeNameToOutVarStr[inputsRef[0]])]))

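MaxPool above (and AvgPool further down) derive the zero-padding for TF's `"SAME"` scheme as `(k - 1) // 2` per side. A minimal sketch of that rule, assuming stride 1 (TF's general `"SAME"` formula also depends on stride and input size); the helper names are illustrative:

```python
# 'SAME' padding rule used above: pad (k - 1) // 2 on each side so that,
# for stride 1, the pooled output has the same spatial size as the input.
def same_pad(kernel_size):
    return (kernel_size - 1) // 2

# Standard pooling/convolution output-size formula.
def pooled_size(img, k, pad, stride):
    return (img + 2 * pad - k) // stride + 1

k = 3
print(pooled_size(28, k, same_pad(k), stride=1))  # 28: size preserved
```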
def Pack(graph: Graph.Graph, curNode: Graph.Node, dictNodeNameToOutVarStr: dict, extraNodeInfoDict: dict):
    inputsRef = curNode.getInputsRef()
    N = curNode.getAttrMapRef()["\"N\""].getI()
    axis = curNode.getAttrMapRef()["\"axis\""].getI()
    assert(len(inputsRef) == N)
    retAST = AST.UninterpFuncCall(extraNodeInfoDict[curNode.getName()][0],
                                  TFNodesAST.UninterpFuncCallNames.Pack.name,
                                  list(map(lambda x: AST.ID(dictNodeNameToOutVarStr[x]), inputsRef)) + [AST.Int(axis)])
    return (None, retAST)

def ConcatV2(graph: Graph.Graph, curNode: Graph.Node, dictNodeNameToOutVarStr: dict, extraNodeInfoDict: dict):
    inputsRef = curNode.getInputsRef()
    N = curNode.getAttrMapRef()["\"N\""].getI()
    assert(len(inputsRef) == N+1)  # One extra for axis.
    # TODO : Since the axis of concat is constant, it is known here -- the
    # inputs' sizes along that dimension should be passed to the function
    # below. For now it is hardcoded.
    retAST = AST.UninterpFuncCall(extraNodeInfoDict[curNode.getName()][0],
                                  TFNodesAST.UninterpFuncCallNames.Concat.name + str(N) + 'T',
                                  list(map(lambda x: AST.ID(dictNodeNameToOutVarStr[x]), inputsRef)),
                                  outputDiffInpDims=1)
    return (None, retAST)

def ExpandDims(graph: Graph.Graph, curNode: Graph.Node, dictNodeNameToOutVarStr: dict, extraNodeInfoDict: dict):
    inputsRef = curNode.getInputsRef()
    assert(len(inputsRef) == 2)
    retAST = AST.UninterpFuncCall(extraNodeInfoDict[curNode.getName()][0],
                                  TFNodesAST.UninterpFuncCallNames.ExpandDims.name,
                                  list(map(lambda x: AST.ID(dictNodeNameToOutVarStr[x]), inputsRef)))
    return (None, retAST)

def Slice(graph: Graph.Graph, curNode: Graph.Node, dictNodeNameToOutVarStr: dict, extraNodeInfoDict: dict):
    inputsRef = curNode.getInputsRef()
    assert(len(inputsRef) == 3)
    curNodeDataType = curNode.getAttrMapRef()["\"T\""].getDataType()
    curNodeShapeASTLi = list(map(lambda x: AST.Int(x), extraNodeInfoDict[curNode.getName()][0]))
    retAST = AST.UninterpFuncCall(extraNodeInfoDict[curNode.getName()][0],
                                  TFNodesAST.UninterpFuncCallNames.CreateCopy.name,
                                  [AST.ID(dictNodeNameToOutVarStr[inputsRef[0]]),  # of this
                                   AST.ID(dictNodeNameToOutVarStr[inputsRef[1]]),  # begin idx
                                   AST.ID(dictNodeNameToOutVarStr[inputsRef[2]])   # size
                                   ])
    return (None, retAST)

def Tile(graph: Graph.Graph, curNode: Graph.Node, dictNodeNameToOutVarStr: dict, extraNodeInfoDict: dict):
    inputsRef = curNode.getInputsRef()
    assert(len(inputsRef) == 2)
    return (None, AST.UninterpFuncCall(extraNodeInfoDict[curNode.getName()][0],
                                       TFNodesAST.UninterpFuncCallNames.Tile.name,
                                       [AST.ID(dictNodeNameToOutVarStr[inputsRef[0]]), AST.ID(dictNodeNameToOutVarStr[inputsRef[1]])]))

def ShapeN(graph: Graph.Graph, curNode: Graph.Node, dictNodeNameToOutVarStr: dict, extraNodeInfoDict: dict):
    # TODO : generalize -- remove usage of Declare.
    inputsRef = curNode.getInputsRef()
    assert(len(inputsRef) == 2)
    N = curNode.getAttrMapRef()["\"N\""].getI()
    assert(N == 2)
    # TODO
    # curNodeShape = extraNodeInfoDict[curNode.getName()][0]
    # curNodeDataType = curNode.getAttrMapRef()["\"T\""].getDataType()
    # retAST = AST.Let(AST.ID('temp_shapen_1'), AST.Declare(list(map(lambda x: AST.Int(x), curNodeShape)), AST.ID(curNodeDataType.name)), None)
    # retAST.expr = AST.Let(AST.Index(AST.ID('temp_shapen_1'), [AST.Int(0)]),
    #                       AST.Func(TFNodesAST.getOperatorsIdx('shape'), AST.ID(dictNodeNameToOutVarStr[inputsRef[0]])),
    #                       None)
    # retAST.expr.expr = AST.Let(AST.Index(AST.ID('temp_shapen_1'), [AST.Int(1)]),
    #                            AST.Func(TFNodesAST.getOperatorsIdx('shape'), AST.ID(dictNodeNameToOutVarStr[inputsRef[1]])),
    #                            AST.ID('temp_shapen_1'))

    return (None, None)

def Sum(graph: Graph.Graph, curNode: Graph.Node, dictNodeNameToOutVarStr: dict, extraNodeInfoDict: dict):
    inputsRef = curNode.getInputsRef()
    assert(len(inputsRef) == 2)
    return (None, AST.Reduce(AST.ID(dictNodeNameToOutVarStr[inputsRef[0]]),
                             AST.ID(dictNodeNameToOutVarStr[inputsRef[1]]),
                             TFNodesAST.getOperatorsIdx('+')))

def Prod(graph: Graph.Graph, curNode: Graph.Node, dictNodeNameToOutVarStr: dict, extraNodeInfoDict: dict):
    inputsRef = curNode.getInputsRef()
    assert(len(inputsRef) == 2)
    return (None, AST.Reduce(AST.ID(dictNodeNameToOutVarStr[inputsRef[0]]),
                             AST.ID(dictNodeNameToOutVarStr[inputsRef[1]]),
                             TFNodesAST.getOperatorsIdx('*')))

def Mean(graph: Graph.Graph, curNode: Graph.Node, dictNodeNameToOutVarStr: dict, extraNodeInfoDict: dict):
    inputsRef = curNode.getInputsRef()
    assert(len(inputsRef) == 2)
    return (None, AST.Reduce(AST.ID(dictNodeNameToOutVarStr[inputsRef[0]]),
                             AST.ID(dictNodeNameToOutVarStr[inputsRef[1]]),
                             TFNodesAST.getOperatorsIdx('mean')))

def ArgMax(graph: Graph.Graph, curNode: Graph.Node, dictNodeNameToOutVarStr: dict, extraNodeInfoDict: dict):
    inputsRef = curNode.getInputsRef()
    assert(len(inputsRef) == 2)
    return (None, AST.ArgMax(extraNodeInfoDict[curNode.getName()][0],
                             AST.ID(dictNodeNameToOutVarStr[inputsRef[0]]),
                             AST.ID(dictNodeNameToOutVarStr[inputsRef[1]]),
                             extraNodeInfoDict[inputsRef[0]][0]))

def LogSoftmax(graph: Graph.Graph, curNode: Graph.Node, dictNodeNameToOutVarStr: dict, extraNodeInfoDict: dict):
    inputsRef = curNode.getInputsRef()
    assert(len(inputsRef) == 1)
    expAST = AST.Func(TFNodesAST.getOperatorsIdx('exp'),
                      AST.ID(dictNodeNameToOutVarStr[inputsRef[0]]))
    reduceAST = AST.Reduce(expAST, AST.Int(-1),
                           TFNodesAST.getOperatorsIdx('+'))
    # log(softmax(x)) = x - log(sum(exp(x))).
    return (None, AST.BOp(AST.ID(dictNodeNameToOutVarStr[inputsRef[0]]),
                          TFNodesAST.getOperatorsIdx('+'),
                          AST.UOp(TFNodesAST.getOperatorsIdx('-'),
                                  AST.Func(TFNodesAST.getOperatorsIdx('log'), reduceAST))))

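LogSoftmax above is lowered via the identity `log(softmax(x)_i) = x_i - log(sum_j(exp(x_j)))`. A numeric sketch of that identity in plain Python (the helper name `log_softmax` is illustrative, not part of this codebase):

```python
import math

# log-softmax via the identity the AST above encodes:
# log(softmax(x)_i) = x_i - log(sum_j(exp(x_j))).
def log_softmax(xs):
    lse = math.log(sum(math.exp(x) for x in xs))
    return [x - lse for x in xs]

# Exponentiating the outputs recovers a probability distribution.
print([round(math.exp(v), 6) for v in log_softmax([1.0, 2.0, 3.0])])
```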
def StopGradient(graph: Graph.Graph, curNode: Graph.Node, dictNodeNameToOutVarStr: dict, extraNodeInfoDict: dict):
    inputsRef = curNode.getInputsRef()
    assert(len(inputsRef) == 1)
    return (None, AST.ID(dictNodeNameToOutVarStr[inputsRef[0]]))

def SoftmaxCrossEntropyWithLogits(graph: Graph.Graph, curNode: Graph.Node, dictNodeNameToOutVarStr: dict, extraNodeInfoDict: dict):
    # Input1 is the logits and Input2 is the one-hot encoded true distribution.
    # Calculate softmax on input1 and cross-entropy between that (p(x)) and the true distribution (q(x)).
    # Cross-entropy = \summation_x{-q(x)*log(p(x))}.
    inputsRef = curNode.getInputsRef()
    assert(len(inputsRef) == 2)
    logitsInpt = AST.ID(dictNodeNameToOutVarStr[inputsRef[0]])
    labelsInpt = AST.ID(dictNodeNameToOutVarStr[inputsRef[1]])

    # Reduce along columns to get a row vector.
    # TODO : Use a softmax operator or implement it here?
    retAST = AST.Let(AST.ID('temp_softmax'),
                     AST.Func(TFNodesAST.getOperatorsIdx('softmax'), logitsInpt),
                     None)
    retAST.expr = AST.Let(AST.ID('temp_1'),
                          AST.UOp(TFNodesAST.getOperatorsIdx('-'),
                                  AST.Reduce(AST.BOp(labelsInpt,
                                                     TFNodesAST.getOperatorsIdx('.*'),
                                                     AST.Func(TFNodesAST.getOperatorsIdx('log'),
                                                              AST.ID('temp_softmax'))),
                                             1, TFNodesAST.getOperatorsIdx('+'))),
                          AST.ID('temp_1'))
    return (None, retAST)

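SoftmaxCrossEntropyWithLogits above builds a softmax over the logits followed by the cross-entropy `-sum_x(q(x) * log(p(x)))` against the labels. A minimal plain-Python sketch of that computation (the helper names `softmax` and `cross_entropy` are illustrative, not part of this codebase):

```python
import math

# Softmax over the logits: p(x).
def softmax(xs):
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# Cross-entropy between p(x) and the true distribution q(x):
# -sum_x(q(x) * log(p(x))).
def cross_entropy(logits, one_hot_labels):
    probs = softmax(logits)
    return -sum(q * math.log(p) for q, p in zip(one_hot_labels, probs))

# With a one-hot label this reduces to -log(p[true_class]).
print(cross_entropy([2.0, 1.0, 0.1], [1, 0, 0]))
```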
def BroadcastGradientArgs(graph: Graph.Graph, curNode: Graph.Node, dictNodeNameToOutVarStr: dict, extraNodeInfoDict: dict):
    return (None, AST.ID("temp"))  # TODO

def ReluGrad(graph: Graph.Graph, curNode: Graph.Node, dictNodeNameToOutVarStr: dict, extraNodeInfoDict: dict):
    inputsRef = curNode.getInputsRef()
    assert(len(inputsRef) == 2)
    return (None, AST.Cond(AST.ID(dictNodeNameToOutVarStr[inputsRef[1]]),
                           AST.Int(1),
                           AST.ID(dictNodeNameToOutVarStr[inputsRef[0]]),
                           AST.Int(0)))

def MaxPoolGrad(graph: Graph.Graph, curNode: Graph.Node, dictNodeNameToOutVarStr: dict, extraNodeInfoDict: dict):
    inputsRef = curNode.getInputsRef()
    return (None, AST.UninterpFuncCall(extraNodeInfoDict[curNode.getName()][0],
                                       TFNodesAST.UninterpFuncCallNames.MaxPoolGrad.name,
                                       list(map(lambda x: AST.ID(dictNodeNameToOutVarStr[x]), inputsRef))))

def Conv2DBackpropInput(graph: Graph.Graph, curNode: Graph.Node, dictNodeNameToOutVarStr: dict, extraNodeInfoDict: dict):
    inputsRef = curNode.getInputsRef()
    return (None, AST.UninterpFuncCall(extraNodeInfoDict[curNode.getName()][0],
                                       TFNodesAST.UninterpFuncCallNames.Conv2DBackpropInput.name,
                                       list(map(lambda x: AST.ID(dictNodeNameToOutVarStr[x]), inputsRef))))

def Conv2DBackpropFilter(graph: Graph.Graph, curNode: Graph.Node, dictNodeNameToOutVarStr: dict, extraNodeInfoDict: dict):
    inputsRef = curNode.getInputsRef()
    return (None, AST.UninterpFuncCall(extraNodeInfoDict[curNode.getName()][0],
                                       TFNodesAST.UninterpFuncCallNames.Conv2DBackpropFilter.name,
                                       list(map(lambda x: AST.ID(dictNodeNameToOutVarStr[x]), inputsRef))))

def NoOp(graph: Graph.Graph, curNode: Graph.Node, dictNodeNameToOutVarStr: dict, extraNodeInfoDict: dict):
    return (None, None)

def Square(graph: Graph.Graph, curNode: Graph.Node, dictNodeNameToOutVarStr: dict, extraNodeInfoDict: dict):
    inputsRef = curNode.getInputsRef()
    assert(len(inputsRef) == 1)
    return (None, AST.BOp(AST.ID(dictNodeNameToOutVarStr[inputsRef[0]]),
                          TFNodesAST.getOperatorsIdx('.*'),
                          AST.ID(dictNodeNameToOutVarStr[inputsRef[0]])))

def AvgPool(graph: Graph.Graph, curNode: Graph.Node, dictNodeNameToOutVarStr: dict, extraNodeInfoDict: dict):
    inputsRef = curNode.getInputsRef()
    assert(len(inputsRef) == 1)

    options = {}

    stridesUsed = curNode.getAttrMapRef()["\"strides\""].getList().getILi()
    assert((stridesUsed[0] == 1) and (stridesUsed[3] == 1))
    strideH = stridesUsed[1]
    strideW = stridesUsed[2]

    kSizeUsed = curNode.getAttrMapRef()["\"ksize\""].getList().getILi()
    assert((kSizeUsed[0] == 1) and (kSizeUsed[3] == 1))
    kSizeH = kSizeUsed[1]
    kSizeW = kSizeUsed[2]

    paddingUsedStr = curNode.getAttrMapRef()["\"padding\""].getS()
    zPadH = zPadW = -1
    if (paddingUsedStr == "\"SAME\""):
        zPadH = int((kSizeH - 1)/2)
        zPadW = int((kSizeW - 1)/2)
    elif (paddingUsedStr == "\"VALID\""):
        zPadH = zPadW = 0
    else:
        zPadH = zPadW = -1

    inputShape = extraNodeInfoDict[inputsRef[0]][0]
    imgH = inputShape[1]
    imgW = inputShape[2]
    return (None, AST.UninterpFuncCall(extraNodeInfoDict[curNode.getName()][0],
                                       TFNodesAST.UninterpFuncCallNames.AvgPool.name,
                                       [AST.Int(kSizeH, 32), AST.Int(kSizeW, 32),
                                        AST.Int(zPadH, 32), AST.Int(zPadW, 32),
                                        AST.Int(strideH, 32), AST.Int(strideW, 32),
                                        AST.ID(dictNodeNameToOutVarStr[inputsRef[0]])]))

def Pad(graph: Graph.Graph, curNode: Graph.Node, dictNodeNameToOutVarStr: dict, extraNodeInfoDict: dict):
    # Mode refers to 'CONSTANT', 'REFLECT' or 'SYMMETRIC'.
    mode = 0
    if ("\"mode\"" in curNode.getAttrMapRef()):
        mode = curNode.getAttrMapRef()["\"mode\""].getI()

    constant_values = 0
    if ("\"constant_values\"" in curNode.getAttrMapRef()):
        constant_values = curNode.getAttrMapRef()["\"constant_values\""].getI()

    # For now, only CONSTANT mode with zero padding is supported --
    # deal with SYMMETRIC and REFLECT when the time comes.
    assert(mode == 0 and constant_values == 0)
    inputsRef = curNode.getInputsRef()
    inputTensorShapeLi = extraNodeInfoDict[inputsRef[0]][0]
    return (None, AST.UninterpFuncCall(extraNodeInfoDict[curNode.getName()][0],
                                       TFNodesAST.UninterpFuncCallNames.Pad.name,
                                       [AST.ID(dictNodeNameToOutVarStr[inputsRef[0]]),
                                        AST.ID(dictNodeNameToOutVarStr[inputsRef[1]])],
                                       outputDiffInpDims=1))

def FusedBatchNorm(graph: Graph.Graph, curNode: Graph.Node, dictNodeNameToOutVarStr: dict, extraNodeInfoDict: dict):
    # NOTE : Since the weights to this layer will be scaled appropriately, this op becomes identity.
    inputsRef = curNode.getInputsRef()

    # TODO : The commented code below is the right way of implementing the
    # operator. For now an uninterpreted function call is used instead.
    # tempAst = AST.BOp(AST.ID(dictNodeNameToOutVarStr[inputsRef[0]]),
    #                   TFNodesAST.getOperatorsIdx('*'),
    #                   AST.ID(dictNodeNameToOutVarStr[inputsRef[1]]))
    # return (None, AST.BOp(tempAst,
    #                       TFNodesAST.getOperatorsIdx('+'),
    #                       AST.ID(dictNodeNameToOutVarStr[inputsRef[2]])))
    return (None, AST.UninterpFuncCall(extraNodeInfoDict[curNode.getName()][0],
                                       TFNodesAST.UninterpFuncCallNames.TempFusedBatchNorm.name,
                                       [AST.ID(dictNodeNameToOutVarStr[inputsRef[0]]),
                                        AST.ID(dictNodeNameToOutVarStr[inputsRef[1]]),
                                        AST.ID(dictNodeNameToOutVarStr[inputsRef[2]]),
                                        ]))

def Squeeze(graph: Graph.Graph, curNode: Graph.Node, dictNodeNameToOutVarStr: dict, extraNodeInfoDict: dict):
    # TODO : Do this in a somewhat better way.
    inputsRef = curNode.getInputsRef()
    inputTensorShape = extraNodeInfoDict[inputsRef[0]][0]
    inputTensorRank = len(inputTensorShape)

    squeezeDims = curNode.getAttrMapRef()["\"squeeze_dims\""].getList().getILi()
    squeezeDimsRank = len(squeezeDims)

    return (None, AST.UninterpFuncCall(extraNodeInfoDict[curNode.getName()][0],
                                       TFNodesAST.UninterpFuncCallNames.Squeeze.name,
                                       list(map(lambda x: AST.Int(x, 32), squeezeDims)) +
                                       [AST.ID(dictNodeNameToOutVarStr[inputsRef[0]])]))

def BiasAdd(graph: Graph.Graph, curNode: Graph.Node, dictNodeNameToOutVarStr: dict, extraNodeInfoDict: dict):
    inputsRef = curNode.getInputsRef()
    assert(len(inputsRef) == 2)
    return (None, AST.BOp(AST.ID(dictNodeNameToOutVarStr[inputsRef[0]]),
                          TFNodesAST.getOperatorsIdx('+'),
                          AST.ID(dictNodeNameToOutVarStr[inputsRef[1]])))