Initial Commit
This commit is contained in:
Parent
c578e7a332
Commit
5575701178
@ -0,0 +1,8 @@
fileFormatVersion: 2
guid: 6d3b29b401b2a4ec893b605b781f8569
folderAsset: yes
DefaultImporter:
  externalObjects: {}
  userData:
  assetBundleName:
  assetBundleVariant:
@ -0,0 +1,166 @@
<!---TODO:
Advanced topics
* worker.AddInput(): to prewarm data
* how to trim networks at runtime (multi brain models)
* loading model from url: var modelFromDiskOrInternet = ModelLoader.Load(url, verbose); // will download and cache model from url
* recurrent state
--->

# Barracuda

**Barracuda** is a lightweight and **cross-platform** Neural Net **inference library for Unity**. Barracuda can execute both on the GPU and the CPU. Barracuda is currently in an early stage of development, so some adventures are to be expected.

## Using Barracuda
Typically, the following steps are needed to use Barracuda in an application:
1. load the model,
2. create an inference engine (the worker),
3. execute the model and
4. fetch the results.

First, however, you have to convert your TensorFlow (or ONNX) model to the Barracuda format with the provided Python scripts. Example usage:
```bash
python onnx_to_barracuda.py Models/mnist/model.onnx Destination/mnist.bytes
```
See the _Converting TensorFlow and ONNX models to Barracuda format_ section below for more information.

### Load Model into Barracuda
Once you have converted your TensorFlow (or ONNX) model, you can load the resulting Barracuda file via `ModelLoader`:
```C#
var model = ModelLoader.LoadFromStreamingAssets(modelName + ".bytes");
```

### Create inference engine (Worker)
The inference engine in Barracuda is called a Worker. The Worker is responsible for converting the model into executable tasks and scheduling them on the GPU or CPU.
```C#
var worker = BarracudaWorkerFactory.CreateWorker(BarracudaWorkerFactory.Type.ComputeFast, model);
```

### Execute the model
Inputs can be provided either as a single `Tensor` object (assuming the model has only one input) or as a dictionary of name and `Tensor` pairs.

```C#
var inputs = new Dictionary<string, Tensor>();
inputs[name1] = new Tensor(...);
inputs[name2] = new Tensor(...);
worker.Execute(inputs);
```
Execution is asynchronous for GPU backends. The current implementation is synchronous for CPU backends; however, it is best to assume that execution will be asynchronous for all backends in the future.
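
Since execution may be asynchronous, a common pattern is to schedule `Execute` early and defer fetching until the result is actually needed, so other CPU work can overlap with inference. A minimal sketch, assuming a single-input model and assuming that `Fetch` acts as the synchronization point:

```C#
worker.Execute(input);       // schedules inference; with a GPU backend this may return before the work is done
// ... do other CPU work while the network runs ...
var output = worker.Fetch(); // assumed to wait here until the result is ready
output.Dispose();
```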

### Fetch outputs
If the model has only a single output, then a simple `worker.Fetch()` can be used; otherwise, output names should be provided.
```C#
var O = worker.Fetch(outputName);
```

### Cleanup
As a Barracuda client you are responsible for calling `Dispose` on the _worker_, the _inputs_ and any _outputs_ you fetched. This is necessary to properly free GPU resources.
```C#
O.Dispose();
worker.Dispose();
```
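
Putting the steps above together, a minimal end-to-end sketch could look like the following. The model file name and the input shape are placeholders, and a single input and a single output are assumed:

```C#
// Hypothetical end-to-end usage; "mnist.bytes" and the 1x28x28x1 input shape are placeholders.
var model  = ModelLoader.LoadFromStreamingAssets("mnist.bytes");
var worker = BarracudaWorkerFactory.CreateWorker(BarracudaWorkerFactory.Type.ComputeFast, model);

var input = new Tensor(1, 28, 28, 1);   // batch, height, width, channels
worker.Execute(input);                   // single-input models can take a Tensor directly
var output = worker.Fetch();             // single-output models need no output name

// ... read results from output ...

input.Dispose();
output.Dispose();
worker.Dispose();
```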

## Working with data

### Tensor
Barracuda stores data in `batch`, `height`, `width`, `channels` order, also known as _NHWC_ or _channels-last_ format. You can interact with `Tensor` data via multi-dimensional array operators:
```C#
var tensor = new Tensor(batchCount, height, width, channelCount);
tensor[n, y, x, c] = 1.0f; // as N batches of 3 dimensional data: N x {Y, X, C}
tensor[n, c] = 2.0f;       // as N batches of 1 dimensional data: N x {C}
tensor[i] = 3.0f;          // as flat array
```
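
Because the layout is channels-last, the flat-array form addresses the same element as the multi-dimensional form. A sketch of the correspondence follows; the `TensorShape` type name is an assumption, while the `batch`/`height`/`width`/`channels` fields are the ones shown in the shape example later in this section:

```C#
// Flat index that corresponds to tensor[n, y, x, c] in NHWC (channels-last) layout.
int FlatIndex(TensorShape shape, int n, int y, int x, int c)
{
    return ((n * shape.height + y) * shape.width + x) * shape.channels + c;
}
```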

There are a number of `Tensor` constructors that cover a variety of scenarios. By default, tensors are initialized with `0` upon construction, unless an initialization `Array` is provided.
```C#
tensor = new Tensor(batchCount, height, width, channelCount); // batch of 3 dimensional data, 0 initialized: batchCount x {height, width, channelCount}
tensor = new Tensor(batchCount, elementCount);                // batch of 1 dimensional data, 0 initialized: batchCount x {elementCount}

var stridedArray = new float[batchCount * elementCount] { ... };
tensor = new Tensor(batchCount, elementCount, stridedArray);  // batch of 1 dimensional data, initialized from strided array

var jaggedArray = new float[batchCount][elementCount] { ... };
tensor = new Tensor(batchCount, elementCount, jaggedArray);   // batch of 1 dimensional data, initialized from jagged array

Texture2D texture = ...;
tensor = new Tensor(texture); // tensor initialized with texture data: 1 x { texture.width, texture.height, 3}
```

You can query the shape of a `Tensor` object, but you cannot change it: the shape of a `Tensor` is immutable. If you want a `Tensor` with a different shape, you have to construct a new `Tensor` instance.
```C#
var shape = tensor.shape;
Debug.Log(shape + " or " + shape.batch + shape.height + shape.width + shape.channels);
```

### Texture as input
You can pass `Texture2D`, `Texture2DArray`, `Texture3D` or `RenderTexture` objects directly to Barracuda without accessing individual pixels on the CPU:
```C#
var channelCount = 3; // you can treat input pixels as 1 (grayscale), 3 (color) or 4 (color with alpha) channels
var tensor = new Tensor(texture, channelCount);
```
You can batch multiple textures into a single `Tensor` object:
```C#
var textures = new [] { texture0, texture1, texture2, texture3 }; // these textures will form a batch
var tensor = new Tensor(textures, channelCount);
```
Note that to form a batch, all textures must have the same width and height.
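
A simple guard before batching can make this requirement explicit. A sketch using `Debug.Assert`, where `textures` is the array from the example above:

```C#
// All textures in a batch are required to share the same width and height.
foreach (var t in textures)
    Debug.Assert(t.width == textures[0].width && t.height == textures[0].height);
var tensor = new Tensor(textures, channelCount);
```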

### Texture as output
If you want to use Barracuda execution results further in the graphics pipeline, you can copy data from a `Tensor` into a `RenderTexture` without stalling the CPU or GPU:
```C#
var tensor = worker.Fetch();
var texture = BarracudaTextureUtils.TensorToRenderTexture(tensor);
```
If you wish, you can reuse the same `RenderTexture` multiple times:
```C#
var texture = new RenderTexture(width, height, 0);
// ...
var tensor = worker.Fetch();
BarracudaTextureUtils.TensorToRenderTexture(tensor, texture);
```
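
The resulting `RenderTexture` can then be used like any other Unity texture, for example by assigning it to a material (a sketch; `material` is assumed to already exist in the scene):

```C#
material.mainTexture = texture; // display the network output directly in the graphics pipeline
```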

## Introspecting Barracuda models
A Barracuda model has a very simple in-memory representation. Once a model is loaded, you can query it for its inputs and outputs:
```C#
string[] inputNames = model.inputs;   // query model inputs
string[] outputNames = model.outputs; // query model outputs
```
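
The queried input names can be used directly as dictionary keys when preparing inputs for `Execute`. A sketch, where the tensor shape is a placeholder:

```C#
var inputs = new Dictionary<string, Tensor>();
foreach (var name in model.inputs)
    inputs[name] = new Tensor(1, 28, 28, 1); // placeholder shape; use the shape each input actually expects
worker.Execute(inputs);
```
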
Alternatively, you can iterate directly through the layers and investigate what the model is going to do:
```C#
foreach (var layer in model.layers)
    Debug.Log(layer.name + " does " + layer.type);
```

## Verbose mode
You can turn on verbose mode for different parts of Barracuda:
```C#
bool verbose = true;
var model = ModelLoader.LoadFromStreamingAssets(modelName + ".bytes", verbose); // verbose loader
var worker = BarracudaWorkerFactory.CreateWorker(BarracudaWorkerFactory.Type.ComputeFast, model, verbose); // verbose execution
```

## Converting TensorFlow and ONNX models to Barracuda format
Barracuda comes with dedicated Python scripts to convert pre-trained TensorFlow and ONNX models to the Barracuda format.

Convert from TensorFlow:
```bash
python tensorflow_to_barracuda.py Models/3DBall-tf-model.pb Destination/3DBall-bc.bytes
```

Convert from ONNX:
```bash
python onnx_to_barracuda.py Models/mnist/model.onnx Destination/mnist-bc.bytes
```

If the network has multiple outputs, but you need only particular ones during inference, there is an optional `-trim` flag to remove unused outputs and calculations.
For example:
```bash
python tensorflow_to_barracuda.py Models/3DBall-tf-model.pb Destination/3DBall-bc.bytes -trim action$
```
Trimming will first remove from the graph any outputs that do not match the regular expression; in this case only outputs whose names end with `action` will be kept.
Next, it will strip all nodes that do not participate in the evaluation of the remaining outputs.


P.S. Python 3.5 or 3.6 is recommended.

P.P.S. We plan to migrate the TensorFlow and ONNX converters from Python to C# in the future.

@ -0,0 +1,7 @@
|
||||||
|
fileFormatVersion: 2
|
||||||
|
guid: 3cf2bcd7dcfe144bebf6cf271e7dfbe0
|
||||||
|
TextScriptImporter:
|
||||||
|
externalObjects: {}
|
||||||
|
userData:
|
||||||
|
assetBundleName:
|
||||||
|
assetBundleVariant:
|
|
@ -0,0 +1,8 @@
|
||||||
|
fileFormatVersion: 2
|
||||||
|
guid: 4d59cec597ba94288831c0cade38b14e
|
||||||
|
folderAsset: yes
|
||||||
|
DefaultImporter:
|
||||||
|
externalObjects: {}
|
||||||
|
userData:
|
||||||
|
assetBundleName:
|
||||||
|
assetBundleVariant:
|
Binary file not shown.
|
@ -0,0 +1,30 @@
|
||||||
|
fileFormatVersion: 2
|
||||||
|
guid: de59cc66e5e394f93b2a692e50bce97f
|
||||||
|
PluginImporter:
|
||||||
|
externalObjects: {}
|
||||||
|
serializedVersion: 2
|
||||||
|
iconMap: {}
|
||||||
|
executionOrder: {}
|
||||||
|
isPreloaded: 0
|
||||||
|
isOverridable: 0
|
||||||
|
platformData:
|
||||||
|
- first:
|
||||||
|
Any:
|
||||||
|
second:
|
||||||
|
enabled: 1
|
||||||
|
settings: {}
|
||||||
|
- first:
|
||||||
|
Editor: Editor
|
||||||
|
second:
|
||||||
|
enabled: 0
|
||||||
|
settings:
|
||||||
|
DefaultValueInitialized: true
|
||||||
|
- first:
|
||||||
|
Windows Store Apps: WindowsStoreApps
|
||||||
|
second:
|
||||||
|
enabled: 0
|
||||||
|
settings:
|
||||||
|
CPU: AnyCPU
|
||||||
|
userData:
|
||||||
|
assetBundleName:
|
||||||
|
assetBundleVariant:
|
|
@ -0,0 +1,8 @@
|
||||||
|
fileFormatVersion: 2
|
||||||
|
guid: a7bba248e968b476a875260a8127a595
|
||||||
|
folderAsset: yes
|
||||||
|
DefaultImporter:
|
||||||
|
externalObjects: {}
|
||||||
|
userData:
|
||||||
|
assetBundleName:
|
||||||
|
assetBundleVariant:
|
|
@ -0,0 +1,8 @@
|
||||||
|
fileFormatVersion: 2
|
||||||
|
guid: 5087a463bec2b4b76808e7307a94887f
|
||||||
|
folderAsset: yes
|
||||||
|
DefaultImporter:
|
||||||
|
externalObjects: {}
|
||||||
|
userData:
|
||||||
|
assetBundleName:
|
||||||
|
assetBundleVariant:
|
|
@ -0,0 +1,11 @@
{
    "name": "MacBLAS",
    "references": [],
    "optionalUnityReferences": [],
    "includePlatforms": [
        "Editor",
        "macOSStandalone"
    ],
    "excludePlatforms": [],
    "allowUnsafeCode": true
}
@ -0,0 +1,7 @@
|
||||||
|
fileFormatVersion: 2
|
||||||
|
guid: 53fc9961397934ed38a573ce1392c80c
|
||||||
|
AssemblyDefinitionImporter:
|
||||||
|
externalObjects: {}
|
||||||
|
userData:
|
||||||
|
assetBundleName:
|
||||||
|
assetBundleVariant:
|
|
@ -0,0 +1,29 @@
#if UNITY_STANDALONE_OSX || UNITY_EDITOR_OSX
using System.Runtime.InteropServices;
using Barracuda;
using UnityEngine;
using UnityEngine.Scripting;


[Preserve]
public class MacBLAS : BLASPlugin
{
    [DllImport("macblas")]
    static extern unsafe void macsgemm(float* Ap, int AN, int AM,
                                       float* Bp, int BN, int BM,
                                       float* Cp, int CN, int CM,
                                       int bs, bool transposeA, bool transposeB);

    public bool IsCurrentPlatformSupported()
    {
        return Application.platform == RuntimePlatform.OSXEditor ||
               Application.platform == RuntimePlatform.OSXPlayer;
    }

    public unsafe void SGEMM(float* Ap, int AN, int AM, float* Bp, int BN, int BM, float* Cp, int CN, int CM, int bs,
        bool transposeA = false, bool transposeB = false)
    {
        macsgemm(Ap, AN, AM, Bp, BN, BM, Cp, CN, CM, bs, transposeA, transposeB);
    }
}
#endif // UNITY_OSX
@ -0,0 +1,11 @@
|
||||||
|
fileFormatVersion: 2
|
||||||
|
guid: 680f04373f71f48a89408105d3f58a08
|
||||||
|
MonoImporter:
|
||||||
|
externalObjects: {}
|
||||||
|
serializedVersion: 2
|
||||||
|
defaultReferences: []
|
||||||
|
executionOrder: 0
|
||||||
|
icon: {instanceID: 0}
|
||||||
|
userData:
|
||||||
|
assetBundleName:
|
||||||
|
assetBundleVariant:
|
|
@ -0,0 +1,40 @@
|
||||||
|
fileFormatVersion: 2
|
||||||
|
guid: 6633afded85ec4f00a4cc653053461bb
|
||||||
|
folderAsset: yes
|
||||||
|
PluginImporter:
|
||||||
|
externalObjects: {}
|
||||||
|
serializedVersion: 2
|
||||||
|
iconMap: {}
|
||||||
|
executionOrder: {}
|
||||||
|
isPreloaded: 0
|
||||||
|
isOverridable: 0
|
||||||
|
platformData:
|
||||||
|
- first:
|
||||||
|
'': OSXIntel
|
||||||
|
second:
|
||||||
|
enabled: 1
|
||||||
|
settings: {}
|
||||||
|
- first:
|
||||||
|
'': OSXIntel64
|
||||||
|
second:
|
||||||
|
enabled: 1
|
||||||
|
settings: {}
|
||||||
|
- first:
|
||||||
|
Any:
|
||||||
|
second:
|
||||||
|
enabled: 0
|
||||||
|
settings: {}
|
||||||
|
- first:
|
||||||
|
Editor: Editor
|
||||||
|
second:
|
||||||
|
enabled: 1
|
||||||
|
settings:
|
||||||
|
DefaultValueInitialized: true
|
||||||
|
- first:
|
||||||
|
Standalone: OSXUniversal
|
||||||
|
second:
|
||||||
|
enabled: 1
|
||||||
|
settings: {}
|
||||||
|
userData:
|
||||||
|
assetBundleName:
|
||||||
|
assetBundleVariant:
|
|
@ -0,0 +1,8 @@
|
||||||
|
fileFormatVersion: 2
|
||||||
|
guid: 5de42c62131964fc999e1dc3d292cc31
|
||||||
|
folderAsset: yes
|
||||||
|
DefaultImporter:
|
||||||
|
externalObjects: {}
|
||||||
|
userData:
|
||||||
|
assetBundleName:
|
||||||
|
assetBundleVariant:
|
|
@ -0,0 +1,40 @@
|
||||||
|
<?xml version="1.0" encoding="UTF-8"?>
|
||||||
|
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
|
||||||
|
<plist version="1.0">
|
||||||
|
<dict>
|
||||||
|
<key>BuildMachineOSBuild</key>
|
||||||
|
<string>14F27</string>
|
||||||
|
<key>CFBundleDevelopmentRegion</key>
|
||||||
|
<string>en</string>
|
||||||
|
<key>CFBundleExecutable</key>
|
||||||
|
<string>macblas</string>
|
||||||
|
<key>CFBundleIdentifier</key>
|
||||||
|
<string>com.unity3d.macblas</string>
|
||||||
|
<key>CFBundleInfoDictionaryVersion</key>
|
||||||
|
<string>6.0</string>
|
||||||
|
<key>CFBundleName</key>
|
||||||
|
<string>macblas</string>
|
||||||
|
<key>CFBundlePackageType</key>
|
||||||
|
<string>BNDL</string>
|
||||||
|
<key>CFBundleShortVersionString</key>
|
||||||
|
<string>0.1.4</string>
|
||||||
|
<key>CFBundleVersion</key>
|
||||||
|
<string>1</string>
|
||||||
|
<key>DTCompiler</key>
|
||||||
|
<string>com.apple.compilers.llvm.clang.1_0</string>
|
||||||
|
<key>DTPlatformBuild</key>
|
||||||
|
<string>6A1052d</string>
|
||||||
|
<key>DTPlatformVersion</key>
|
||||||
|
<string>GM</string>
|
||||||
|
<key>DTSDKBuild</key>
|
||||||
|
<string>14A382</string>
|
||||||
|
<key>DTSDKName</key>
|
||||||
|
<string>macosx10.10</string>
|
||||||
|
<key>DTXcode</key>
|
||||||
|
<string>0610</string>
|
||||||
|
<key>DTXcodeBuild</key>
|
||||||
|
<string>6A1052d</string>
|
||||||
|
<key>NSHumanReadableCopyright</key>
|
||||||
|
<string>Copyright © 2018 Unity Technologies. All rights reserved.</string>
|
||||||
|
</dict>
|
||||||
|
</plist>
|
|
@ -0,0 +1,7 @@
|
||||||
|
fileFormatVersion: 2
|
||||||
|
guid: 844f003f25d444aafad9fb1fcea17bbc
|
||||||
|
DefaultImporter:
|
||||||
|
externalObjects: {}
|
||||||
|
userData:
|
||||||
|
assetBundleName:
|
||||||
|
assetBundleVariant:
|
|
@ -0,0 +1,8 @@
|
||||||
|
fileFormatVersion: 2
|
||||||
|
guid: 0620b207d80004fe595413acf79f2f66
|
||||||
|
folderAsset: yes
|
||||||
|
DefaultImporter:
|
||||||
|
externalObjects: {}
|
||||||
|
userData:
|
||||||
|
assetBundleName:
|
||||||
|
assetBundleVariant:
|
Binary data
Assets/Barracuda.Core/Barracuda/Plugins/OSX/macblas.bundle/Contents/MacOS/macblas
Executable file
Binary file not shown.
|
@ -0,0 +1,7 @@
|
||||||
|
fileFormatVersion: 2
|
||||||
|
guid: e9ef2c9e25cad478aa1220d6cf68a2ed
|
||||||
|
DefaultImporter:
|
||||||
|
externalObjects: {}
|
||||||
|
userData:
|
||||||
|
assetBundleName:
|
||||||
|
assetBundleVariant:
|
|
@ -0,0 +1,8 @@
|
||||||
|
fileFormatVersion: 2
|
||||||
|
guid: 93038b433855548879a151644d2354c1
|
||||||
|
folderAsset: yes
|
||||||
|
DefaultImporter:
|
||||||
|
externalObjects: {}
|
||||||
|
userData:
|
||||||
|
assetBundleName:
|
||||||
|
assetBundleVariant:
|
|
@ -0,0 +1,105 @@
|
||||||
|
<?xml version="1.0" encoding="UTF-8"?>
|
||||||
|
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
|
||||||
|
<plist version="1.0">
|
||||||
|
<dict>
|
||||||
|
<key>files</key>
|
||||||
|
<dict/>
|
||||||
|
<key>files2</key>
|
||||||
|
<dict/>
|
||||||
|
<key>rules</key>
|
||||||
|
<dict>
|
||||||
|
<key>^Resources/</key>
|
||||||
|
<true/>
|
||||||
|
<key>^Resources/.*\.lproj/</key>
|
||||||
|
<dict>
|
||||||
|
<key>optional</key>
|
||||||
|
<true/>
|
||||||
|
<key>weight</key>
|
||||||
|
<real>1000</real>
|
||||||
|
</dict>
|
||||||
|
<key>^Resources/.*\.lproj/locversion.plist$</key>
|
||||||
|
<dict>
|
||||||
|
<key>omit</key>
|
||||||
|
<true/>
|
||||||
|
<key>weight</key>
|
||||||
|
<real>1100</real>
|
||||||
|
</dict>
|
||||||
|
<key>^version.plist$</key>
|
||||||
|
<true/>
|
||||||
|
</dict>
|
||||||
|
<key>rules2</key>
|
||||||
|
<dict>
|
||||||
|
<key>.*\.dSYM($|/)</key>
|
||||||
|
<dict>
|
||||||
|
<key>weight</key>
|
||||||
|
<real>11</real>
|
||||||
|
</dict>
|
||||||
|
<key>^(.*/)?\.DS_Store$</key>
|
||||||
|
<dict>
|
||||||
|
<key>omit</key>
|
||||||
|
<true/>
|
||||||
|
<key>weight</key>
|
||||||
|
<real>2000</real>
|
||||||
|
</dict>
|
||||||
|
<key>^(Frameworks|SharedFrameworks|PlugIns|Plug-ins|XPCServices|Helpers|MacOS|Library/(Automator|Spotlight|LoginItems))/</key>
|
||||||
|
<dict>
|
||||||
|
<key>nested</key>
|
||||||
|
<true/>
|
||||||
|
<key>weight</key>
|
||||||
|
<real>10</real>
|
||||||
|
</dict>
|
||||||
|
<key>^.*</key>
|
||||||
|
<true/>
|
||||||
|
<key>^Info\.plist$</key>
|
||||||
|
<dict>
|
||||||
|
<key>omit</key>
|
||||||
|
<true/>
|
||||||
|
<key>weight</key>
|
||||||
|
<real>20</real>
|
||||||
|
</dict>
|
||||||
|
<key>^PkgInfo$</key>
|
||||||
|
<dict>
|
||||||
|
<key>omit</key>
|
||||||
|
<true/>
|
||||||
|
<key>weight</key>
|
||||||
|
<real>20</real>
|
||||||
|
</dict>
|
||||||
|
<key>^Resources/</key>
|
||||||
|
<dict>
|
||||||
|
<key>weight</key>
|
||||||
|
<real>20</real>
|
||||||
|
</dict>
|
||||||
|
<key>^Resources/.*\.lproj/</key>
|
||||||
|
<dict>
|
||||||
|
<key>optional</key>
|
||||||
|
<true/>
|
||||||
|
<key>weight</key>
|
||||||
|
<real>1000</real>
|
||||||
|
</dict>
|
||||||
|
<key>^Resources/.*\.lproj/locversion.plist$</key>
|
||||||
|
<dict>
|
||||||
|
<key>omit</key>
|
||||||
|
<true/>
|
||||||
|
<key>weight</key>
|
||||||
|
<real>1100</real>
|
||||||
|
</dict>
|
||||||
|
<key>^[^/]+$</key>
|
||||||
|
<dict>
|
||||||
|
<key>nested</key>
|
||||||
|
<true/>
|
||||||
|
<key>weight</key>
|
||||||
|
<real>10</real>
|
||||||
|
</dict>
|
||||||
|
<key>^embedded\.provisionprofile$</key>
|
||||||
|
<dict>
|
||||||
|
<key>weight</key>
|
||||||
|
<real>20</real>
|
||||||
|
</dict>
|
||||||
|
<key>^version\.plist$</key>
|
||||||
|
<dict>
|
||||||
|
<key>weight</key>
|
||||||
|
<real>20</real>
|
||||||
|
</dict>
|
||||||
|
</dict>
|
||||||
|
</dict>
|
||||||
|
</plist>
|
|
@ -0,0 +1,7 @@
|
||||||
|
fileFormatVersion: 2
|
||||||
|
guid: 523ab7e7760c743a9977ecfedabe1691
|
||||||
|
DefaultImporter:
|
||||||
|
externalObjects: {}
|
||||||
|
userData:
|
||||||
|
assetBundleName:
|
||||||
|
assetBundleVariant:
|
|
@ -0,0 +1,8 @@
|
||||||
|
fileFormatVersion: 2
|
||||||
|
guid: 256085e1b062345239f3d7d88741f96c
|
||||||
|
folderAsset: yes
|
||||||
|
DefaultImporter:
|
||||||
|
externalObjects: {}
|
||||||
|
userData:
|
||||||
|
assetBundleName:
|
||||||
|
assetBundleVariant:
|
|
@ -0,0 +1,11 @@
{
    "name": "iOSBLAS",
    "references": [],
    "optionalUnityReferences": [],
    "includePlatforms": [
        "Editor",
        "iOS"
    ],
    "excludePlatforms": [],
    "allowUnsafeCode": true
}
@ -0,0 +1,7 @@
|
||||||
|
fileFormatVersion: 2
|
||||||
|
guid: 005937e819cd540429ad05eabcfb642f
|
||||||
|
AssemblyDefinitionImporter:
|
||||||
|
externalObjects: {}
|
||||||
|
userData:
|
||||||
|
assetBundleName:
|
||||||
|
assetBundleVariant:
|
|
@ -0,0 +1,27 @@
#if UNITY_IOS
using System.Runtime.InteropServices;
using Barracuda;
using UnityEngine;
using UnityEngine.Scripting;

[Preserve]
public class iOSBLAS : BLASPlugin
{
    [DllImport("__Internal")]
    static extern unsafe void iossgemm(float* Ap, int AN, int AM,
                                       float* Bp, int BN, int BM,
                                       float* Cp, int CN, int CM,
                                       int bs, bool transposeA, bool transposeB);

    public bool IsCurrentPlatformSupported()
    {
        return Application.platform == RuntimePlatform.IPhonePlayer;
    }

    public unsafe void SGEMM(float* Ap, int AN, int AM, float* Bp, int BN, int BM, float* Cp, int CN, int CM, int bs,
        bool transposeA = false, bool transposeB = false)
    {
        iossgemm(Ap, AN, AM, Bp, BN, BM, Cp, CN, CM, bs, transposeA, transposeB);
    }
}
#endif // UNITY_IOS
@ -0,0 +1,11 @@
|
||||||
|
fileFormatVersion: 2
|
||||||
|
guid: 75424b0c6afc14ea7a1debef68240d9e
|
||||||
|
MonoImporter:
|
||||||
|
externalObjects: {}
|
||||||
|
serializedVersion: 2
|
||||||
|
defaultReferences: []
|
||||||
|
executionOrder: 0
|
||||||
|
icon: {instanceID: 0}
|
||||||
|
userData:
|
||||||
|
assetBundleName:
|
||||||
|
assetBundleVariant:
|
|
@ -0,0 +1,15 @@
#import <Accelerate/Accelerate.h>

extern "C"
{
    void iossgemm(float* Ap, int AN, int AM,
                  float* Bp, int BN, int BM,
                  float* Cp, int CN, int CM,
                  int bs, bool transposeA, bool transposeB)
    {
        cblas_sgemm(CblasRowMajor, transposeA ? CblasTrans : CblasNoTrans,
                    transposeB ? CblasTrans : CblasNoTrans,
                    AN, BM, BN, 1.0f, Ap, AM, Bp, BM, 1.0f, Cp, CM);
    }

}
@ -0,0 +1,102 @@
|
||||||
|
fileFormatVersion: 2
|
||||||
|
guid: 100b08f95d9f349118f287b0170140d4
|
||||||
|
PluginImporter:
|
||||||
|
externalObjects: {}
|
||||||
|
serializedVersion: 2
|
||||||
|
iconMap: {}
|
||||||
|
executionOrder: {}
|
||||||
|
isPreloaded: 0
|
||||||
|
isOverridable: 0
|
||||||
|
platformData:
|
||||||
|
- first:
|
||||||
|
'': Any
|
||||||
|
second:
|
||||||
|
enabled: 0
|
||||||
|
settings:
|
||||||
|
Exclude Android: 1
|
||||||
|
Exclude Editor: 1
|
||||||
|
Exclude Linux: 1
|
||||||
|
Exclude Linux64: 1
|
||||||
|
Exclude LinuxUniversal: 1
|
||||||
|
Exclude OSXUniversal: 1
|
||||||
|
Exclude WebGL: 1
|
||||||
|
Exclude Win: 1
|
||||||
|
Exclude Win64: 1
|
||||||
|
Exclude iOS: 0
|
||||||
|
- first:
|
||||||
|
Android: Android
|
||||||
|
second:
|
||||||
|
enabled: 0
|
||||||
|
settings:
|
||||||
|
CPU: ARMv7
|
||||||
|
- first:
|
||||||
|
Any:
|
||||||
|
second:
|
||||||
|
enabled: 0
|
||||||
|
settings: {}
|
||||||
|
- first:
|
||||||
|
Editor: Editor
|
||||||
|
second:
|
||||||
|
enabled: 0
|
||||||
|
settings:
|
||||||
|
CPU: AnyCPU
|
||||||
|
DefaultValueInitialized: true
|
||||||
|
OS: AnyOS
|
||||||
|
- first:
|
||||||
|
Facebook: Win
|
||||||
|
second:
|
||||||
|
enabled: 0
|
||||||
|
settings:
|
||||||
|
CPU: AnyCPU
|
||||||
|
- first:
|
||||||
|
Facebook: Win64
|
||||||
|
second:
|
||||||
|
enabled: 0
|
||||||
|
settings:
|
||||||
|
CPU: AnyCPU
|
||||||
|
- first:
|
||||||
|
Standalone: Linux
|
||||||
|
second:
|
||||||
|
enabled: 0
|
||||||
|
settings:
|
||||||
|
CPU: x86
|
||||||
|
- first:
|
||||||
|
Standalone: Linux64
|
||||||
|
second:
|
||||||
|
enabled: 0
|
||||||
|
settings:
|
||||||
|
CPU: x86_64
|
||||||
|
- first:
|
||||||
|
Standalone: OSXUniversal
|
||||||
|
second:
|
||||||
|
enabled: 0
|
||||||
|
settings:
|
||||||
|
CPU: AnyCPU
|
||||||
|
- first:
|
||||||
|
Standalone: Win
|
||||||
|
second:
|
||||||
|
enabled: 0
|
||||||
|
settings:
|
||||||
|
CPU: AnyCPU
|
||||||
|
- first:
|
||||||
|
Standalone: Win64
|
||||||
|
second:
|
||||||
|
enabled: 0
|
||||||
|
settings:
|
||||||
|
CPU: AnyCPU
|
||||||
|
- first:
|
||||||
|
iPhone: iOS
|
||||||
|
second:
|
||||||
|
enabled: 1
|
||||||
|
settings:
|
||||||
|
AddToEmbeddedBinaries: false
|
||||||
|
CompileFlags:
|
||||||
|
FrameworkDependencies: Accelerate;
|
||||||
|
- first:
|
||||||
|
tvOS: tvOS
|
||||||
|
second:
|
||||||
|
enabled: 1
|
||||||
|
settings: {}
|
||||||
|
userData:
|
||||||
|
assetBundleName:
|
||||||
|
assetBundleVariant:
|
|
@ -0,0 +1,8 @@
|
||||||
|
fileFormatVersion: 2
|
||||||
|
guid: 264a957219ea041c58af860601fe1881
|
||||||
|
folderAsset: yes
|
||||||
|
DefaultImporter:
|
||||||
|
externalObjects: {}
|
||||||
|
userData:
|
||||||
|
assetBundleName:
|
||||||
|
assetBundleVariant:
|
|
@ -0,0 +1,679 @@
|
||||||
|
#pragma kernel Relu
|
||||||
|
#pragma kernel Relu_CNyx
|
||||||
|
#pragma kernel Relu_Nyxc
|
||||||
|
#pragma kernel Relu6
|
||||||
|
#pragma kernel Relu6_CNyx
|
||||||
|
#pragma kernel Relu6_Nyxc
|
||||||
|
#pragma kernel Tanh
|
||||||
|
#pragma kernel Tanh_CNyx
|
||||||
|
#pragma kernel Tanh_Nyxc
|
||||||
|
#pragma kernel Swish
|
||||||
|
#pragma kernel Swish_CNyx
|
||||||
|
#pragma kernel Swish_Nyxc
|
||||||
|
#pragma kernel Sigmoid
|
||||||
|
#pragma kernel Sigmoid_CNyx
|
||||||
|
#pragma kernel Sigmoid_Nyxc
|
||||||
|
#pragma kernel Elu
|
||||||
|
#pragma kernel Elu_CNyx
|
||||||
|
#pragma kernel Elu_Nyxc
|
||||||
|
#pragma kernel LeakyRelu
|
||||||
|
#pragma kernel LeakyRelu_CNyx
|
||||||
|
#pragma kernel LeakyRelu_Nyxc
|
||||||
|
#pragma kernel Exp
|
||||||
|
#pragma kernel Exp_CNyx
|
||||||
|
#pragma kernel Exp_Nyxc
|
||||||
|
#pragma kernel Pow
|
||||||
|
#pragma kernel Pow_CNyx
|
||||||
|
#pragma kernel Pow_Nyxc
|
||||||
|
#pragma kernel Softmax
|
||||||
|
|
||||||
|
#include "Tensor.cginc"
|
||||||
|
|
||||||
|
TENSOR_DECL(X)
|
||||||
|
TENSOR_DECL_RW(O)
|
||||||
|
|
||||||
|
float _Alpha;
|
||||||
|
|
||||||
|
float relu(float v)
|
||||||
|
{
|
||||||
|
return 0.5f * (v + abs(v));
|
||||||
|
}
|
||||||
|
|
||||||
|
float relu6(float v)
|
||||||
|
{
|
||||||
|
return min(max(0, v), 6);
|
||||||
|
}
|
||||||
|
|
||||||
|
float swish(float v)
|
||||||
|
{
|
||||||
|
return v / (1.f + exp(-v));
|
||||||
|
}
|
||||||
|
|
||||||
|
float sigmoid(float v)
|
||||||
|
{
|
||||||
|
return 1.f / (1.f + exp(-v));
|
||||||
|
}
|
||||||
|
|
||||||
|
float elu(float v)
|
||||||
|
{
|
||||||
|
if (v <= 0)
|
||||||
|
v = _Alpha * (exp(v) - 1);
|
||||||
|
return v;
|
||||||
|
}
|
||||||
|
|
||||||
|
float lrelu(float v)
|
||||||
|
{
|
||||||
|
return max(v, _Alpha * v);
|
||||||
|
}
|
||||||
|
|
||||||
|
float signed_pow(float f, float e)
|
||||||
|
{
|
||||||
|
// handle negative f
|
||||||
|
float v = pow(abs(f), e);
|
||||||
|
float s = (e % 2 == 1) ?
|
||||||
|
sign(f): // exponent is odd => sign(f) * pow(abs(f), e)
|
||||||
|
1; // exponent is even => pow(abs(f), e)
|
||||||
|
return v * s;
|
||||||
|
}
|
||||||
|
|
||||||
|
NUMTHREADS((4,8,8), (4,8,4), (4,4,4))
|
||||||
|
void Relu(uint3 dispatchThreadID : SV_DispatchThreadID)
|
||||||
|
{
|
||||||
|
DISPATCH_ARGS(O.channels, O.width, O.height);
|
||||||
|
TENSOR_ARGS2(X, O);
|
||||||
|
|
||||||
|
uint c = dispatchThreadID.x;
|
||||||
|
uint x = dispatchThreadID.y;
|
||||||
|
uint y = dispatchThreadID.z;
|
||||||
|
|
||||||
|
if (c >= O.channels) return;
|
||||||
|
if (x >= O.width) return;
|
||||||
|
if (y >= O.height) return;
|
||||||
|
|
||||||
|
for (uint n = 0; n < X.batch; ++n)
|
||||||
|
{
|
||||||
|
float v = X.Get(n, y, x, c);
|
||||||
|
v = relu(v);
|
||||||
|
O.Set(n, y, x, c, v);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
NUMTHREADS((4,8,8), (4,8,4), (4,4,4))
|
||||||
|
void Relu6(uint3 dispatchThreadID : SV_DispatchThreadID)
|
||||||
|
{
|
||||||
|
DISPATCH_ARGS(O.channels, O.width, O.height);
|
||||||
|
TENSOR_ARGS2(X, O);
|
||||||
|
|
||||||
|
uint c = dispatchThreadID.x;
|
||||||
|
uint x = dispatchThreadID.y;
|
||||||
|
uint y = dispatchThreadID.z;
|
||||||
|
|
||||||
|
if (c >= O.channels) return;
|
||||||
|
if (x >= O.width) return;
|
||||||
|
if (y >= O.height) return;
|
||||||
|
|
||||||
|
for (uint n = 0; n < X.batch; ++n)
|
||||||
|
{
|
||||||
|
float v = X.Get(n, y, x, c);
|
||||||
|
v = relu6(v);
|
||||||
|
O.Set(n, y, x, c, v);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
NUMTHREADS((4,8,8), (4,8,4), (4,4,4))
|
||||||
|
void Tanh(uint3 dispatchThreadID : SV_DispatchThreadID)
|
||||||
|
{
|
||||||
|
DISPATCH_ARGS(O.channels, O.width, O.height);
|
||||||
|
TENSOR_ARGS2(X, O);
|
||||||
|
|
||||||
|
uint c = dispatchThreadID.x; uint x = dispatchThreadID.y; uint y = dispatchThreadID.z;
|
||||||
|
if (c >= O.channels) return; if (x >= O.width) return; if (y >= O.height) return;
|
||||||
|
|
||||||
|
for (uint n = 0; n < X.batch; ++n)
|
||||||
|
{
|
||||||
|
float v = X.Get(n, y, x, c);
|
||||||
|
v = tanh(v);
|
||||||
|
O.Set(n, y, x, c, v);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
NUMTHREADS((4,8,8), (4,8,4), (4,4,4))
|
||||||
|
void Sigmoid(uint3 dispatchThreadID : SV_DispatchThreadID)
|
||||||
|
{
|
||||||
|
DISPATCH_ARGS(O.channels, O.width, O.height);
|
||||||
|
TENSOR_ARGS2(X, O);
|
||||||
|
|
||||||
|
uint c = dispatchThreadID.x;
|
||||||
|
uint x = dispatchThreadID.y;
|
||||||
|
uint y = dispatchThreadID.z;
|
||||||
|
|
||||||
|
if (c >= O.channels) return;
|
||||||
|
if (x >= O.width) return;
|
||||||
|
if (y >= O.height) return;
|
||||||
|
|
||||||
|
for (uint n = 0; n < X.batch; ++n)
|
||||||
|
{
|
||||||
|
float v = X.Get(n, y, x, c);
|
||||||
|
v = sigmoid(v);
|
||||||
|
O.Set(n, y, x, c, v);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
NUMTHREADS((4,8,8), (4,8,4), (4,4,4))
|
||||||
|
void Swish(uint3 dispatchThreadID : SV_DispatchThreadID)
|
||||||
|
{
|
||||||
|
DISPATCH_ARGS(O.channels, O.width, O.height);
|
||||||
|
TENSOR_ARGS2(X, O);
|
||||||
|
|
||||||
|
uint c = dispatchThreadID.x;
|
||||||
|
uint x = dispatchThreadID.y;
|
||||||
|
uint y = dispatchThreadID.z;
|
||||||
|
|
||||||
|
if (c >= O.channels) return;
|
||||||
|
if (x >= O.width) return;
|
||||||
|
if (y >= O.height) return;
|
||||||
|
|
||||||
|
for (uint n = 0; n < X.batch; ++n)
|
||||||
|
{
|
||||||
|
float v = X.Get(n, y, x, c);
|
||||||
|
v = swish(v);
|
||||||
|
O.Set(n, y, x, c, v);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
NUMTHREADS((4,8,8), (4,8,4), (4,4,4))
|
||||||
|
void Elu(uint3 dispatchThreadID : SV_DispatchThreadID)
|
||||||
|
{
|
||||||
|
DISPATCH_ARGS(O.channels, O.width, O.height);
|
||||||
|
TENSOR_ARGS2(X, O);
|
||||||
|
|
||||||
|
uint c = dispatchThreadID.x; uint x = dispatchThreadID.y; uint y = dispatchThreadID.z;
|
||||||
|
if (c >= O.channels) return; if (x >= O.width) return; if (y >= O.height) return;
|
||||||
|
|
||||||
|
for (uint n = 0; n < X.batch; ++n)
|
||||||
|
{
|
||||||
|
float v = X.Get(n, y, x, c);
|
||||||
|
v = elu(v);
|
||||||
|
O.Set(n, y, x, c, v);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
NUMTHREADS((4,8,8), (4,8,4), (4,4,4))
|
||||||
|
void LeakyRelu(uint3 dispatchThreadID : SV_DispatchThreadID)
|
||||||
|
{
|
||||||
|
DISPATCH_ARGS(O.channels, O.width, O.height);
|
||||||
|
TENSOR_ARGS2(X, O);
|
||||||
|
|
||||||
|
uint c = dispatchThreadID.x; uint x = dispatchThreadID.y; uint y = dispatchThreadID.z;
|
||||||
|
if (c >= O.channels) return; if (x >= O.width) return; if (y >= O.height) return;
|
||||||
|
|
||||||
|
for (uint n = 0; n < X.batch; ++n)
|
||||||
|
{
|
||||||
|
float v = X.Get(n, y, x, c);
|
||||||
|
v = lrelu(v);
|
||||||
|
O.Set(n, y, x, c, v);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
NUMTHREADS((4,8,8), (4,8,4), (4,4,4))
|
||||||
|
void Exp(uint3 dispatchThreadID : SV_DispatchThreadID)
|
||||||
|
{
|
||||||
|
DISPATCH_ARGS(O.channels, O.width, O.height);
|
||||||
|
TENSOR_ARGS2(X, O);
|
||||||
|
|
||||||
|
uint c = dispatchThreadID.x; uint x = dispatchThreadID.y; uint y = dispatchThreadID.z;
|
||||||
|
if (c >= O.channels) return; if (x >= O.width) return; if (y >= O.height) return;
|
||||||
|
|
||||||
|
for (uint n = 0; n < X.batch; ++n)
|
||||||
|
{
|
||||||
|
float v = X.Get(n, y, x, c);
|
||||||
|
v = exp(v);
|
||||||
|
O.Set(n, y, x, c, v);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
NUMTHREADS((4,8,8), (4,8,4), (4,4,4))
|
||||||
|
void Pow(uint3 dispatchThreadID : SV_DispatchThreadID)
|
||||||
|
{
|
||||||
|
DISPATCH_ARGS(O.channels, O.width, O.height);
|
||||||
|
TENSOR_ARGS2(X, O);
|
||||||
|
|
||||||
|
uint c = dispatchThreadID.x; uint x = dispatchThreadID.y; uint y = dispatchThreadID.z;
|
||||||
|
if (c >= O.channels) return; if (x >= O.width) return; if (y >= O.height) return;
|
||||||
|
|
||||||
|
for (uint n = 0; n < X.batch; ++n)
|
||||||
|
{
|
||||||
|
float v = X.Get(n, y, x, c);
|
||||||
|
v = signed_pow(v, _Alpha);
|
||||||
|
O.Set(n, y, x, c, v);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
NUMTHREADS((16,16,1), (16,8,1), (16,4,1))
|
||||||
|
void Relu_CNyx(uint3 dispatchThreadID : SV_DispatchThreadID)
|
||||||
|
{
|
||||||
|
DISPATCH_ARGS(O.channels, O.batch * O.height * O.width, 1);
|
||||||
|
TENSOR_ARGS2(X, O);
|
||||||
|
|
||||||
|
uint c = dispatchThreadID.x;
|
||||||
|
uint nyx = dispatchThreadID.y;
|
||||||
|
|
||||||
|
uint x = nyx % X.width;
|
||||||
|
uint ny = nyx / X.width;
|
||||||
|
uint y = ny % X.height;
|
||||||
|
uint n = ny / X.height;
|
||||||
|
|
||||||
|
if (c >= X.channels) return;
|
||||||
|
if (n >= X.batch) return;
|
||||||
|
|
||||||
|
float v = X.Get(n, y, x, c);
|
||||||
|
v = relu(v);
|
||||||
|
O.Set(n, y, x, c, v);
|
||||||
|
}
|
||||||
|
|
||||||
|
NUMTHREADS((512,1,1), (128,1,1), (64,1,1))
|
||||||
|
void Relu_Nyxc(uint3 dispatchThreadID : SV_DispatchThreadID)
|
||||||
|
{
|
||||||
|
DISPATCH_ARGS(O.batch * O.height * O.width * O.channels, 1, 1)
|
||||||
|
TENSOR_ARGS2(X, O);
|
||||||
|
|
||||||
|
uint nyxc = dispatchThreadID.x;
|
||||||
|
|
||||||
|
uint c = nyxc % X.channels;
|
||||||
|
uint nyx = nyxc / X.channels;
|
||||||
|
uint x = nyx % X.width;
|
||||||
|
uint ny = nyx / X.width;
|
||||||
|
uint y = ny % X.height;
|
||||||
|
uint n = ny / X.height;
|
||||||
|
|
||||||
|
if (n >= X.batch) return;
|
||||||
|
|
||||||
|
float v = X.Get(n, y, x, c);
|
||||||
|
v = relu(v);
|
||||||
|
O.Set(n, y, x, c, v);
|
||||||
|
}
|
||||||
|
|
||||||
|
NUMTHREADS((16,16,1), (16,8,1), (16,4,1))
|
||||||
|
void Relu6_CNyx(uint3 dispatchThreadID : SV_DispatchThreadID)
|
||||||
|
{
|
||||||
|
DISPATCH_ARGS(O.channels, O.batch * O.height * O.width, 1);
|
||||||
|
TENSOR_ARGS2(X, O);
|
||||||
|
|
||||||
|
uint c = dispatchThreadID.x;
|
||||||
|
uint nyx = dispatchThreadID.y;
|
||||||
|
|
||||||
|
uint x = nyx % X.width;
|
||||||
|
uint ny = nyx / X.width;
|
||||||
|
uint y = ny % X.height;
|
||||||
|
uint n = ny / X.height;
|
||||||
|
|
||||||
|
if (c >= X.channels) return;
|
||||||
|
if (n >= X.batch) return;
|
||||||
|
|
||||||
|
float v = X.Get(n, y, x, c);
|
||||||
|
v = relu6(v);
|
||||||
|
O.Set(n, y, x, c, v);
|
||||||
|
}
|
||||||
|
|
||||||
|
NUMTHREADS((512,1,1), (128,1,1), (64,1,1))
|
||||||
|
void Relu6_Nyxc(uint3 dispatchThreadID : SV_DispatchThreadID)
|
||||||
|
{
|
||||||
|
DISPATCH_ARGS(O.batch * O.height * O.width * O.channels, 1, 1)
|
||||||
|
TENSOR_ARGS2(X, O);
|
||||||
|
|
||||||
|
uint nyxc = dispatchThreadID.x;
|
||||||
|
|
||||||
|
uint c = nyxc % X.channels;
|
||||||
|
uint nyx = nyxc / X.channels;
|
||||||
|
uint x = nyx % X.width;
|
||||||
|
uint ny = nyx / X.width;
|
||||||
|
uint y = ny % X.height;
|
||||||
|
uint n = ny / X.height;
|
||||||
|
|
||||||
|
if (n >= X.batch) return;
|
||||||
|
|
||||||
|
float v = X.Get(n, y, x, c);
|
||||||
|
v = relu6(v);
|
||||||
|
O.Set(n, y, x, c, v);
|
||||||
|
}
|
||||||
|
|
||||||
|
NUMTHREADS((16,16,1), (16,8,1), (16,4,1))
|
||||||
|
void Tanh_CNyx(uint3 dispatchThreadID : SV_DispatchThreadID)
|
||||||
|
{
|
||||||
|
DISPATCH_ARGS(O.channels, O.batch * O.height * O.width, 1);
|
||||||
|
TENSOR_ARGS2(X, O);
|
||||||
|
|
||||||
|
uint c = dispatchThreadID.x;
|
||||||
|
uint nyx = dispatchThreadID.y;
|
||||||
|
|
||||||
|
uint x = nyx % X.width;
|
||||||
|
uint ny = nyx / X.width;
|
||||||
|
uint y = ny % X.height;
|
||||||
|
uint n = ny / X.height;
|
||||||
|
|
||||||
|
if (c >= X.channels) return;
|
||||||
|
if (n >= X.batch) return;
|
||||||
|
|
||||||
|
float v = X.Get(n, y, x, c);
|
||||||
|
v = tanh(v);
|
||||||
|
O.Set(n, y, x, c, v);
|
||||||
|
}
|
||||||
|
|
||||||
|
NUMTHREADS((512,1,1), (128,1,1), (64,1,1))
|
||||||
|
void Tanh_Nyxc(uint3 dispatchThreadID : SV_DispatchThreadID)
|
||||||
|
{
|
||||||
|
DISPATCH_ARGS(O.batch * O.height * O.width * O.channels, 1, 1)
|
||||||
|
TENSOR_ARGS2(X, O);
|
||||||
|
|
||||||
|
uint nyxc = dispatchThreadID.x;
|
||||||
|
|
||||||
|
uint c = nyxc % X.channels;
|
||||||
|
uint nyx = nyxc / X.channels;
|
||||||
|
uint x = nyx % X.width;
|
||||||
|
uint ny = nyx / X.width;
|
||||||
|
uint y = ny % X.height;
|
||||||
|
uint n = ny / X.height;
|
||||||
|
|
||||||
|
if (n >= X.batch) return;
|
||||||
|
|
||||||
|
float v = X.Get(n, y, x, c);
|
||||||
|
v = tanh(v);
|
||||||
|
O.Set(n, y, x, c, v);
|
||||||
|
}
|
||||||
|
|
||||||
|
NUMTHREADS((16,16,1), (16,8,1), (16,4,1))
|
||||||
|
void Sigmoid_CNyx(uint3 dispatchThreadID : SV_DispatchThreadID)
|
||||||
|
{
|
||||||
|
DISPATCH_ARGS(O.channels, O.batch * O.height * O.width, 1);
|
||||||
|
TENSOR_ARGS2(X, O);
|
||||||
|
|
||||||
|
uint c = dispatchThreadID.x;
|
||||||
|
uint nyx = dispatchThreadID.y;
|
||||||
|
|
||||||
|
uint x = nyx % X.width;
|
||||||
|
uint ny = nyx / X.width;
|
||||||
|
uint y = ny % X.height;
|
||||||
|
uint n = ny / X.height;
|
||||||
|
|
||||||
|
if (c >= X.channels) return;
|
||||||
|
if (n >= X.batch) return;
|
||||||
|
|
||||||
|
float v = X.Get(n, y, x, c);
|
||||||
|
v = sigmoid(v);
|
||||||
|
O.Set(n, y, x, c, v);
|
||||||
|
}
|
||||||
|
|
||||||
|
NUMTHREADS((512,1,1), (128,1,1), (64,1,1))
|
||||||
|
void Sigmoid_Nyxc(uint3 dispatchThreadID : SV_DispatchThreadID)
|
||||||
|
{
|
||||||
|
DISPATCH_ARGS(O.batch * O.height * O.width * O.channels, 1, 1)
|
||||||
|
TENSOR_ARGS2(X, O);
|
||||||
|
|
||||||
|
uint nyxc = dispatchThreadID.x;
|
||||||
|
|
||||||
|
uint c = nyxc % X.channels;
|
||||||
|
uint nyx = nyxc / X.channels;
|
||||||
|
uint x = nyx % X.width;
|
||||||
|
uint ny = nyx / X.width;
|
||||||
|
uint y = ny % X.height;
|
||||||
|
uint n = ny / X.height;
|
||||||
|
|
||||||
|
if (n >= X.batch) return;
|
||||||
|
|
||||||
|
float v = X.Get(n, y, x, c);
|
||||||
|
v = sigmoid(v);
|
||||||
|
O.Set(n, y, x, c, v);
|
||||||
|
}
|
||||||
|
|
||||||
|
NUMTHREADS((16,16,1), (16,8,1), (16,4,1))
|
||||||
|
void Swish_CNyx(uint3 dispatchThreadID : SV_DispatchThreadID)
|
||||||
|
{
|
||||||
|
DISPATCH_ARGS(O.channels, O.batch * O.height * O.width, 1);
|
||||||
|
TENSOR_ARGS2(X, O);
|
||||||
|
|
||||||
|
uint c = dispatchThreadID.x;
|
||||||
|
uint nyx = dispatchThreadID.y;
|
||||||
|
|
||||||
|
uint x = nyx % X.width;
|
||||||
|
uint ny = nyx / X.width;
|
||||||
|
uint y = ny % X.height;
|
||||||
|
uint n = ny / X.height;
|
||||||
|
|
||||||
|
if (c >= X.channels) return;
|
||||||
|
if (n >= X.batch) return;
|
||||||
|
|
||||||
|
float v = X.Get(n, y, x, c);
|
||||||
|
v = swish(v);
|
||||||
|
O.Set(n, y, x, c, v);
|
||||||
|
}
|
||||||
|
|
||||||
|
NUMTHREADS((512,1,1), (128,1,1), (64,1,1))
|
||||||
|
void Swish_Nyxc(uint3 dispatchThreadID : SV_DispatchThreadID)
|
||||||
|
{
|
||||||
|
DISPATCH_ARGS(O.batch * O.height * O.width * O.channels, 1, 1)
|
||||||
|
TENSOR_ARGS2(X, O);
|
||||||
|
|
||||||
|
uint nyxc = dispatchThreadID.x;
|
||||||
|
|
||||||
|
uint c = nyxc % X.channels;
|
||||||
|
uint nyx = nyxc / X.channels;
|
||||||
|
uint x = nyx % X.width;
|
||||||
|
uint ny = nyx / X.width;
|
||||||
|
uint y = ny % X.height;
|
||||||
|
uint n = ny / X.height;
|
||||||
|
|
||||||
|
if (n >= X.batch) return;
|
||||||
|
|
||||||
|
float v = X.Get(n, y, x, c);
|
||||||
|
v = swish(v);
|
||||||
|
O.Set(n, y, x, c, v);
|
||||||
|
}
|
||||||
|
|
||||||
|
NUMTHREADS((16,16,1), (16,8,1), (16,4,1))
|
||||||
|
void Elu_CNyx(uint3 dispatchThreadID : SV_DispatchThreadID)
|
||||||
|
{
|
||||||
|
DISPATCH_ARGS(O.channels, O.batch * O.height * O.width, 1);
|
||||||
|
TENSOR_ARGS2(X, O);
|
||||||
|
|
||||||
|
uint c = dispatchThreadID.x;
|
||||||
|
uint nyx = dispatchThreadID.y;
|
||||||
|
|
||||||
|
uint x = nyx % X.width;
|
||||||
|
uint ny = nyx / X.width;
|
||||||
|
uint y = ny % X.height;
|
||||||
|
uint n = ny / X.height;
|
||||||
|
|
||||||
|
if (c >= X.channels) return;
|
||||||
|
if (n >= X.batch) return;
|
||||||
|
|
||||||
|
float v = X.Get(n, y, x, c);
|
||||||
|
v = elu(v);
|
||||||
|
O.Set(n, y, x, c, v);
|
||||||
|
}
|
||||||
|
|
||||||
|
NUMTHREADS((512,1,1), (128,1,1), (64,1,1))
|
||||||
|
void Elu_Nyxc(uint3 dispatchThreadID : SV_DispatchThreadID)
|
||||||
|
{
|
||||||
|
DISPATCH_ARGS(O.batch * O.height * O.width * O.channels, 1, 1)
|
||||||
|
TENSOR_ARGS2(X, O);
|
||||||
|
|
||||||
|
uint nyxc = dispatchThreadID.x;
|
||||||
|
|
||||||
|
uint c = nyxc % X.channels;
|
||||||
|
uint nyx = nyxc / X.channels;
|
||||||
|
uint x = nyx % X.width;
|
||||||
|
uint ny = nyx / X.width;
|
||||||
|
uint y = ny % X.height;
|
||||||
|
uint n = ny / X.height;
|
||||||
|
|
||||||
|
if (n >= X.batch) return;
|
||||||
|
|
||||||
|
float v = X.Get(n, y, x, c);
|
||||||
|
v = elu(v);
|
||||||
|
O.Set(n, y, x, c, v);
|
||||||
|
}
|
||||||
|
|
||||||
|
NUMTHREADS((16,16,1), (16,8,1), (16,4,1))
|
||||||
|
void LeakyRelu_CNyx(uint3 dispatchThreadID : SV_DispatchThreadID)
|
||||||
|
{
|
||||||
|
DISPATCH_ARGS(O.channels, O.batch * O.height * O.width, 1);
|
||||||
|
TENSOR_ARGS2(X, O);
|
||||||
|
|
||||||
|
uint c = dispatchThreadID.x;
|
||||||
|
uint nyx = dispatchThreadID.y;
|
||||||
|
|
||||||
|
uint x = nyx % X.width;
|
||||||
|
uint ny = nyx / X.width;
|
||||||
|
uint y = ny % X.height;
|
||||||
|
uint n = ny / X.height;
|
||||||
|
|
||||||
|
if (c >= X.channels) return;
|
||||||
|
if (n >= X.batch) return;
|
||||||
|
|
||||||
|
float v = X.Get(n, y, x, c);
|
||||||
|
v = lrelu(v);
|
||||||
|
O.Set(n, y, x, c, v);
|
||||||
|
}
|
||||||
|
|
||||||
|
NUMTHREADS((512,1,1), (128,1,1), (64,1,1))
|
||||||
|
void LeakyRelu_Nyxc(uint3 dispatchThreadID : SV_DispatchThreadID)
|
||||||
|
{
|
||||||
|
DISPATCH_ARGS(O.batch * O.height * O.width * O.channels, 1, 1)
|
||||||
|
TENSOR_ARGS2(X, O);
|
||||||
|
|
||||||
|
uint nyxc = dispatchThreadID.x;
|
||||||
|
|
||||||
|
uint c = nyxc % X.channels;
|
||||||
|
uint nyx = nyxc / X.channels;
|
||||||
|
uint x = nyx % X.width;
|
||||||
|
uint ny = nyx / X.width;
|
||||||
|
uint y = ny % X.height;
|
||||||
|
uint n = ny / X.height;
|
||||||
|
|
||||||
|
if (n >= X.batch) return;
|
||||||
|
|
||||||
|
float v = X.Get(n, y, x, c);
|
||||||
|
v = lrelu(v);
|
||||||
|
O.Set(n, y, x, c, v);
|
||||||
|
}
|
||||||
|
|
||||||
|
NUMTHREADS((16,16,1), (16,8,1), (16,4,1))
|
||||||
|
void Exp_CNyx(uint3 dispatchThreadID : SV_DispatchThreadID)
|
||||||
|
{
|
||||||
|
DISPATCH_ARGS(O.channels, O.batch * O.height * O.width, 1);
|
||||||
|
TENSOR_ARGS2(X, O);
|
||||||
|
|
||||||
|
uint c = dispatchThreadID.x;
|
||||||
|
uint nyx = dispatchThreadID.y;
|
||||||
|
|
||||||
|
uint x = nyx % X.width;
|
||||||
|
uint ny = nyx / X.width;
|
||||||
|
uint y = ny % X.height;
|
||||||
|
uint n = ny / X.height;
|
||||||
|
|
||||||
|
if (c >= X.channels) return;
|
||||||
|
if (n >= X.batch) return;
|
||||||
|
|
||||||
|
float v = X.Get(n, y, x, c);
|
||||||
|
v = exp(v);
|
||||||
|
O.Set(n, y, x, c, v);
|
||||||
|
}
|
||||||
|
|
||||||
|
NUMTHREADS((512,1,1), (128,1,1), (64,1,1))
|
||||||
|
void Exp_Nyxc(uint3 dispatchThreadID : SV_DispatchThreadID)
|
||||||
|
{
|
||||||
|
DISPATCH_ARGS(O.batch * O.height * O.width * O.channels, 1, 1)
|
||||||
|
TENSOR_ARGS2(X, O);
|
||||||
|
|
||||||
|
uint nyxc = dispatchThreadID.x;
|
||||||
|
|
||||||
|
uint c = nyxc % X.channels;
|
||||||
|
uint nyx = nyxc / X.channels;
|
||||||
|
uint x = nyx % X.width;
|
||||||
|
uint ny = nyx / X.width;
|
||||||
|
uint y = ny % X.height;
|
||||||
|
uint n = ny / X.height;
|
||||||
|
|
||||||
|
if (n >= X.batch) return;
|
||||||
|
|
||||||
|
float v = X.Get(n, y, x, c);
|
||||||
|
v = exp(v);
|
||||||
|
O.Set(n, y, x, c, v);
|
||||||
|
}
|
||||||
|
|
||||||
|
NUMTHREADS((16,16,1), (16,8,1), (16,4,1))
|
||||||
|
void Pow_CNyx(uint3 dispatchThreadID : SV_DispatchThreadID)
|
||||||
|
{
|
||||||
|
DISPATCH_ARGS(O.channels, O.batch * O.height * O.width, 1);
|
||||||
|
TENSOR_ARGS2(X, O);
|
||||||
|
|
||||||
|
uint c = dispatchThreadID.x;
|
||||||
|
uint nyx = dispatchThreadID.y;
|
||||||
|
|
||||||
|
uint x = nyx % X.width;
|
||||||
|
uint ny = nyx / X.width;
|
||||||
|
uint y = ny % X.height;
|
||||||
|
uint n = ny / X.height;
|
||||||
|
|
||||||
|
if (c >= X.channels) return;
|
||||||
|
if (n >= X.batch) return;
|
||||||
|
|
||||||
|
float v = X.Get(n, y, x, c);
|
||||||
|
v = signed_pow(v, _Alpha);
|
||||||
|
O.Set(n, y, x, c, v);
|
||||||
|
}
|
||||||
|
|
||||||
|
NUMTHREADS((512,1,1), (128,1,1), (64,1,1))
|
||||||
|
void Pow_Nyxc(uint3 dispatchThreadID : SV_DispatchThreadID)
|
||||||
|
{
|
||||||
|
DISPATCH_ARGS(O.batch * O.height * O.width * O.channels, 1, 1)
|
||||||
|
TENSOR_ARGS2(X, O);
|
||||||
|
|
||||||
|
uint nyxc = dispatchThreadID.x;
|
||||||
|
|
||||||
|
uint c = nyxc % X.channels;
|
||||||
|
uint nyx = nyxc / X.channels;
|
||||||
|
uint x = nyx % X.width;
|
||||||
|
uint ny = nyx / X.width;
|
||||||
|
uint y = ny % X.height;
|
||||||
|
uint n = ny / X.height;
|
||||||
|
|
||||||
|
if (n >= X.batch) return;
|
||||||
|
|
||||||
|
float v = X.Get(n, y, x, c);
|
||||||
|
v = signed_pow(v, _Alpha);
|
||||||
|
O.Set(n, y, x, c, v);
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
NUMTHREADS((64,4,1), (64,2,1), (64,1,1))
|
||||||
|
void Softmax(uint3 dispatchThreadID : SV_DispatchThreadID)
|
||||||
|
{
|
||||||
|
DISPATCH_ARGS(O.flatWidth, O.flatHeight, 1);
|
||||||
|
TENSOR_ARGS2(X, O);
|
||||||
|
|
||||||
|
uint x = dispatchThreadID.x;
|
||||||
|
uint y = dispatchThreadID.y;
|
||||||
|
|
||||||
|
if (x >= O.GetFlatWidth()) return;
|
||||||
|
if (y >= O.GetFlatHeight()) return;
|
||||||
|
|
||||||
|
float maxV = -FLT_MAX;
|
||||||
|
for (uint i = 0; i < X.GetFlatWidth(); ++i)
|
||||||
|
{
|
||||||
|
float v = X.Get(y, i);
|
||||||
|
if (v > maxV)
|
||||||
|
maxV = v;
|
||||||
|
}
|
||||||
|
|
||||||
|
float acc = 0.0f;
|
||||||
|
for (i = 0; i < X.GetFlatWidth(); ++i)
|
||||||
|
{
|
||||||
|
float v = X.Get(y, i);
|
||||||
|
acc += exp(v - maxV);
|
||||||
|
}
|
||||||
|
|
||||||
|
float v = X.Get(y, x);
|
||||||
|
v = exp(v - maxV) / acc;
|
||||||
|
O.Set(y, x, v);
|
||||||
|
}
|
|
@ -0,0 +1,9 @@
|
||||||
|
fileFormatVersion: 2
|
||||||
|
guid: fdc94044b2f234c0fa80ada3771a2ae7
|
||||||
|
timeCreated: 1495527718
|
||||||
|
licenseType: Pro
|
||||||
|
ComputeShaderImporter:
|
||||||
|
currentAPIMask: 196608
|
||||||
|
userData:
|
||||||
|
assetBundleName:
|
||||||
|
assetBundleVariant:
|
|
@ -0,0 +1,885 @@
|
||||||
|
#pragma kernel Dense
|
||||||
|
#pragma kernel Conv2D
|
||||||
|
#pragma kernel DepthwiseConv2D
|
||||||
|
#pragma kernel Conv2DTrans
|
||||||
|
#pragma kernel Upsample2D
|
||||||
|
#pragma kernel Unstride2D
|
||||||
|
#pragma kernel MaxPool2D
|
||||||
|
#pragma kernel AvgPool2D
|
||||||
|
#pragma kernel GlobalMaxPool2D
|
||||||
|
#pragma kernel GlobalAvgPool2D
|
||||||
|
#pragma kernel ScaleBias
|
||||||
|
#pragma kernel InstanceNorm
|
||||||
|
#pragma kernel Dropout
|
||||||
|
#pragma kernel Relu
|
||||||
|
#pragma kernel Swish
|
||||||
|
#pragma kernel Softmax
|
||||||
|
#pragma kernel Tanh
|
||||||
|
#pragma kernel Sigmoid
|
||||||
|
#pragma kernel Relu6
|
||||||
|
#pragma kernel Elu
|
||||||
|
#pragma kernel LeakyRelu
|
||||||
|
#pragma kernel Exp
|
||||||
|
#pragma kernel Pow
|
||||||
|
#pragma kernel Copy
|
||||||
|
#pragma kernel BroadcastAdd
|
||||||
|
#pragma kernel BroadcastSub
|
||||||
|
#pragma kernel BroadcastMul
|
||||||
|
#pragma kernel BroadcastDiv
|
||||||
|
#pragma kernel BroadcastPow
|
||||||
|
#pragma kernel BroadcastMin
|
||||||
|
#pragma kernel BroadcastMax
|
||||||
|
#pragma kernel TextureToTensor
|
||||||
|
#pragma kernel TensorToTexture
|
||||||
|
|
||||||
|
#include "Tensor.cginc"
|
||||||
|
#include "Random.cginc"
|
||||||
|
|
||||||
|
TENSOR_DECL(X)
|
||||||
|
TENSOR_DECL(W)
|
||||||
|
TENSOR_DECL(K)
|
||||||
|
TENSOR_DECL(B)
|
||||||
|
TENSOR_DECL_RW(O)
|
||||||
|
|
||||||
|
uint4 _Pad;
|
||||||
|
uint4 _Pool;
|
||||||
|
uint4 _Stride;
|
||||||
|
float _Alpha;
|
||||||
|
float _Seed;
|
||||||
|
|
||||||
|
[numthreads(8,8,1)]
|
||||||
|
void Dense(uint3 dispatchThreadID : SV_DispatchThreadID)
|
||||||
|
{
|
||||||
|
DISPATCH_ARGS(O.flatWidth, O.flatHeight, 1);
|
||||||
|
TENSOR_ARGS4(X, W, B, O);
|
||||||
|
|
||||||
|
uint x = dispatchThreadID.x;
|
||||||
|
uint y = dispatchThreadID.y;
|
||||||
|
|
||||||
|
if (x >= O.GetFlatWidth()) return;
|
||||||
|
if (y >= O.GetFlatHeight()) return;
|
||||||
|
|
||||||
|
float acc = B.Get(x);
|
||||||
|
for (uint i = 0; i < X.GetFlatWidth(); ++i)
|
||||||
|
acc += X.Get(y, i) * W.Get(i, x);
|
||||||
|
|
||||||
|
O.Set(y, x, acc);
|
||||||
|
}
|
||||||
|
|
||||||
|
[numthreads(4,4,4)]
|
||||||
|
void Relu(uint3 dispatchThreadID : SV_DispatchThreadID)
|
||||||
|
{
|
||||||
|
DISPATCH_ARGS(O.channels, O.width, O.height);
|
||||||
|
TENSOR_ARGS2(X, O);
|
||||||
|
|
||||||
|
uint c = dispatchThreadID.x;
|
||||||
|
uint x = dispatchThreadID.y;
|
||||||
|
uint y = dispatchThreadID.z;
|
||||||
|
|
||||||
|
if (c >= O.channels) return;
|
||||||
|
if (x >= O.width) return;
|
||||||
|
if (y >= O.height) return;
|
||||||
|
|
||||||
|
for (uint n = 0; n < X.batch; ++n)
|
||||||
|
{
|
||||||
|
float v = X.Get(n, y, x, c);
|
||||||
|
v = 0.5f * (v + abs(v));
|
||||||
|
|
||||||
|
O.Set(n, y, x, c, v);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
[numthreads(4,4,4)]
|
||||||
|
void Swish(uint3 dispatchThreadID : SV_DispatchThreadID)
|
||||||
|
{
|
||||||
|
DISPATCH_ARGS(O.channels, O.width, O.height);
|
||||||
|
TENSOR_ARGS2(X, O);
|
||||||
|
|
||||||
|
uint c = dispatchThreadID.x;
|
||||||
|
uint x = dispatchThreadID.y;
|
||||||
|
uint y = dispatchThreadID.z;
|
||||||
|
|
||||||
|
if (c >= O.channels) return;
|
||||||
|
if (x >= O.width) return;
|
||||||
|
if (y >= O.height) return;
|
||||||
|
|
||||||
|
for (uint n = 0; n < X.batch; ++n)
|
||||||
|
{
|
||||||
|
float v = X.Get(n, y, x, c);
|
||||||
|
v = v / (1 + exp(-v));
|
||||||
|
O.Set(n, y, x, c, v);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
[numthreads(4,4,4)]
|
||||||
|
void Tanh(uint3 dispatchThreadID : SV_DispatchThreadID)
|
||||||
|
{
|
||||||
|
DISPATCH_ARGS(O.channels, O.width, O.height);
|
||||||
|
TENSOR_ARGS2(X, O);
|
||||||
|
|
||||||
|
uint c = dispatchThreadID.x; uint x = dispatchThreadID.y; uint y = dispatchThreadID.z;
|
||||||
|
if (c >= O.channels) return; if (x >= O.width) return; if (y >= O.height) return;
|
||||||
|
|
||||||
|
for (uint n = 0; n < X.batch; ++n)
|
||||||
|
{
|
||||||
|
float v = X.Get(n, y, x, c);
|
||||||
|
v = tanh(v);
|
||||||
|
O.Set(n, y, x, c, v);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
[numthreads(4,4,4)]
|
||||||
|
void Sigmoid(uint3 dispatchThreadID : SV_DispatchThreadID)
|
||||||
|
{
|
||||||
|
DISPATCH_ARGS(O.channels, O.width, O.height);
|
||||||
|
TENSOR_ARGS2(X, O);
|
||||||
|
|
||||||
|
uint c = dispatchThreadID.x; uint x = dispatchThreadID.y; uint y = dispatchThreadID.z;
|
||||||
|
if (c >= O.channels) return; if (x >= O.width) return; if (y >= O.height) return;
|
||||||
|
|
||||||
|
for (uint n = 0; n < X.batch; ++n)
|
||||||
|
{
|
||||||
|
float v = X.Get(n, y, x, c);
|
||||||
|
v = 1 / (1 + exp(-v));
|
||||||
|
O.Set(n, y, x, c, v);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
[numthreads(4,4,4)]
|
||||||
|
void Relu6(uint3 dispatchThreadID : SV_DispatchThreadID)
|
||||||
|
{
|
||||||
|
DISPATCH_ARGS(O.channels, O.width, O.height);
|
||||||
|
TENSOR_ARGS2(X, O);
|
||||||
|
|
||||||
|
uint c = dispatchThreadID.x; uint x = dispatchThreadID.y; uint y = dispatchThreadID.z;
|
||||||
|
if (c >= O.channels) return; if (x >= O.width) return; if (y >= O.height) return;
|
||||||
|
|
||||||
|
for (uint n = 0; n < X.batch; ++n)
|
||||||
|
{
|
||||||
|
float v = X.Get(n, y, x, c);
|
||||||
|
v = min(max(0, v), 6);
|
||||||
|
O.Set(n, y, x, c, v);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
[numthreads(4,4,4)]
|
||||||
|
void Elu(uint3 dispatchThreadID : SV_DispatchThreadID)
|
||||||
|
{
|
||||||
|
DISPATCH_ARGS(O.channels, O.width, O.height);
|
||||||
|
TENSOR_ARGS2(X, O);
|
||||||
|
|
||||||
|
uint c = dispatchThreadID.x; uint x = dispatchThreadID.y; uint y = dispatchThreadID.z;
|
||||||
|
if (c >= O.channels) return; if (x >= O.width) return; if (y >= O.height) return;
|
||||||
|
|
||||||
|
for (uint n = 0; n < X.batch; ++n)
|
||||||
|
{
|
||||||
|
float v = X.Get(n, y, x, c);
|
||||||
|
if (v <= 0)
|
||||||
|
v = _Alpha * (exp(v) - 1);
|
||||||
|
O.Set(n, y, x, c, v);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
[numthreads(4,4,4)]
|
||||||
|
void LeakyRelu(uint3 dispatchThreadID : SV_DispatchThreadID)
|
||||||
|
{
|
||||||
|
DISPATCH_ARGS(O.channels, O.width, O.height);
|
||||||
|
TENSOR_ARGS2(X, O);
|
||||||
|
|
||||||
|
uint c = dispatchThreadID.x; uint x = dispatchThreadID.y; uint y = dispatchThreadID.z;
|
||||||
|
if (c >= O.channels) return; if (x >= O.width) return; if (y >= O.height) return;
|
||||||
|
|
||||||
|
for (uint n = 0; n < X.batch; ++n)
|
||||||
|
{
|
||||||
|
float v = X.Get(n, y, x, c);
|
||||||
|
v = max(v, _Alpha * v);
|
||||||
|
O.Set(n, y, x, c, v);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
[numthreads(4,4,4)]
|
||||||
|
void Exp(uint3 dispatchThreadID : SV_DispatchThreadID)
|
||||||
|
{
|
||||||
|
DISPATCH_ARGS(O.channels, O.width, O.height);
|
||||||
|
TENSOR_ARGS2(X, O);
|
||||||
|
|
||||||
|
uint c = dispatchThreadID.x; uint x = dispatchThreadID.y; uint y = dispatchThreadID.z;
|
||||||
|
if (c >= O.channels) return; if (x >= O.width) return; if (y >= O.height) return;
|
||||||
|
|
||||||
|
for (uint n = 0; n < X.batch; ++n)
|
||||||
|
{
|
||||||
|
float v = X.Get(n, y, x, c);
|
||||||
|
v = exp(v);
|
||||||
|
O.Set(n, y, x, c, v);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
float signed_pow(float f, float e)
|
||||||
|
{
|
||||||
|
// handle negative f
|
||||||
|
float v = pow(abs(f), e);
|
||||||
|
float s = (e % 2 == 1) ?
|
||||||
|
sign(f): // exponent is odd => sign(f) * pow(abs(f), e)
|
||||||
|
1; // exponent is even => pow(abs(f), e)
|
||||||
|
return v * s;
|
||||||
|
}
|
||||||
|
|
||||||
|
[numthreads(4,4,4)]
|
||||||
|
void Pow(uint3 dispatchThreadID : SV_DispatchThreadID)
|
||||||
|
{
|
||||||
|
DISPATCH_ARGS(O.channels, O.width, O.height);
|
||||||
|
TENSOR_ARGS2(X, O);
|
||||||
|
|
||||||
|
uint c = dispatchThreadID.x; uint x = dispatchThreadID.y; uint y = dispatchThreadID.z;
|
||||||
|
if (c >= O.channels) return; if (x >= O.width) return; if (y >= O.height) return;
|
||||||
|
|
||||||
|
for (uint n = 0; n < X.batch; ++n)
|
||||||
|
{
|
||||||
|
float v = X.Get(n, y, x, c);
|
||||||
|
v = signed_pow(v, _Alpha);
|
||||||
|
O.Set(n, y, x, c, v);
|
||||||
|
}
|
||||||
|
}

[numthreads(4,4,4)]
void BroadcastAdd(uint3 dispatchThreadID : SV_DispatchThreadID)
{
    DISPATCH_ARGS(O.channels, O.width, O.height);
    TENSOR_ARGS3(X, B, O);

    uint c = dispatchThreadID.x;    uint x = dispatchThreadID.y;    uint y = dispatchThreadID.z;
    if (c >= O.channels) return;    if (x >= O.width) return;    if (y >= O.height) return;

    for (uint n = 0; n < X.batch; ++n)
    {
        float v =
            X.BroadcastGet(n, y, x, c) +
            B.BroadcastGet(n, y, x, c);
        O.Set(n, y, x, c, v);
    }
}

[numthreads(4,4,4)]
void BroadcastSub(uint3 dispatchThreadID : SV_DispatchThreadID)
{
    DISPATCH_ARGS(O.channels, O.width, O.height);
    TENSOR_ARGS3(X, B, O);

    uint c = dispatchThreadID.x;    uint x = dispatchThreadID.y;    uint y = dispatchThreadID.z;
    if (c >= O.channels) return;    if (x >= O.width) return;    if (y >= O.height) return;

    for (uint n = 0; n < X.batch; ++n)
    {
        float v =
            X.BroadcastGet(n, y, x, c) -
            B.BroadcastGet(n, y, x, c);
        O.Set(n, y, x, c, v);
    }
}

[numthreads(4,4,4)]
void BroadcastMul(uint3 dispatchThreadID : SV_DispatchThreadID)
{
    DISPATCH_ARGS(O.channels, O.width, O.height);
    TENSOR_ARGS3(X, B, O);

    uint c = dispatchThreadID.x;    uint x = dispatchThreadID.y;    uint y = dispatchThreadID.z;
    if (c >= O.channels) return;    if (x >= O.width) return;    if (y >= O.height) return;

    for (uint n = 0; n < O.batch; ++n)
    {
        float v =
            X.BroadcastGet(n, y, x, c) *
            B.BroadcastGet(n, y, x, c);
        O.Set(n, y, x, c, v);
    }
}

[numthreads(4,4,4)]
void BroadcastDiv(uint3 dispatchThreadID : SV_DispatchThreadID)
{
    DISPATCH_ARGS(O.channels, O.width, O.height);
    TENSOR_ARGS3(X, B, O);

    uint c = dispatchThreadID.x;    uint x = dispatchThreadID.y;    uint y = dispatchThreadID.z;
    if (c >= O.channels) return;    if (x >= O.width) return;    if (y >= O.height) return;

    for (uint n = 0; n < X.batch; ++n)
    {
        float v =
            X.BroadcastGet(n, y, x, c) /
            B.BroadcastGet(n, y, x, c);
        O.Set(n, y, x, c, v);
    }
}

[numthreads(4,4,4)]
void BroadcastPow(uint3 dispatchThreadID : SV_DispatchThreadID)
{
    DISPATCH_ARGS(O.channels, O.width, O.height);
    TENSOR_ARGS3(X, B, O);

    uint c = dispatchThreadID.x;    uint x = dispatchThreadID.y;    uint y = dispatchThreadID.z;
    if (c >= O.channels) return;    if (x >= O.width) return;    if (y >= O.height) return;

    for (uint n = 0; n < X.batch; ++n)
    {
        float v = signed_pow(
            X.BroadcastGet(n, y, x, c),
            B.BroadcastGet(n, y, x, c));
        O.Set(n, y, x, c, v);
    }
}

[numthreads(4,4,4)]
void BroadcastMin(uint3 dispatchThreadID : SV_DispatchThreadID)
{
    DISPATCH_ARGS(O.channels, O.width, O.height);
    TENSOR_ARGS3(X, B, O);

    uint c = dispatchThreadID.x;    uint x = dispatchThreadID.y;    uint y = dispatchThreadID.z;
    if (c >= O.channels) return;    if (x >= O.width) return;    if (y >= O.height) return;

    for (uint n = 0; n < X.batch; ++n)
    {
        float v = min(
            X.BroadcastGet(n, y, x, c),
            B.BroadcastGet(n, y, x, c));
        O.Set(n, y, x, c, v);
    }
}

[numthreads(4,4,4)]
void BroadcastMax(uint3 dispatchThreadID : SV_DispatchThreadID)
{
    DISPATCH_ARGS(O.channels, O.width, O.height);
    TENSOR_ARGS3(X, B, O);

    uint c = dispatchThreadID.x;    uint x = dispatchThreadID.y;    uint y = dispatchThreadID.z;
    if (c >= O.channels) return;    if (x >= O.width) return;    if (y >= O.height) return;

    for (uint n = 0; n < X.batch; ++n)
    {
        float v = max(
            X.BroadcastGet(n, y, x, c),
            B.BroadcastGet(n, y, x, c));
        O.Set(n, y, x, c, v);
    }
}
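// The Broadcast* kernels above rely on X.BroadcastGet/B.BroadcastGet (defined in
// Tensor.cginc) to map an output NHWC index onto a possibly smaller input, presumably
// by reusing index 0 along any dimension of size 1 (numpy-style broadcasting).
// The exact index-clamping rules live in Tensor.cginc and are not restated here.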

[numthreads(4,4,4)]
void Copy(uint3 dispatchThreadID : SV_DispatchThreadID)
{
    // NOTE: dispatched over X (not O)
    DISPATCH_ARGS(X.channels, X.width, X.height);
    TENSOR_ARGS2(X, O);

    uint c = dispatchThreadID.x;    uint x = dispatchThreadID.y;    uint y = dispatchThreadID.z;
    if (c >= X.channels) return;    if (x >= X.width) return;    if (y >= X.height) return;

    for (uint n = 0; n < X.batch; ++n)
    {
        float v = X.Get(n, y, x, c);
        O.Set(n + _Pad[0], y + _Pad[1], x + _Pad[2], c + _Pad[3], v);
    }
}

[numthreads(4,4,4)]
void Dropout(uint3 dispatchThreadID : SV_DispatchThreadID)
{
    DISPATCH_ARGS(O.channels, O.width, O.height);
    TENSOR_ARGS2(X, O);

    uint c = dispatchThreadID.x;    uint x = dispatchThreadID.y;    uint y = dispatchThreadID.z;
    if (c >= O.channels) return;    if (x >= O.width) return;    if (y >= O.height) return;

    for (uint n = 0; n < O.batch; ++n)
    {
        float4 seed = float4(n / O.batch, y / O.height, x / O.width, c / O.channels);
        seed = frac(seed + _Seed);

        float v = X.Get(n, y, x, c);
        v *= Bernoulli(seed, 1 - _Alpha) / (1 - _Alpha);
        O.Set(n, y, x, c, v);
    }
}
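// Dropout above uses "inverted dropout": each value is kept with probability
// (1 - _Alpha) via Bernoulli(seed, 1 - _Alpha) and then divided by (1 - _Alpha),
// so the expected value of the output matches the input. For example, with
// _Alpha = 0.5 a kept activation of 1.0 becomes 2.0 and dropped ones become 0.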

[numthreads(4,4,4)]
void ScaleBias(uint3 dispatchThreadID : SV_DispatchThreadID)
{
    DISPATCH_ARGS(O.channels, O.width, O.height);
    TENSOR_ARGS4(X, W, B, O);

    uint c = dispatchThreadID.x;
    uint x = dispatchThreadID.y;
    uint y = dispatchThreadID.z;

    if (c >= O.channels) return;
    if (x >= O.width) return;
    if (y >= O.height) return;

    float scale = W.Get(0, 0, 0, c);
    float bias = B.Get(0, 0, 0, c);

    for (uint n = 0; n < X.batch; ++n)
    {
        float v = X.Get(n, y, x, c);
        v = v * scale + bias;
        O.Set(n, y, x, c, v);
    }
}

[numthreads(16,4,1)]
void Softmax(uint3 dispatchThreadID : SV_DispatchThreadID)
{
    DISPATCH_ARGS(O.flatWidth, O.flatHeight, 1);
    TENSOR_ARGS2(X, O);

    uint x = dispatchThreadID.x;
    uint y = dispatchThreadID.y;

    if (x >= O.GetFlatWidth()) return;
    if (y >= O.GetFlatHeight()) return;

    float maxV = -FLT_MAX;
    for (uint i = 0; i < X.GetFlatWidth(); ++i)
    {
        float v = X.Get(y, i);
        if (v > maxV)
            maxV = v;
    }

    float acc = 0.0f;
    for (i = 0; i < X.GetFlatWidth(); ++i)
    {
        float v = X.Get(y, i);
        acc += exp(v - maxV);
    }

    float v = X.Get(y, x);
    v = exp(v - maxV) / acc;
    O.Set(y, x, v);
}
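// Softmax above subtracts the row maximum before exponentiating:
//   softmax(x_i) = exp(x_i - max(x)) / sum_j exp(x_j - max(x))
// which is mathematically identical to the textbook form but avoids overflow of
// exp() for large logits. Note each thread rescans its whole row, so the cost is
// O(flatWidth) per output element.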

[numthreads(4,4,4)]
void Upsample2D(uint3 dispatchThreadID : SV_DispatchThreadID)
{
    // NOTE: dispatched over X (not O)
    DISPATCH_ARGS(X.channels, X.width, X.height);
    TENSOR_ARGS2(X, O);

    uint c = dispatchThreadID.x;
    uint x = dispatchThreadID.y;
    uint y = dispatchThreadID.z;

    if (c >= X.channels) return;
    if (x >= X.width) return;
    if (y >= X.height) return;

    for (uint n = 0; n < O.batch; ++n)
    {
        float v = X.Get(n, y, x, c);

        for (uint dy = 0; dy < _Pool.y; ++dy)
            for (uint dx = 0; dx < _Pool.x; ++dx)
            {
                uint oy = y * _Pool.y + dy;
                uint ox = x * _Pool.x + dx;
                O.Set(n, oy, ox, c, v);
            }
    }
}

[numthreads(4,4,4)]
void MaxPool2D(uint3 dispatchThreadID : SV_DispatchThreadID)
{
    DISPATCH_ARGS(O.channels, O.width, O.height);
    TENSOR_ARGS2(X, O);

    uint c = dispatchThreadID.x;
    uint x = dispatchThreadID.y;
    uint y = dispatchThreadID.z;

    if (c >= O.channels) return;
    if (x >= O.width) return;
    if (y >= O.height) return;

    for (uint n = 0; n < X.batch; ++n)
    {
        float maxV = -FLT_MAX;
        for (uint dy = 0; dy < _Pool.y; ++dy)
            for (uint dx = 0; dx < _Pool.x; ++dx)
            {
                uint2 pos = uint2(x, y) * _Stride.xy + uint2(dx, dy);
                float v = X.SafeGet(n, pos, c, _Pad.xy);
                maxV = max(v, maxV);
            }

        O.Set(n, y, x, c, maxV);
    }
}

[numthreads(4,4,4)]
void AvgPool2D(uint3 dispatchThreadID : SV_DispatchThreadID)
{
    DISPATCH_ARGS(O.channels, O.width, O.height);
    TENSOR_ARGS2(X, O);

    uint c = dispatchThreadID.x;
    uint x = dispatchThreadID.y;
    uint y = dispatchThreadID.z;

    if (c >= O.channels) return;
    if (x >= O.width) return;
    if (y >= O.height) return;

    uint2 leftCorner = _Pad.xy;
    uint2 rightCorner = uint2(X.width, X.height) + _Pad.xy;
    for (uint n = 0; n < X.batch; ++n)
    {
        float acc = 0;
        float counter = 0;
        for (uint dy = 0; dy < _Pool.y; ++dy)
            for (uint dx = 0; dx < _Pool.x; ++dx)
            {
                uint2 pos = uint2(x, y) * _Stride.xy + uint2(dx, dy);

                bool mask = all(pos >= leftCorner) && all(pos < rightCorner);
                acc += (mask)? X.Get(n, pos.y - leftCorner.y, pos.x - leftCorner.x, c): 0;
                counter += (mask)? 1: 0;
            }

        acc /= counter;
        O.Set(n, y, x, c, acc);
    }
}

[numthreads(32,1,1)]
void GlobalMaxPool2D(uint3 dispatchThreadID : SV_DispatchThreadID)
{
    DISPATCH_ARGS(O.channels, 1, 1);
    TENSOR_ARGS2(X, O);

    uint c = dispatchThreadID.x;
    if (c >= O.channels) return;
    //ASSERT(X.batch == O.batch)

    for (uint n = 0; n < X.batch; ++n)
    {
        float maxV = -FLT_MAX;
        for (uint y = 0; y < X.height; ++y)
            for (uint x = 0; x < X.width; ++x)
            {
                float v = X.Get(n, y, x, c);
                maxV = max(v, maxV);
            }

        O.Set(n, 0, 0, c, maxV);
    }
}

[numthreads(32,1,1)]
void GlobalAvgPool2D(uint3 dispatchThreadID : SV_DispatchThreadID)
{
    DISPATCH_ARGS(O.channels, 1, 1);
    TENSOR_ARGS2(X, O);

    uint c = dispatchThreadID.x;
    if (c >= O.channels) return;
    //ASSERT(X.batch == O.batch)

    for (uint n = 0; n < X.batch; ++n)
    {
        float v = 0;
        for (uint y = 0; y < X.height; ++y)
            for (uint x = 0; x < X.width; ++x)
                v += X.Get(n, y, x, c);

        v /= (X.height * X.width);
        O.Set(n, 0, 0, c, v);
    }
}
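// AvgPool2D above averages only over positions that fall inside X (the 'mask' /
// 'counter' logic), i.e. zero-padded border samples are excluded from the divisor.
// This corresponds to "count_include_pad = false" style average pooling; a variant
// that includes padding in the count would divide by _Pool.x * _Pool.y instead.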

[numthreads(32,1,1)]
void InstanceNorm(uint3 dispatchThreadID : SV_DispatchThreadID)
{
    DISPATCH_ARGS(O.channels, 1, 1);
    TENSOR_ARGS4(X, W, B, O);

    uint c = dispatchThreadID.x;
    if (c >= O.channels) return;
    //ASSERT(X.shape == O.shape)

    float gamma = W.Get(0, 0, 0, c);
    float beta = B.Get(0, 0, 0, c);

    for (uint n = 0; n < O.batch; ++n)
    {
        uint x, y;
        // calc mean
        float acc = 0;
        for (y = 0; y < O.height; ++y)
            for (x = 0; x < O.width; ++x)
                acc += X.Get(n, y, x, c);
        float mean = acc / (O.width * O.height);

        // calc variance
        acc = 0;
        for (y = 0; y < O.height; ++y)
            for (x = 0; x < O.width; ++x)
            {
                float delta = X.Get(n, y, x, c) - mean;
                acc += delta * delta;
            }
        float var = acc / (O.width * O.height);

        // normalization factor
        float invNormFactor = 1 / sqrt(var + FLT_EPSILON);

        float scale = gamma * invNormFactor;
        float bias = beta - gamma * mean * invNormFactor;

        // apply normalization
        for (y = 0; y < O.height; ++y)
            for (x = 0; x < O.width; ++x)
            {
                float v = X.Get(n, y, x, c);
                v = v * scale + bias;
                O.Set(n, y, x, c, v);
            }
    }
}
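// InstanceNorm above folds normalization into a single multiply-add per element:
//   y = gamma * (x - mean) / sqrt(var + eps) + beta
//     = x * scale + bias,  with scale = gamma / sqrt(var + eps)
//                          and  bias  = beta - mean * scale
// Mean and variance are computed per (batch, channel) over the spatial dimensions,
// in two passes over the data.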

[numthreads(4,4,4)]
void Conv2D(uint3 dispatchThreadID : SV_DispatchThreadID)
{
    DISPATCH_ARGS(K.kernelCount, O.width, O.height);
    TENSOR_ARGS4(X, K, B, O);

    uint k = dispatchThreadID.x;
    uint x = dispatchThreadID.y;
    uint y = dispatchThreadID.z;

    if (k >= K.channels) return;
    if (x >= O.width) return;
    if (y >= O.height) return;

    for (uint n = 0; n < O.batch; ++n)
    {
        float acc = B.Get(k);
        for (uint dy = 0; dy < K.GetKernelHeight(); ++dy)
        {
            for (uint dx = 0; dx < K.GetKernelWidth(); ++dx)
            {
                uint2 pos = uint2(x, y) * _Stride.xy + uint2(dx, dy);
                for (uint c = 0; c < X.channels; ++c)
                {
                    float v = X.SafeGet(n, pos, c, _Pad.xy);
                    acc += v * K.Get(dy, dx, c, k);
                }
            }
        }

        O.Set(n, y, x, k, acc);
    }
}

NUMTHREADS((16,4,4), (8,4,4), (4,4,4))
void DepthwiseConv2D(uint3 dispatchThreadID : SV_DispatchThreadID)
{
    DISPATCH_ARGS(K.kernelCount, O.width, O.height);
    TENSOR_ARGS4(X, K, B, O);

    uint k = dispatchThreadID.x;
    uint x = dispatchThreadID.y;
    uint y = dispatchThreadID.z;

    if (k >= K.channels) return;
    if (x >= O.width) return;
    if (y >= O.height) return;

    for (uint n = 0; n < O.batch; ++n)
    {
        float acc = B.Get(k);
        for (uint dy = 0; dy < K.GetKernelHeight(); ++dy)
            for (uint dx = 0; dx < K.GetKernelWidth(); ++dx)
            {
                uint2 pos = uint2(x, y) * _Stride.xy + uint2(dx, dy);
                float v = X.SafeGet(n, pos, k, _Pad.xy);
                acc += v * K.Get(dy, dx, 0, k);
            }

        O.Set(n, y, x, k, acc);
    }
}

[numthreads(4,4,4)]
void Unstride2D(uint3 dispatchThreadID : SV_DispatchThreadID)
{
    DISPATCH_ARGS(O.channels, O.width, O.height);
    TENSOR_ARGS2(X, O);

    uint c = dispatchThreadID.x;
    uint x = dispatchThreadID.y;
    uint y = dispatchThreadID.z;

    if (c >= O.channels) return;
    if (x >= O.width) return;
    if (y >= O.height) return;

    for (uint n = 0; n < O.batch; ++n)
    {
        int xx = (int)x - (int)_Pad.x;
        int yy = (int)y - (int)_Pad.y;

        int my = yy % _Stride.y;
        int mx = xx % _Stride.x;

        int oy = yy / _Stride.y;
        int ox = xx / _Stride.x;

        bool mask = ox >= 0 && oy >= 0 && ox < (int)X.width && oy < (int)X.height &&
            my == 0 && mx == 0;

        float v = mask ? X.Get(n, (uint)oy, (uint)ox, c) : 0;
        O.Set(n, y, x, c, v);
    }
}

[numthreads(4,4,4)]
void Conv2DTrans(uint3 dispatchThreadID : SV_DispatchThreadID)
{
    DISPATCH_ARGS(K.kernelCount, O.width, O.height);
    TENSOR_ARGS4(X, K, B, O);

    uint k = dispatchThreadID.x;
    uint x = dispatchThreadID.y;
    uint y = dispatchThreadID.z;

    if (k >= K.channels) return;
    if (x >= O.width) return;
    if (y >= O.height) return;

    uint2 strideMask = _Stride.xy - 1;

    for (uint n = 0; n < O.batch; ++n)
    {
        float acc = B.Get(k);
        for (uint dy = y & strideMask.y; dy < K.GetKernelHeight(); dy += _Stride.y)
        {
            for (uint dx = x & strideMask.x; dx < K.GetKernelWidth(); dx += _Stride.x)
            {
                for (uint c = 0; c < X.channels; ++c)
                {
                    uint xx = x + dx;
                    uint yy = y + dy;

                    uint oy = (yy - _Pad.y) / _Stride.y;
                    uint ox = (xx - _Pad.x) / _Stride.x;

                    bool mask = xx >= _Pad.x && yy >= _Pad.y && ox < X.width && oy < X.height;

                    float v = (mask)? X.Get(n, oy, ox, c): 0;
                    acc += v * K.Get(K.GetKernelHeight() - 1 - dy, K.GetKernelWidth() - 1 - dx, c, k);
                }
            }
        }

        O.Set(n, y, x, k, acc);
    }
}
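// Conv2DTrans above implements transposed convolution as a correlation with the
// spatially flipped kernel (K.GetKernelHeight() - 1 - dy, ...) over the input
// positions that contribute to each strided output location. Note that
// 'strideMask = _Stride - 1' together with 'y & strideMask' only equals
// 'y % _Stride.y' when the stride is a power of two; that restriction is implied
// by this kernel, not stated elsewhere in the file.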


Texture2D<float4> Xtex2D;
Texture3D<float4> Xtex3D;
Texture2DArray<float4> Xtex2DArray;
SamplerState samplerXtex2D { Filter = MIN_MAG_LINEAR_MIP_POINT; AddressU = Clamp; AddressV = Clamp; };
SamplerState samplerXtex3D { Filter = MIN_MAG_LINEAR_MIP_POINT; AddressU = Clamp; AddressV = Clamp; AddressW = Clamp; };
SamplerState samplerXtex2DArray { Filter = MIN_MAG_LINEAR_MIP_POINT; AddressU = Clamp; AddressV = Clamp; };

RWTexture2D<float4> Otex2D;
RWTexture3D<float4> Otex3D;
RWTexture2DArray<float4> Otex2DArray;

bool _FlipY;

// TODO: call TextureToTensor(v, dispatchThreadID) from Tex2DToTensor() { v = Xtex2D.SampleLevel }
[numthreads(8,8,1)]
void TextureToTensor(uint3 dispatchThreadID : SV_DispatchThreadID)
{
    TENSOR_ARG_RW(O);

    uint b = _Pad.x;
    uint x = dispatchThreadID.x + _Pad.y;
    uint y = dispatchThreadID.y + _Pad.z;
    uint c = dispatchThreadID.z + _Pad.w;

    // calculate texture coordinates:
    //  offset by 0.5 to get texel centers
    //  divide by texture resolution (_Pool)
    float3 uvw = (float3)dispatchThreadID + float3(0.5f, 0.5f, 0);
    uvw /= (float3)_Pool.xyz;
    if (_FlipY)
        uvw.y = 1 - uvw.y;

    float4 v = Xtex2D.SampleLevel(samplerXtex2D, uvw.xy, 0);
    //texArray.SampleLevel(smpArray, loc, 0);

    if (_Stride.w == 1)
    {
        // TODO: interpret color as
        O.Set(b, y, x, c+0, (v.r + v.g + v.b) / 3.0f);
    }
    else if (_Stride.w == 3)
    {
        O.Set(b, y, x, c+0, v.r);
        O.Set(b, y, x, c+1, v.g);
        O.Set(b, y, x, c+2, v.b);
    }
    else if (_Stride.w == 4)
    {
        O.Set(b, y, x, c+0, v.r);
        O.Set(b, y, x, c+1, v.g);
        O.Set(b, y, x, c+2, v.b);
        O.Set(b, y, x, c+3, v.a);
    }
}

[numthreads(8,8,1)]
void TensorToTexture(uint3 dispatchThreadID : SV_DispatchThreadID)
{
    TENSOR_ARG(X);

    uint b = _Pad.x;
    uint x = dispatchThreadID.x + _Pad.y;
    uint y = dispatchThreadID.y + _Pad.z;
    uint c = dispatchThreadID.z + _Pad.w;

    if (_FlipY)
        y = X.height - 1 - y;

    float4 v = 0;

    if (X.channels - c == 1)
    {
        // broadcast to all channels
        v = X.Get(b, y, x, c);
    }
    else if (X.channels - c == 3)
    {
        v.r = X.Get(b, y, x, c+0);
        v.g = X.Get(b, y, x, c+1);
        v.b = X.Get(b, y, x, c+2);
        v.a = 1;
    }
    else if (X.channels - c >= 4)
    {
        v.r = X.Get(b, y, x, c+0);
        v.g = X.Get(b, y, x, c+1);
        v.b = X.Get(b, y, x, c+2);
        v.a = X.Get(b, y, x, c+3);
    }

    Otex2D[dispatchThreadID.xy] = v;
}
@ -0,0 +1,9 @@
fileFormatVersion: 2
guid: b4b1b304aae6c404cb0cdab46b8fa084
timeCreated: 1495527718
licenseType: Pro
ComputeShaderImporter:
  currentAPIMask: 196608
  userData:
  assetBundleName:
  assetBundleVariant:
@ -0,0 +1,149 @@
#pragma kernel BroadcastAdd
#pragma kernel BroadcastSub
#pragma kernel BroadcastMul
#pragma kernel BroadcastDiv
#pragma kernel BroadcastPow
#pragma kernel BroadcastMin
#pragma kernel BroadcastMax

#include "Tensor.cginc"

TENSOR_DECL(X)
TENSOR_DECL(B)
TENSOR_DECL_RW(O)

NUMTHREADS((4,8,8), (4,8,4), (4,4,4))
void BroadcastAdd(uint3 dispatchThreadID : SV_DispatchThreadID)
{
    DISPATCH_ARGS(O.channels, O.width, O.height);
    TENSOR_ARGS3(X, B, O);

    uint c = dispatchThreadID.x;    uint x = dispatchThreadID.y;    uint y = dispatchThreadID.z;
    if (c >= O.channels) return;    if (x >= O.width) return;    if (y >= O.height) return;

    for (uint n = 0; n < X.batch; ++n)
    {
        float v =
            X.BroadcastGet(n, y, x, c) +
            B.BroadcastGet(n, y, x, c);
        O.Set(n, y, x, c, v);
    }
}

NUMTHREADS((4,8,8), (4,8,4), (4,4,4))
void BroadcastSub(uint3 dispatchThreadID : SV_DispatchThreadID)
{
    DISPATCH_ARGS(O.channels, O.width, O.height);
    TENSOR_ARGS3(X, B, O);

    uint c = dispatchThreadID.x;    uint x = dispatchThreadID.y;    uint y = dispatchThreadID.z;
    if (c >= O.channels) return;    if (x >= O.width) return;    if (y >= O.height) return;

    for (uint n = 0; n < X.batch; ++n)
    {
        float v =
            X.BroadcastGet(n, y, x, c) -
            B.BroadcastGet(n, y, x, c);
        O.Set(n, y, x, c, v);
    }
}

NUMTHREADS((4,8,8), (4,8,4), (4,4,4))
void BroadcastMul(uint3 dispatchThreadID : SV_DispatchThreadID)
{
    DISPATCH_ARGS(O.channels, O.width, O.height);
    TENSOR_ARGS3(X, B, O);

    uint c = dispatchThreadID.x;    uint x = dispatchThreadID.y;    uint y = dispatchThreadID.z;
    if (c >= O.channels) return;    if (x >= O.width) return;    if (y >= O.height) return;

    for (uint n = 0; n < O.batch; ++n)
    {
        float v =
            X.BroadcastGet(n, y, x, c) *
            B.BroadcastGet(n, y, x, c);
        O.Set(n, y, x, c, v);
    }
}

NUMTHREADS((4,8,8), (4,8,4), (4,4,4))
void BroadcastDiv(uint3 dispatchThreadID : SV_DispatchThreadID)
{
    DISPATCH_ARGS(O.channels, O.width, O.height);
    TENSOR_ARGS3(X, B, O);

    uint c = dispatchThreadID.x;    uint x = dispatchThreadID.y;    uint y = dispatchThreadID.z;
    if (c >= O.channels) return;    if (x >= O.width) return;    if (y >= O.height) return;

    for (uint n = 0; n < X.batch; ++n)
    {
        float v =
            X.BroadcastGet(n, y, x, c) /
            B.BroadcastGet(n, y, x, c);
        O.Set(n, y, x, c, v);
    }
}

float signed_pow(float f, float e)
{
    // handle negative f
    float v = pow(abs(f), e);
    float s = (e % 2 == 1) ?
        sign(f):    // exponent is odd  => sign(f) * pow(abs(f), e)
        1;          // exponent is even => pow(abs(f), e)
    return v * s;
}

NUMTHREADS((4,8,8), (4,8,4), (4,4,4))
void BroadcastPow(uint3 dispatchThreadID : SV_DispatchThreadID)
{
    DISPATCH_ARGS(O.channels, O.width, O.height);
    TENSOR_ARGS3(X, B, O);

    uint c = dispatchThreadID.x;    uint x = dispatchThreadID.y;    uint y = dispatchThreadID.z;
    if (c >= O.channels) return;    if (x >= O.width) return;    if (y >= O.height) return;

    for (uint n = 0; n < X.batch; ++n)
    {
        float v = signed_pow(
            X.BroadcastGet(n, y, x, c),
            B.BroadcastGet(n, y, x, c));
        O.Set(n, y, x, c, v);
    }
}

NUMTHREADS((4,8,8), (4,8,4), (4,4,4))
void BroadcastMin(uint3 dispatchThreadID : SV_DispatchThreadID)
{
    DISPATCH_ARGS(O.channels, O.width, O.height);
    TENSOR_ARGS3(X, B, O);

    uint c = dispatchThreadID.x;    uint x = dispatchThreadID.y;    uint y = dispatchThreadID.z;
    if (c >= O.channels) return;    if (x >= O.width) return;    if (y >= O.height) return;

    for (uint n = 0; n < X.batch; ++n)
    {
        float v = min(
            X.BroadcastGet(n, y, x, c),
            B.BroadcastGet(n, y, x, c));
        O.Set(n, y, x, c, v);
    }
}

NUMTHREADS((4,8,8), (4,8,4), (4,4,4))
void BroadcastMax(uint3 dispatchThreadID : SV_DispatchThreadID)
{
    DISPATCH_ARGS(O.channels, O.width, O.height);
    TENSOR_ARGS3(X, B, O);

    uint c = dispatchThreadID.x;    uint x = dispatchThreadID.y;    uint y = dispatchThreadID.z;
    if (c >= O.channels) return;    if (x >= O.width) return;    if (y >= O.height) return;

    for (uint n = 0; n < X.batch; ++n)
    {
        float v = max(
            X.BroadcastGet(n, y, x, c),
            B.BroadcastGet(n, y, x, c));
        O.Set(n, y, x, c, v);
    }
}
@ -0,0 +1,8 @@
fileFormatVersion: 2
guid: 72dd00e416ab94bd79e7264a1fadef9d
ComputeShaderImporter:
  externalObjects: {}
  currentAPIMask: 65536
  userData:
  assetBundleName:
  assetBundleVariant:
|
|
@ -0,0 +1,396 @@
|
||||||
|
#pragma kernel Conv2D
|
||||||
|
#pragma kernel Conv2D_RegisterBlock4x2
|
||||||
|
//#pragma kernel Conv2D_L1Cached64_RegisterBlock4x4
|
||||||
|
|
||||||
|
#pragma kernel DepthwiseConv2D
|
||||||
|
|
||||||
|
#pragma kernel Conv2DTrans
|
||||||
|
#pragma kernel Conv2DTrans_L1Cached64_RegisterBlock2x2
|
||||||
|
|
||||||
|
#include "Tensor.cginc"
|
||||||
|
|
||||||
|
TENSOR_DECL(X)
|
||||||
|
TENSOR_DECL(K)
|
||||||
|
TENSOR_DECL(B)
|
||||||
|
TENSOR_DECL(WBK)
|
||||||
|
TENSOR_DECL_RW(O)
|
||||||
|
|
||||||
|
uint4 _Pad;
|
||||||
|
uint4 _Stride;
|
||||||
|
|
||||||
|
NUMTHREADS((16,4,4), (8,4,4), (4,4,4))
|
||||||
|
void Conv2D(uint3 dispatchThreadID : SV_DispatchThreadID)
|
||||||
|
{
|
||||||
|
DISPATCH_ARGS(K.kernelCount, O.width, O.height);
|
||||||
|
TENSOR_SHARED2_ARGS4(X, K, B, WBK, O);
|
||||||
|
|
||||||
|
uint k = dispatchThreadID.x;
|
||||||
|
uint x = dispatchThreadID.y;
|
||||||
|
uint y = dispatchThreadID.z;
|
||||||
|
|
||||||
|
if (k >= K.channels) return;
|
||||||
|
if (x >= O.width) return;
|
||||||
|
if (y >= O.height) return;
|
||||||
|
|
||||||
|
uint2 leftCorner = _Pad.xy;
|
||||||
|
uint2 rightCorner = uint2(X.width, X.height) + _Pad.xy;
|
||||||
|
for (uint n = 0; n < O.batch; ++n)
|
||||||
|
{
|
||||||
|
float acc = B.Get(k);
|
||||||
|
for (uint dy = 0; dy < K.GetKernelHeight(); ++dy)
|
||||||
|
{
|
||||||
|
for (uint dx = 0; dx < K.GetKernelWidth(); ++dx)
|
||||||
|
{
|
||||||
|
uint2 pos = uint2(x, y) * _Stride.xy + uint2(dx, dy);
|
||||||
|
// @TODO: investigate
|
||||||
|
// WARNING: had to move both y check into the loop (as opposed to checking y in parent loop) - due to potential bug in Metal compiler
|
||||||
|
if (any(pos < leftCorner)) continue;
|
||||||
|
if (any(pos >= rightCorner)) continue;
|
||||||
|
|
||||||
|
for (uint c = 0; c < X.channels; ++c)
|
||||||
|
acc = fastfma(X.Get(n, pos.y - leftCorner.y, pos.x - leftCorner.x, c), K.Get(dy, dx, c, k), acc);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
O.Set(n, y, x, k, acc);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
#define SIZE_W 4
|
||||||
|
#define SIZE_H 2
|
||||||
|
NUMTHREADS((64, 2, 2), (32, 2, 2), (16, 2, 2))
|
||||||
|
void Conv2D_RegisterBlock4x2(uint3 dispatchThreadID : SV_DispatchThreadID)
|
||||||
|
{
|
||||||
|
DISPATCH_ARGS(K.kernelCount, O.width, O.height);
|
||||||
|
TENSOR_SHARED2_ARGS4(X, K, B, WBK, O);
|
||||||
|
|
||||||
|
uint k = dispatchThreadID.x;
|
||||||
|
uint x = dispatchThreadID.y;
|
||||||
|
uint y = dispatchThreadID.z;
|
||||||
|
|
||||||
|
if (k >= K.channels) return;
|
||||||
|
if (x*SIZE_W >= O.width) return;
|
||||||
|
if (y*SIZE_H >= O.height) return;
|
||||||
|
|
||||||
|
uint2 leftCorner = _Pad.xy;
|
||||||
|
uint2 rightCorner = uint2(X.width, X.height) + _Pad.xy;
|
||||||
|
for (uint n = 0; n < O.batch; ++n)
|
||||||
|
{
|
||||||
|
float acc[SIZE_H*SIZE_W];
|
||||||
|
[unroll]
|
||||||
|
for (uint q = 0; q < SIZE_H*SIZE_W; ++q)
|
||||||
|
acc[q] = B.Get(k);
|
||||||
|
for (uint dy = 0; dy < K.GetKernelHeight(); ++dy)
|
||||||
|
{
|
||||||
|
for (uint dx = 0; dx < K.GetKernelWidth(); ++dx)
|
||||||
|
{
|
||||||
|
uint2 pos[SIZE_H*SIZE_W];
|
||||||
|
[unroll]
|
||||||
|
for (uint q = 0; q < SIZE_H*SIZE_W; ++q)
|
||||||
|
pos[q] = uint2(x*SIZE_W+(q%SIZE_W), y*SIZE_H+(q/SIZE_W)) * _Stride.xy + uint2(dx, dy);
|
||||||
|
|
||||||
|
for (uint c = 0; c < X.channels; ++c)
|
||||||
|
[unroll]
|
||||||
|
for (q = 0; q < SIZE_H*SIZE_W; ++q)
|
||||||
|
if (all(pos[q] >= leftCorner) && all(pos[q] < rightCorner))
|
||||||
|
acc[q] = fastfma(X.Get(n, pos[q] - leftCorner, c), K.Get(dy, dx, c, k), acc[q]);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
[unroll]
|
||||||
|
for (q = 0; q < SIZE_H*SIZE_W; ++q)
|
||||||
|
O.Set(n, y*SIZE_H+(q/SIZE_W), x*SIZE_W+(q%SIZE_W), k, acc[q]);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
#undef SIZE_W
|
||||||
|
#undef SIZE_H
|
||||||
|
|
||||||
|
#undef L1CACHESIZE
|
||||||
|
#define L1CACHESIZE 64
|
||||||
|
#undef SIZE
|
||||||
|
#define SIZE 4
|
||||||
|
groupshared float Conv2D_L1Cached64_Reg_Loop_safe_X[SIZE*SIZE][L1CACHESIZE];
|
||||||
|
[numthreads(L1CACHESIZE, 1, 1)]
|
||||||
|
void Conv2D_L1Cached64_RegisterBlock4x4(uint3 groupID : SV_GroupID, uint3 groupThreadID : SV_GroupThreadID)
|
||||||
|
{
|
||||||
|
DISPATCH_ARGS(K.kernelCount, O.width, O.height);
|
||||||
|
TENSOR_SHARED2_ARGS4(X, K, B, WBK, O);
|
||||||
|
|
||||||
|
#define X_ Conv2D_L1Cached64_Reg_Loop_safe_X
|
||||||
|
|
||||||
|
uint k = L1CACHESIZE * groupID.x + groupThreadID.x;
|
||||||
|
uint x = groupID.y;
|
||||||
|
uint y = groupID.z;
|
||||||
|
|
||||||
|
// need all threads to load channels, thus will do late check against kernel count
|
||||||
|
if (x*SIZE >= O.width) return;
|
||||||
|
if (y*SIZE >= O.height) return;
|
||||||
|
|
||||||
|
for (uint n = 0; n < O.batch; ++n)
|
||||||
|
{
|
||||||
|
float acc[SIZE*SIZE];
|
||||||
|
[unroll]
|
||||||
|
for (uint q = 0; q < SIZE*SIZE; ++q)
|
||||||
|
acc[q] = B.SafeGet(k);
|
||||||
|
|
||||||
|
for (uint dy = 0; dy < K.GetKernelHeight(); ++dy)
|
||||||
|
{
|
||||||
|
for (uint dx = 0; dx < K.GetKernelWidth(); ++dx)
|
||||||
|
{
|
||||||
|
uint2 pos[SIZE*SIZE];
|
||||||
|
[unroll]
|
||||||
|
for (uint q = 0; q < SIZE*SIZE; ++q)
|
||||||
|
pos[q] = uint2(x*SIZE+(q%SIZE), y*SIZE+(q/SIZE)) * _Stride.xy + uint2(dx, dy);
|
||||||
|
|
||||||
|
for (uint c = 0; c < X.channels; c += L1CACHESIZE)
|
||||||
|
{
|
||||||
|
// Cache X
|
||||||
|
uint dc = groupThreadID.x;
|
||||||
|
[unroll]
|
||||||
|
for (q = 0; q < SIZE*SIZE; ++q)
|
||||||
|
X_[q][dc] = X.SafeGet(n, pos[q], c + dc, _Pad.xy);
|
||||||
|
GroupMemoryBarrierWithGroupSync();
|
||||||
|
|
||||||
|
// X * K
|
||||||
|
if (k < K.channels) // need all threads to load channels, thus late check against kernel count
|
||||||
|
{
|
||||||
|
uint kIndex = K.Index(dy, dx, c, k);
|
||||||
|
for (dc = 0; dc < L1CACHESIZE; ++dc)
|
||||||
|
{
|
||||||
|
[unroll]
|
||||||
|
for (q = 0; q < SIZE*SIZE; ++q)
|
||||||
|
acc[q] = fastfma(X_[q][dc], K.data[kIndex], acc[q]);
|
||||||
|
kIndex += K.channels;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
GroupMemoryBarrierWithGroupSync();
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
uint remainderW = (O.width - x*SIZE);
|
||||||
|
uint remainderH = (O.height - y*SIZE);
|
||||||
|
|
||||||
|
if (k < K.channels) // need all threads to load channels, thus late check against kernel count
|
||||||
|
[unroll]
|
||||||
|
for (q = 0; q < SIZE*SIZE; ++q)
|
||||||
|
if (q/SIZE < remainderH && q%SIZE < remainderW)
|
||||||
|
O.Set(n, y*SIZE+(q/SIZE), x*SIZE+(q%SIZE), k, acc[q]);
|
||||||
|
}
|
||||||
|
|
||||||
|
#undef X_
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
NUMTHREADS((16,4,4), (8,4,4), (4,4,4))
|
||||||
|
void DepthwiseConv2D(uint3 dispatchThreadID : SV_DispatchThreadID)
|
||||||
|
{
|
||||||
|
DISPATCH_ARGS(K.kernelCount, O.width, O.height);
|
||||||
|
TENSOR_SHARED2_ARGS4(X, K, B, WBK, O);
|
||||||
|
|
||||||
|
uint k = dispatchThreadID.x;
|
||||||
|
uint x = dispatchThreadID.y;
|
||||||
|
uint y = dispatchThreadID.z;
|
||||||
|
|
||||||
|
if (k >= K.channels) return;
|
||||||
|
if (x >= O.width) return;
|
||||||
|
if (y >= O.height) return;
|
||||||
|
|
||||||
|
uint2 leftCorner = _Pad.xy;
|
||||||
|
uint2 rightCorner = uint2(X.width, X.height) + _Pad.xy;
|
||||||
|
|
||||||
|
uint2 leftKernelCorner = uint2(x, y) * _Stride.xy;
|
||||||
|
uint2 rightKernelCorner = leftKernelCorner + uint2(K.GetKernelWidth(), K.GetKernelHeight());
|
||||||
|
|
||||||
|
if (any(leftKernelCorner < leftCorner) || any(rightKernelCorner >= rightCorner))
|
||||||
|
{
|
||||||
|
// path with edge-cases checks
|
||||||
|
for (uint n = 0; n < O.batch; ++n)
|
||||||
|
{
|
||||||
|
float acc = B.Get(k);
|
||||||
|
for (uint dy = 0; dy < K.GetKernelHeight(); ++dy)
|
||||||
|
for (uint dx = 0; dx < K.GetKernelWidth(); ++dx)
|
||||||
|
{
|
||||||
|
uint2 pos = leftKernelCorner + uint2(dx, dy);
|
||||||
|
if (any(pos < leftCorner)) continue;
|
||||||
|
if (any(pos >= rightCorner)) continue;
|
||||||
|
|
||||||
|
acc = fastfma(
|
||||||
|
X.Get(n, pos.y - leftCorner.y, pos.x - leftCorner.x, k),
|
||||||
|
K.Get(dy, dx, 0, k),
|
||||||
|
acc);
|
||||||
|
}
|
||||||
|
|
||||||
|
O.Set(n, y, x, k, acc);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
else
|
||||||
|
{
|
||||||
|
// kernel is guaranteed to be within X,
|
||||||
|
// no need to check against edge-cases
|
||||||
|
leftKernelCorner -= leftCorner;
|
||||||
|
for (uint n = 0; n < O.batch; ++n)
|
||||||
|
{
|
||||||
|
float acc = B.Get(k);
|
||||||
|
for (uint dy = 0; dy < K.GetKernelHeight(); ++dy)
|
||||||
|
for (uint dx = 0; dx < K.GetKernelWidth(); ++dx)
|
||||||
|
{
|
||||||
|
uint2 pos = leftKernelCorner + uint2(dx, dy);
|
||||||
|
|
||||||
|
acc = fastfma(
|
||||||
|
X.Get(n, pos, k),
|
||||||
|
K.Get(dy, dx, 0, k),
|
||||||
|
acc);
|
||||||
|
}
|
||||||
|
|
||||||
|
O.Set(n, y, x, k, acc);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
// Significantly faster than Conv2DTrans
|
||||||
|
[numthreads(16,2,2)]
|
||||||
|
void Conv2DTrans(uint3 dispatchThreadID : SV_DispatchThreadID)
|
||||||
|
{
|
||||||
|
// NOTE: dispatched over X (not O)
|
||||||
|
DISPATCH_ARGS(K.kernelCount, X.width, X.height);
|
||||||
|
TENSOR_SHARED2_ARGS4(X, K, B, WBK, O);
|
||||||
|
|
||||||
|
uint k = dispatchThreadID.x;
|
||||||
|
uint x = dispatchThreadID.y;
|
||||||
|
uint y = dispatchThreadID.z;
|
||||||
|
|
||||||
|
if (k >= K.channels) return;
|
||||||
|
if (x >= X.width) return;
|
||||||
|
if (y >= X.height) return;
|
||||||
|
|
||||||
|
uint2 pad = _Pad.xy / _Stride.xy;
|
||||||
|
uint2 leftCorner = pad;
|
||||||
|
uint2 rightCorner = uint2(X.width, X.height) + pad;
|
||||||
|
|
||||||
|
for (uint n = 0; n < O.batch; ++n)
|
||||||
|
{
|
||||||
|
for (uint sy = 0; sy < _Stride.y; ++sy)
|
||||||
|
{
|
||||||
|
for (uint sx = 0; sx < _Stride.x; ++sx)
|
||||||
|
{
|
||||||
|
float acc = B.Get(k);
|
||||||
|
for (uint dy = sy; dy < K.GetKernelHeight(); dy += _Stride.y)
|
||||||
|
{
|
||||||
|
for (uint dx = sx; dx < K.GetKernelWidth(); dx += _Stride.x)
|
||||||
|
{
|
||||||
|
uint2 pos = uint2(x, y) + uint2(sx + dx, sy + dy) / _Stride.xy;
|
||||||
|
|
||||||
|
if (any(pos < leftCorner)) continue;
|
||||||
|
if (any(pos >= rightCorner)) continue;
|
||||||
|
|
||||||
|
for (uint c = 0; c < X.channels; ++c)
|
||||||
|
{
|
||||||
|
acc = fastfma( X.Get(n, pos - leftCorner, c),
|
||||||
|
K.Get( K.GetKernelHeight() - 1 - dy,
|
||||||
|
K.GetKernelWidth() - 1 - dx, c, k),
|
||||||
|
acc);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
uint oy = y * _Stride.y + sy;
|
||||||
|
uint ox = x * _Stride.x + sx;
|
||||||
|
if (oy < O.height && ox < O.width)
|
||||||
|
O.Set(n, oy, ox, k, acc);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
#undef L1CACHESIZE
|
||||||
|
#define L1CACHESIZE 64
|
||||||
|
#undef SIZE
|
||||||
|
#define SIZE 2
|
||||||
|
groupshared float Conv2DTrans_L1Cached64_Reg_Loop_safe_X[SIZE*SIZE][L1CACHESIZE];
|
||||||
|
[numthreads(L1CACHESIZE, 1, 1)]
|
||||||
|
void Conv2DTrans_L1Cached64_RegisterBlock2x2(uint3 groupID : SV_GroupID, uint3 groupThreadID : SV_GroupThreadID)
|
||||||
|
{
|
||||||
|
// NOTE: dispatched over X (not O)
|
||||||
|
DISPATCH_ARGS(K.kernelCount, X.width / SIZE, X.height / SIZE);
|
||||||
|
TENSOR_SHARED2_ARGS4(X, K, B, WBK, O);
|
||||||
|
|
||||||
|
#define X_ Conv2DTrans_L1Cached64_Reg_Loop_safe_X
|
||||||
|
|
||||||
|
uint k = L1CACHESIZE * groupID.x + groupThreadID.x;
|
||||||
|
uint x = groupID.y;
|
||||||
|
uint y = groupID.z;
|
||||||
|
|
||||||
|
// need all threads to load channels, thus will do late check against kernel count
|
||||||
|
if (x*SIZE >= X.width) return;
|
||||||
|
if (y*SIZE >= X.height) return;
|
||||||
|
|
||||||
|
uint2 pad = _Pad.xy / _Stride.xy;
|
||||||
|
|
||||||
|
for (uint n = 0; n < O.batch; ++n)
|
||||||
|
{
|
||||||
|
for (uint sy = 0; sy < _Stride.y; ++sy)
|
||||||
|
{
|
||||||
|
for (uint sx = 0; sx < _Stride.x; ++sx)
|
||||||
|
{
|
||||||
|
float acc[SIZE*SIZE];
|
||||||
|
[unroll]
|
||||||
|
for (uint q = 0; q < SIZE*SIZE; ++q)
|
||||||
|
acc[q] = B.SafeGet(k);
|
||||||
|
|
||||||
|
for (uint dy = sy; dy < K.GetKernelHeight(); dy += _Stride.y)
|
||||||
|
{
|
||||||
|
for (uint dx = sx; dx < K.GetKernelWidth(); dx += _Stride.x)
|
||||||
|
{
|
||||||
|
uint2 pos[SIZE*SIZE];
|
||||||
|
[unroll]
|
||||||
|
for (uint q = 0; q < SIZE*SIZE; ++q)
|
||||||
|
pos[q] = uint2(x*SIZE+(q%SIZE), y*SIZE+(q/SIZE)) + uint2(dx+sx, dy+sy) / _Stride.xy;
|
||||||
|
|
||||||
|
for (uint c = 0; c < X.channels; c += L1CACHESIZE)
|
||||||
|
{
|
||||||
|
// Cache X
|
||||||
|
uint dc = groupThreadID.x;
|
||||||
|
[unroll]
|
||||||
|
for (q = 0; q < SIZE*SIZE; ++q)
|
||||||
|
X_[q][dc] = X.SafeGet(n, pos[q], c + dc, pad);
|
||||||
|
GroupMemoryBarrierWithGroupSync();
|
||||||
|
|
||||||
|
// X * K
|
||||||
|
if (k < K.channels) // need all threads to load channels, thus late check against kernel count
|
||||||
|
{
|
||||||
|
//uint kIndex = K.Index(dy, dx, c, k);
|
||||||
|
for (dc = 0; dc < L1CACHESIZE; ++dc)
|
||||||
|
{
|
||||||
|
[unroll]
|
||||||
|
for (q = 0; q < SIZE*SIZE; ++q)
|
||||||
|
acc[q] = fastfma( X_[q][dc],
|
||||||
|
K.Get( K.GetKernelHeight() - 1 - dy,
|
||||||
|
K.GetKernelWidth() - 1 - dx, c + dc, k),
|
||||||
|
acc[q]);
|
||||||
|
//kIndex += K.channels;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
GroupMemoryBarrierWithGroupSync();
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
if (k < K.channels) // need all threads to load channels, thus late check against kernel count
|
||||||
|
[unroll]
|
||||||
|
for (q = 0; q < SIZE*SIZE; ++q)
|
||||||
|
{
|
||||||
|
uint ox = (x*SIZE+(q%SIZE)) * _Stride.x + sx;
|
||||||
|
uint oy = (y*SIZE+(q/SIZE)) * _Stride.y + sy;
|
||||||
|
if (ox < O.width && oy < O.height)
|
||||||
|
O.Set(n, oy, ox, k, acc[q]);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
#undef X_
|
||||||
|
}
|
|
@ -0,0 +1,9 @@
|
||||||
|
fileFormatVersion: 2
|
||||||
|
guid: 7f508b82f984146e8bf0ad8520c316c7
|
||||||
|
timeCreated: 1507457340
|
||||||
|
licenseType: Pro
|
||||||
|
ComputeShaderImporter:
|
||||||
|
currentAPIMask: 196608
|
||||||
|
userData:
|
||||||
|
assetBundleName:
|
||||||
|
assetBundleVariant:
|
|
@ -0,0 +1,418 @@
|
||||||
|
//#pragma kernel Conv2D_Kmod16_Nmod8_KNY
|
||||||
|
//#pragma kernel Conv2D_Cache_KCmod32_KNyx
|
||||||
|
//#pragma kernel Conv2D_Cache_KCmod32_KNyxDiv2
|
||||||
|
// NOTE: DISABLED 64 version because as it is slower than 32 version on AMD GPU
|
||||||
|
//#pragma kernel Conv2D_Cache_KCmod64_KNyx
|
||||||
|
|
||||||
|
#include "Tensor.cginc"
|
||||||
|
|
||||||
|
TENSOR_DECL(X)
|
||||||
|
TENSOR_DECL(K)
|
||||||
|
TENSOR_DECL(B)
|
||||||
|
TENSOR_DECL(WBK)
|
||||||
|
TENSOR_DECL_RW(O)
|
||||||
|
|
||||||
|
uint4 _Pad;
|
||||||
|
uint4 _Stride;
|
||||||
|
|
||||||
|
NUMTHREADS((16,8,1), (16,8,1), (16,4,1))
|
||||||
|
void Conv2D_Kmod16_Nmod8_KNY(uint3 dispatchThreadID : SV_DispatchThreadID)
|
||||||
|
{
|
||||||
|
DISPATCH_ARGS(K.channels, O.batch, O.height);
|
||||||
|
TENSOR_SHARED2_ARGS4(X, K, B, WBK, O);
|
||||||
|
|
||||||
|
uint k = dispatchThreadID.x;
|
||||||
|
uint n = dispatchThreadID.y;
|
||||||
|
uint y = dispatchThreadID.z;
|
||||||
|
|
||||||
|
for (uint x = 0; x < O.width; ++x)
|
||||||
|
{
|
||||||
|
float v = B.Get(k);
|
||||||
|
for (uint dy = 0; dy < K.GetKernelHeight(); ++dy)
|
||||||
|
{
|
||||||
|
for (uint dx = 0; dx < K.GetKernelWidth(); ++dx)
|
||||||
|
{
|
||||||
|
uint oy = y * _Stride.y + dy;
|
||||||
|
uint ox = x * _Stride.x + dx;
|
||||||
|
// @TODO: investigate
|
||||||
|
// WARNING: had to move both y check into the loop (as opposed to checking y in parent loop) - due to potential bug in Metal compiler
|
||||||
|
if (oy < _Pad.y) continue;
|
||||||
|
if (oy - _Pad.w >= X.height) continue;
|
||||||
|
if (ox < _Pad.x) continue;
|
||||||
|
if (ox - _Pad.z >= X.width) continue;
|
||||||
|
|
||||||
|
for (uint c = 0; c < X.channels; ++c)
|
||||||
|
{
|
||||||
|
v += X.Get(n, oy-_Pad.y, ox-_Pad.x, c) * K.Get(dy, dx, c, k);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
O.Set(n, y, x, k, v);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
#undef CTILE
|
||||||
|
#define CTILE NUMTHREAD(16, 8, 8)
|
||||||
|
groupshared float Conv_Xcache[4][CTILE][CTILE];
|
||||||
|
groupshared float Conv_Kcache[4][CTILE][CTILE];
|
||||||
|
[numthreads(CTILE, CTILE, 1)]
|
||||||
|
void Conv2D_Cache_KCmod32_KNyx(uint3 groupID : SV_GroupID, uint3 groupThreadID : SV_GroupThreadID)
|
||||||
|
{
|
||||||
|
DISPATCH_ARGS(K.kernelCount / 2, O.batch * O.height * O.width / 2, 1);
|
||||||
|
TENSOR_SHARED2_ARGS4(X, K, B, WBK, O);
|
||||||
|
|
||||||
|
#define X_ Conv_Xcache
|
||||||
|
#define K_ Conv_Kcache
|
||||||
|
|
||||||
|
uint gx = groupThreadID.x;
|
||||||
|
uint gy = groupThreadID.y;
|
||||||
|
|
||||||
|
uint k = CTILE * groupID.x + groupThreadID.x;
|
||||||
|
uint nyx = CTILE * groupID.y + groupThreadID.y;
|
||||||
|
|
||||||
|
uint width = O.width;
|
||||||
|
uint height = O.height;
|
||||||
|
|
||||||
|
uint x = nyx % width;
|
||||||
|
uint ny = nyx / width;
|
||||||
|
uint y = ny % height;
|
||||||
|
uint n = ny / height;
|
||||||
|
|
||||||
|
float b0 = B.Get(k*2+0);
|
||||||
|
float b1 = B.Get(k*2+1);
|
||||||
|
float4 v = float4(b0, b1,
|
||||||
|
b0, b1);
|
||||||
|
|
||||||
|
for (uint dy = 0; dy < K.GetKernelHeight(); ++dy)
|
||||||
|
{
|
||||||
|
for (uint dx = 0; dx < K.GetKernelWidth(); ++dx)
|
||||||
|
{
|
||||||
|
bool mask = true;
|
||||||
|
uint oy = y * _Stride.y + dy;
|
||||||
|
uint ox = x * _Stride.x + dx;
|
||||||
|
// @TODO: investigate
|
||||||
|
// WARNING: had to move both y check into the loop (as opposed to checking y in parent loop) - due to potential bug in Metal compiler
|
||||||
|
if (oy < _Pad.y) mask = false;
|
||||||
|
if (oy - _Pad.w >= X.height) mask = false;
|
||||||
|
if (ox < _Pad.x) mask = false;
|
||||||
|
if (ox - _Pad.z >= X.width) mask = false;
|
||||||
|
|
||||||
|
for (uint m = 0; m < X.channels/(CTILE*2); ++m)
|
||||||
|
{
|
||||||
|
float x0 = 0;
|
||||||
|
float x1 = 0;
|
||||||
|
float x2 = 0;
|
||||||
|
float x3 = 0;
|
||||||
|
|
||||||
|
if (mask)
|
||||||
|
{
|
||||||
|
x0 = X.Get(n*2+0, oy-_Pad.y, ox-_Pad.x, (m*CTILE + gx)*2+0);
|
||||||
|
x1 = X.Get(n*2+0, oy-_Pad.y, ox-_Pad.x, (m*CTILE + gx)*2+1);
|
||||||
|
x2 = X.Get(n*2+1, oy-_Pad.y, ox-_Pad.x, (m*CTILE + gx)*2+0);
|
||||||
|
x3 = X.Get(n*2+1, oy-_Pad.y, ox-_Pad.x, (m*CTILE + gx)*2+1);
|
||||||
|
}
|
||||||
|
|
||||||
|
float k0 = K.Get(dy, dx, (m*CTILE + gy)*2+0, k*2+0);
|
||||||
|
float k1 = K.Get(dy, dx, (m*CTILE + gy)*2+0, k*2+1);
|
||||||
|
float k2 = K.Get(dy, dx, (m*CTILE + gy)*2+1, k*2+0);
|
||||||
|
float k3 = K.Get(dy, dx, (m*CTILE + gy)*2+1, k*2+1);
|
||||||
|
|
||||||
|
//X_[gy][gx] = float4(x0, x1,
|
||||||
|
// x2, x3);
|
||||||
|
//K_[gy][gx] = float4(k0, k1,
|
||||||
|
// k2, k3);
|
||||||
|
X_[0][gy][gx] = x0;
|
||||||
|
X_[1][gy][gx] = x1;
|
||||||
|
X_[2][gy][gx] = x2;
|
||||||
|
X_[3][gy][gx] = x3;
|
||||||
|
|
||||||
|
K_[0][gy][gx] = k0;
|
||||||
|
K_[1][gy][gx] = k1;
|
||||||
|
K_[2][gy][gx] = k2;
|
||||||
|
K_[3][gy][gx] = k3;
|
||||||
|
|
||||||
|
GroupMemoryBarrierWithGroupSync();
|
||||||
|
|
||||||
|
[unroll]
|
||||||
|
for (uint i = 0; i < CTILE; ++i)
|
||||||
|
{
|
||||||
|
float4 x = //X_[gy][i];
|
||||||
|
float4( X_[0][gy][i],
|
||||||
|
X_[1][gy][i],
|
||||||
|
X_[2][gy][i],
|
||||||
|
X_[3][gy][i]);
|
||||||
|
float4 k = //K_[i][gx];
|
||||||
|
float4( K_[0][i][gx],
|
||||||
|
K_[1][i][gx],
|
||||||
|
K_[2][i][gx],
|
||||||
|
K_[3][i][gx]);
|
||||||
|
|
||||||
|
v.x = mad(k.x, x.x, v.x);
|
||||||
|
v.x = mad(k.z, x.y, v.x);
|
||||||
|
|
||||||
|
v.y = mad(k.y, x.x, v.y);
|
||||||
|
v.y = mad(k.w, x.y, v.y);
|
||||||
|
|
||||||
|
v.z = mad(k.x, x.z, v.z);
|
||||||
|
v.z = mad(k.z, x.w, v.z);
|
||||||
|
|
||||||
|
v.w = mad(k.y, x.z, v.w);
|
||||||
|
v.w = mad(k.w, x.w, v.w);
|
||||||
|
|
||||||
|
//v.x += k.x*x.x + k.z*x.y;
|
||||||
|
//v.y += k.y*x.x + k.w*x.y;
|
||||||
|
//v.z += k.x*x.z + k.z*x.w;
|
||||||
|
//v.w += k.y*x.z + k.w*x.w;
|
||||||
|
}
|
||||||
|
|
||||||
|
GroupMemoryBarrierWithGroupSync();
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
O.Set(n*2+0, y, x, k*2+0, v.x);
|
||||||
|
O.Set(n*2+0, y, x, k*2+1, v.y);
|
||||||
|
O.Set(n*2+1, y, x, k*2+0, v.z);
|
||||||
|
O.Set(n*2+1, y, x, k*2+1, v.w);
|
||||||
|
|
||||||
|
#undef X_
|
||||||
|
#undef K_
|
||||||
|
}
|
||||||
|
|
||||||
|
#undef CTILE
|
||||||
|
//#define CTILE NUMTHREAD(16, 8, 8)
|
||||||
|
#define CTILE 16
|
||||||
|
groupshared float Conv_Xcache2[4][CTILE][CTILE];
|
||||||
|
groupshared float Conv_Kcache2[4][CTILE][CTILE];
|
||||||
|
[numthreads(CTILE, CTILE, 1)]
|
||||||
|
void Conv2D_Cache_KCmod32_KNyxDiv2(uint3 groupID : SV_GroupID, uint3 groupThreadID : SV_GroupThreadID)
|
||||||
|
{
|
||||||
|
DISPATCH_ARGS(K.kernelCount / 2, O.batch * O.height * O.width / 2, 1);
|
||||||
|
TENSOR_SHARED2_ARGS4(X, K, B, WBK, O);
|
||||||
|
|
||||||
|
#define X_ Conv_Xcache2
|
||||||
|
#define K_ Conv_Kcache2
|
||||||
|
|
||||||
|
uint gx = groupThreadID.x;
|
||||||
|
uint gy = groupThreadID.y;
|
||||||
|
|
||||||
|
uint k = CTILE * groupID.x + groupThreadID.x;
|
||||||
|
uint nyx = CTILE * groupID.y + groupThreadID.y;
|
||||||
|
|
||||||
|
uint width = O.width / 2;
|
||||||
|
uint height = O.height;
|
||||||
|
|
||||||
|
uint x = nyx % width;
|
||||||
|
uint ny = nyx / width;
|
||||||
|
uint y = ny % height;
|
||||||
|
uint n = ny / height;
|
||||||
|
|
||||||
|
float b0 = B.Get(k*2+0);
|
||||||
|
float b1 = B.Get(k*2+1);
|
||||||
|
float4 v = float4(b0, b1,
|
||||||
|
b0, b1);
|
||||||
|
|
||||||
|
bool mask = n < O.batch;
|
||||||
|
|
||||||
|
for (uint dy = 0; dy < K.GetKernelHeight(); ++dy)
|
||||||
|
{
|
||||||
|
for (uint dx = 0; dx < K.GetKernelWidth(); ++dx)
|
||||||
|
{
|
||||||
|
// @TODO: investigate
|
||||||
|
// WARNING: had to move both y check into the loop (as opposed to checking y in parent loop) - due to potential bug in Metal compiler
|
||||||
|
bool maskY = mask;
|
||||||
|
uint oy = y * _Stride.y + dy;
|
||||||
|
if (oy < _Pad.y) maskY = false;
|
||||||
|
if (oy - _Pad.w >= X.height) maskY = false;
|
||||||
|
|
||||||
|
bool maskL = maskY;
|
||||||
|
uint oxL = (x*2+0) * _Stride.x + dx;
|
||||||
|
if (oxL < _Pad.x) maskL = false;
|
||||||
|
if (oxL - _Pad.z >= X.width) maskL = false;
|
||||||
|
|
||||||
|
bool maskR = maskY;
|
||||||
|
uint oxR = (x*2+1) * _Stride.x + dx;
|
||||||
|
if (oxR < _Pad.x) maskR = false;
|
||||||
|
if (oxR - _Pad.z >= X.width) maskR = false;
|
||||||
|
|
||||||
|
for (uint m = 0; m < X.channels/(CTILE*2); ++m)
|
||||||
|
{
|
||||||
|
if (maskL)
|
||||||
|
{
|
||||||
|
X_[0][gy][gx] = X.Get(n, oy-_Pad.y, oxL-_Pad.x, (m*CTILE + gx)*2+0);
|
||||||
|
X_[1][gy][gx] = X.Get(n, oy-_Pad.y, oxL-_Pad.x, (m*CTILE + gx)*2+1);
|
||||||
|
}
|
||||||
|
else
|
||||||
|
{
|
||||||
|
X_[0][gy][gx] = X_[1][gy][gx] = 0;
|
||||||
|
}
|
||||||
|
|
||||||
|
if (maskR)
|
||||||
|
{
|
||||||
|
X_[2][gy][gx] = X.Get(n, oy-_Pad.y, oxR-_Pad.x, (m*CTILE + gx)*2+0);
|
||||||
|
X_[3][gy][gx] = X.Get(n, oy-_Pad.y, oxR-_Pad.x, (m*CTILE + gx)*2+1);
|
||||||
|
}
|
||||||
|
else
|
||||||
|
{
|
||||||
|
X_[2][gy][gx] = X_[3][gy][gx] = 0;
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
K_[0][gy][gx] = K.Get(dy, dx, (m*CTILE + gy)*2+0, k*2+0);
|
||||||
|
K_[1][gy][gx] = K.Get(dy, dx, (m*CTILE + gy)*2+0, k*2+1);
|
||||||
|
K_[2][gy][gx] = K.Get(dy, dx, (m*CTILE + gy)*2+1, k*2+0);
|
||||||
|
K_[3][gy][gx] = K.Get(dy, dx, (m*CTILE + gy)*2+1, k*2+1);
|
||||||
|
|
||||||
|
GroupMemoryBarrierWithGroupSync();
|
||||||
|
|
||||||
|
[unroll]
|
||||||
|
for (uint i = 0; i < CTILE; ++i)
|
||||||
|
{
|
||||||
|
float4 x =
|
||||||
|
float4( X_[0][gy][i],
|
||||||
|
X_[1][gy][i],
|
||||||
|
X_[2][gy][i],
|
||||||
|
X_[3][gy][i]);
|
||||||
|
float4 k =
|
||||||
|
float4( K_[0][i][gx],
|
||||||
|
K_[1][i][gx],
|
||||||
|
K_[2][i][gx],
|
||||||
|
K_[3][i][gx]);
|
||||||
|
|
||||||
|
v.x = mad(k.x, x.x, v.x);
|
||||||
|
v.x = mad(k.z, x.y, v.x);
|
||||||
|
|
||||||
|
v.y = mad(k.y, x.x, v.y);
|
||||||
|
v.y = mad(k.w, x.y, v.y);
|
||||||
|
|
||||||
|
v.z = mad(k.x, x.z, v.z);
|
||||||
|
v.z = mad(k.z, x.w, v.z);
|
||||||
|
|
||||||
|
v.w = mad(k.y, x.z, v.w);
|
||||||
|
v.w = mad(k.w, x.w, v.w);
|
||||||
|
}
|
||||||
|
|
||||||
|
GroupMemoryBarrierWithGroupSync();
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
O.Set(n, y, x*2+0, k*2+0, v.x);
|
||||||
|
O.Set(n, y, x*2+0, k*2+1, v.y);
|
||||||
|
if (mask && x*2+1 < O.width)
|
||||||
|
{
|
||||||
|
O.Set(n, y, x*2+1, k*2+0, v.z);
|
||||||
|
O.Set(n, y, x*2+1, k*2+1, v.w);
|
||||||
|
}
|
||||||
|
|
||||||
|
#undef X_
|
||||||
|
#undef K_
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
#undef CTILE
|
||||||
|
//#define CTILE NUMTHREAD(16, 8, 8)
|
||||||
|
#define CTILE 16
|
||||||
|
#define RTILE 4
|
||||||
|
groupshared float Conv_XcacheR[RTILE*RTILE][CTILE*CTILE];
|
||||||
|
groupshared float Conv_KcacheR[RTILE*RTILE][CTILE*CTILE];
|
||||||
|
[numthreads(CTILE, CTILE, 1)]
|
||||||
|
void Conv2D_Cache_KCmod64_KNyx(uint3 groupID : SV_GroupID, uint3 groupThreadID : SV_GroupThreadID)
|
||||||
|
{
|
||||||
|
DISPATCH_ARGS(K.kernelCount / 4, O.batch * O.height * O.width / 4, 1);
|
||||||
|
TENSOR_SHARED2_ARGS4(X, K, B, WBK, O);
|
||||||
|
|
||||||
|
#define X_ Conv_XcacheR
|
||||||
|
#define K_ Conv_KcacheR
|
||||||
|
|
||||||
|
uint gx = groupThreadID.x;
|
||||||
|
uint gy = groupThreadID.y;
|
||||||
|
|
||||||
|
uint k = CTILE * groupID.x + groupThreadID.x;
|
||||||
|
uint nyx = CTILE * groupID.y + groupThreadID.y;
|
||||||
|
|
||||||
|
uint x = nyx % O.width;
|
||||||
|
uint ny = nyx / O.width;
|
||||||
|
uint y = ny % O.height;
|
||||||
|
uint n = ny / O.height;
|
||||||
|
|
||||||
|
float v[RTILE][RTILE];
|
||||||
|
for (uint xxxx = 0; xxxx < RTILE; ++xxxx)
|
||||||
|
{
|
||||||
|
float b = B.Get(k*RTILE+xxxx);
|
||||||
|
for (uint yyyy = 0; yyyy < RTILE; ++yyyy)
|
||||||
|
v[yyyy][xxxx] = b;
|
||||||
|
}
|
||||||
|
|
||||||
|
for (uint dy = 0; dy < K.GetKernelHeight(); ++dy)
|
||||||
|
{
|
||||||
|
for (uint dx = 0; dx < K.GetKernelWidth(); ++dx)
|
||||||
|
{
|
||||||
|
bool mask = true;
|
||||||
|
uint oy = y * _Stride.y + dy;
|
||||||
|
uint ox = x * _Stride.x + dx;
|
||||||
|
// @TODO: investigate
|
||||||
|
// WARNING: had to move both y check into the loop (as opposed to checking y in parent loop) - due to potential bug in Metal compiler
|
||||||
|
if (oy < _Pad.y) mask = false;
|
||||||
|
if (oy - _Pad.w >= X.height) mask = false;
|
||||||
|
if (ox < _Pad.x) mask = false;
|
||||||
|
if (ox - _Pad.z >= X.width) mask = false;
|
||||||
|
|
||||||
|
for (uint m = 0; m < X.channels/(CTILE*RTILE); ++m)
|
||||||
|
{
|
||||||
|
for (uint yy = 0; yy < RTILE; ++yy)
|
||||||
|
for (uint xx = 0; xx < RTILE; ++xx)
|
||||||
|
{
|
||||||
|
if (mask)
|
||||||
|
X_[yy*RTILE+xx][gy*CTILE+gx] = X.Get(n*RTILE+yy, oy - _Pad.y, ox - _Pad.x, (m*CTILE + gx)*RTILE+xx);
|
||||||
|
else
|
||||||
|
X_[yy*RTILE+xx][gy*CTILE+gx] = 0;
|
||||||
|
K_[yy*RTILE+xx][gy*CTILE+gx] = K.Get(dy, dx, (m*CTILE + gy)*RTILE+yy, k*RTILE+xx);
|
||||||
|
}
|
||||||
|
|
||||||
|
GroupMemoryBarrierWithGroupSync();
|
||||||
|
|
||||||
|
for (uint ii = 0; ii < CTILE; ++ii)
|
||||||
|
{
|
||||||
|
float x[RTILE][RTILE];
|
||||||
|
float k[RTILE][RTILE];
|
||||||
|
|
||||||
|
[unroll]
|
||||||
|
for (uint yy = 0; yy < RTILE; ++yy)
|
||||||
|
{
|
||||||
|
[unroll]
|
||||||
|
for (uint xx = 0; xx < RTILE; ++xx)
|
||||||
|
{
|
||||||
|
x[yy][xx] = X_[yy*RTILE+xx][gy*CTILE+ii];
|
||||||
|
k[yy][xx] = K_[yy*RTILE+xx][ii*CTILE+gx];
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
[unroll]
|
||||||
|
for (uint yyy = 0; yyy < RTILE; ++yyy)
|
||||||
|
{
|
||||||
|
[unroll]
|
||||||
|
for (uint xxx = 0; xxx < RTILE; ++xxx)
|
||||||
|
{
|
||||||
|
[unroll]
|
||||||
|
for (uint i = 0; i < RTILE; ++i)
|
||||||
|
{
|
||||||
|
v[yyy][xxx] = mad(x[yyy][i], k[i][xxx], v[yyy][xxx]);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
GroupMemoryBarrierWithGroupSync();
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
for (uint yy = 0; yy < RTILE; ++yy)
|
||||||
|
for (uint xx = 0; xx < RTILE; ++xx)
|
||||||
|
O.Set(n*RTILE+yy, y, x, k*RTILE+xx, v[yy][xx]);
|
||||||
|
|
||||||
|
#undef X_
|
||||||
|
#undef K_
|
||||||
|
}
|
|
@ -0,0 +1,8 @@
|
||||||
|
fileFormatVersion: 2
|
||||||
|
guid: a89bb2d7cde74429c8475f7cd8bcdb01
|
||||||
|
ComputeShaderImporter:
|
||||||
|
externalObjects: {}
|
||||||
|
currentAPIMask: 0
|
||||||
|
userData:
|
||||||
|
assetBundleName:
|
||||||
|
assetBundleVariant:
|
|
@ -0,0 +1,305 @@
|
||||||
|
#pragma kernel Dense_L1Cached64
|
||||||
|
#pragma kernel DenseTiled16x16
|
||||||
|
//#pragma kernel DenseTiled32x32
|
||||||
|
//#pragma kernel DenseTiled64x64
|
||||||
|
|
||||||
|
#include "Tensor.cginc"
|
||||||
|
|
||||||
|
TENSOR_DECL(X)
|
||||||
|
TENSOR_DECL(W)
|
||||||
|
TENSOR_DECL(B)
|
||||||
|
TENSOR_DECL(WBK)
|
||||||
|
TENSOR_DECL_RW(O)
|
||||||
|
|
||||||
|
// NOTE: usually this path is used for <16 batches
|
||||||
|
#undef CACHESIZE
|
||||||
|
#define CACHESIZE 64
|
||||||
|
groupshared float Dense_L1Cached64_X[CACHESIZE];
|
||||||
|
[numthreads(CACHESIZE, 1, 1)]
|
||||||
|
void Dense_L1Cached64(uint3 groupID : SV_GroupID, uint3 groupThreadID : SV_GroupThreadID)
|
||||||
|
{
|
||||||
|
DISPATCH_ARGS(O.flatWidth, O.flatHeight, 1);
|
||||||
|
TENSOR_SHARED2_ARGS4(X, W, B, WBK, O);
|
||||||
|
|
||||||
|
#define X_ Dense_L1Cached64_X
|
||||||
|
|
||||||
|
uint x = CACHESIZE * groupID.x + groupThreadID.x;
|
||||||
|
uint y = groupID.y;
|
||||||
|
|
||||||
|
uint wIndex = W.Index(0, x);
|
||||||
|
|
||||||
|
float acc = B.Get(x);
|
||||||
|
// loop over X columns (flatWidth) and W rows (height) in CACHESIZE steps
|
||||||
|
for (uint i = 0; i < X.GetFlatWidth(); i += CACHESIZE)
|
||||||
|
{
|
||||||
|
// Cache X
|
||||||
|
// coalescent reads
|
||||||
|
X_[groupThreadID.x] = X.SafeGet(y, i + groupThreadID.x);
|
||||||
|
GroupMemoryBarrierWithGroupSync();
|
||||||
|
|
||||||
|
// X * W
|
||||||
|
if (i + CACHESIZE <= X.GetFlatWidth())
|
||||||
|
{
|
||||||
|
[unroll]
|
||||||
|
for (uint di = 0; di < CACHESIZE; ++di)
|
||||||
|
{
|
||||||
|
acc = fastfma(X_[di], W.data[wIndex], acc);
|
||||||
|
wIndex += W.GetFlatWidth();
|
||||||
|
}
|
||||||
|
}
|
||||||
|
else
|
||||||
|
{
|
||||||
|
// handle remainder of the line < CACHESIZE
|
||||||
|
for (uint di = 0; i + di < X.GetFlatWidth(); ++di)
|
||||||
|
{
|
||||||
|
acc = fastfma(X_[di], W.data[wIndex], acc);
|
||||||
|
wIndex += W.GetFlatWidth();
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
GroupMemoryBarrierWithGroupSync();
|
||||||
|
}
|
||||||
|
|
||||||
|
// all threads are needed to load the matrix line, but x might be out of bounds for writing
|
||||||
|
if (x < O.GetFlatWidth())
|
||||||
|
O.Set(y, x, acc);
|
||||||
|
|
||||||
|
#undef X_
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
#undef TILE_WIDTH
|
||||||
|
#define TILE_WIDTH NUMTHREAD(16,8,8)
|
||||||
|
groupshared float DenseTiled_Xcache[TILE_WIDTH][TILE_WIDTH];
|
||||||
|
groupshared float DenseTiled_Wcache[TILE_WIDTH][TILE_WIDTH];
|
||||||
|
[numthreads(TILE_WIDTH,TILE_WIDTH,1)]
|
||||||
|
void DenseTiled16x16(uint3 groupID : SV_GroupID, uint3 groupThreadID : SV_GroupThreadID)
|
||||||
|
{
|
||||||
|
DISPATCH_ARGS(O.flatWidth, O.flatHeight, 1);
|
||||||
|
TENSOR_SHARED2_ARGS4(X, W, B, WBK, O);
|
||||||
|
|
||||||
|
#define X_ DenseTiled_Xcache
|
||||||
|
#define W_ DenseTiled_Wcache
|
||||||
|
|
||||||
|
uint tx = groupThreadID.x;
|
||||||
|
uint ty = groupThreadID.y;
|
||||||
|
uint x = groupID.x*TILE_WIDTH + tx;
|
||||||
|
uint y = groupID.y*TILE_WIDTH + ty;
|
||||||
|
|
||||||
|
bool mask = (x < O.GetFlatWidth() && y < O.GetFlatHeight());
|
||||||
|
|
||||||
|
float v = B.Get(x);
|
||||||
|
for (uint m = 0; m < X.GetFlatWidth()/TILE_WIDTH; ++m)
|
||||||
|
{
|
||||||
|
if (mask)
|
||||||
|
{
|
||||||
|
X_[ty][tx] = X.Get(y, m*TILE_WIDTH + tx);
|
||||||
|
W_[ty][tx] = W.Get(m*TILE_WIDTH + ty, x);
|
||||||
|
}
|
||||||
|
else
|
||||||
|
{
|
||||||
|
X_[ty][tx] = 0;
|
||||||
|
W_[ty][tx] = 0;
|
||||||
|
}
|
||||||
|
|
||||||
|
GroupMemoryBarrierWithGroupSync();
|
||||||
|
|
||||||
|
[unroll]
|
||||||
|
for (uint i = 0; i < TILE_WIDTH; ++i)
|
||||||
|
{
|
||||||
|
v = fastfma(X_[ty][i], W_[i][tx], v);
|
||||||
|
}
|
||||||
|
|
||||||
|
GroupMemoryBarrierWithGroupSync();
|
||||||
|
}
|
||||||
|
|
||||||
|
if (mask)
|
||||||
|
O.Set(y, x, v);
|
||||||
|
|
||||||
|
#undef X_
|
||||||
|
#undef W_
|
||||||
|
}
|
||||||
|
|
||||||
|
#undef TILE_WIDTH
|
||||||
|
#define TILE_WIDTH NUMTHREAD(16,8,8) // 32 crashes on MacBookPro/AMD
|
||||||
|
groupshared float DenseTiled_Xcache32[2*2][TILE_WIDTH][TILE_WIDTH];
|
||||||
|
groupshared float DenseTiled_Wcache32[2*2][TILE_WIDTH][TILE_WIDTH];
|
||||||
|
[numthreads(TILE_WIDTH,TILE_WIDTH,1)]
|
||||||
|
void DenseTiled32x32(uint3 groupID : SV_GroupID, uint3 groupThreadID : SV_GroupThreadID)
|
||||||
|
{
|
||||||
|
DISPATCH_ARGS(O.flatWidth / 2, O.flatHeight / 2, 1);
|
||||||
|
TENSOR_SHARED2_ARGS4(X, W, B, WBK, O);
|
||||||
|
|
||||||
|
#define X_ DenseTiled_Xcache32
|
||||||
|
#define W_ DenseTiled_Wcache32
|
||||||
|
|
||||||
|
uint tx = groupThreadID.x;
|
||||||
|
uint ty = groupThreadID.y;
|
||||||
|
uint x = groupID.x*TILE_WIDTH + tx;
|
||||||
|
uint y = groupID.y*TILE_WIDTH + ty;
|
||||||
|
|
||||||
|
float b0 = B.Get(x*2+0);
|
||||||
|
float b1 = B.Get(x*2+1);
|
||||||
|
float4 v = float4(b0, b1,
|
||||||
|
b0, b1);
|
||||||
|
|
||||||
|
for (uint m = 0; m < X.GetFlatWidth()/(TILE_WIDTH*2);)
|
||||||
|
{
|
||||||
|
float x0 = X.Get(y*2+0, m*TILE_WIDTH*2 + tx*2+0);
|
||||||
|
float x1 = X.Get(y*2+0, m*TILE_WIDTH*2 + tx*2+1);
|
||||||
|
float x2 = X.Get(y*2+1, m*TILE_WIDTH*2 + tx*2+0);
|
||||||
|
float x3 = X.Get(y*2+1, m*TILE_WIDTH*2 + tx*2+1);
|
||||||
|
|
||||||
|
float w0 = W.Get(m*TILE_WIDTH*2 + ty*2+0, x*2+0);
|
||||||
|
float w1 = W.Get(m*TILE_WIDTH*2 + ty*2+0, x*2+1);
|
||||||
|
float w2 = W.Get(m*TILE_WIDTH*2 + ty*2+1, x*2+0);
|
||||||
|
float w3 = W.Get(m*TILE_WIDTH*2 + ty*2+1, x*2+1);
|
||||||
|
|
||||||
|
++m;
|
||||||
|
|
||||||
|
X_[0][ty][tx] = x0;
|
||||||
|
X_[1][ty][tx] = x1;
|
||||||
|
X_[2][ty][tx] = x2;
|
||||||
|
X_[3][ty][tx] = x3;
|
||||||
|
|
||||||
|
W_[0][ty][tx] = w0;
|
||||||
|
W_[1][ty][tx] = w1;
|
||||||
|
W_[2][ty][tx] = w2;
|
||||||
|
W_[3][ty][tx] = w3;
|
||||||
|
|
||||||
|
GroupMemoryBarrierWithGroupSync();
|
||||||
|
|
||||||
|
[unroll]
|
||||||
|
for (uint i = 0; i < TILE_WIDTH; ++i)
|
||||||
|
{
|
||||||
|
float4 x =
|
||||||
|
float4( X_[0][ty][i],
|
||||||
|
X_[1][ty][i],
|
||||||
|
X_[2][ty][i],
|
||||||
|
X_[3][ty][i]);
|
||||||
|
float4 w =
|
||||||
|
float4( W_[0][i][tx],
|
||||||
|
W_[1][i][tx],
|
||||||
|
W_[2][i][tx],
|
||||||
|
W_[3][i][tx]);
|
||||||
|
|
||||||
|
v.x = fastfma(w.x, x.x, v.x);
|
||||||
|
v.y = fastfma(w.y, x.x, v.y);
|
||||||
|
v.z = fastfma(w.x, x.z, v.z);
|
||||||
|
v.w = fastfma(w.y, x.z, v.w);
|
||||||
|
|
||||||
|
v.x = fastfma(w.z, x.y, v.x);
|
||||||
|
v.y = fastfma(w.w, x.y, v.y);
|
||||||
|
v.z = fastfma(w.z, x.w, v.z);
|
||||||
|
v.w = fastfma(w.w, x.w, v.w);
|
||||||
|
}
|
||||||
|
|
||||||
|
GroupMemoryBarrierWithGroupSync();
|
||||||
|
}
|
||||||
|
|
||||||
|
O.Set(y*2+0, x*2+0, v.x);
|
||||||
|
O.Set(y*2+0, x*2+1, v.y);
|
||||||
|
O.Set(y*2+1, x*2+0, v.z);
|
||||||
|
O.Set(y*2+1, x*2+1, v.w);
|
||||||
|
|
||||||
|
#undef X_
|
||||||
|
#undef W_
|
||||||
|
}
|
||||||
|
|
||||||
|
#undef TILE_WIDTH
|
||||||
|
#define TILE_WIDTH NUMTHREAD(16,8,8)
|
||||||
|
groupshared float DenseTiled_Xcache64[4*4][TILE_WIDTH*TILE_WIDTH];
|
||||||
|
groupshared float DenseTiled_Wcache64[4*4][TILE_WIDTH*TILE_WIDTH];
|
||||||
|
[numthreads(TILE_WIDTH,TILE_WIDTH,1)]
|
||||||
|
void DenseTiled64x64(uint3 groupID : SV_GroupID, uint3 groupThreadID : SV_GroupThreadID)
|
||||||
|
{
|
||||||
|
DISPATCH_ARGS(O.flatWidth / 4, O.flatHeight / 4, 1);
|
||||||
|
TENSOR_SHARED2_ARGS4(X, W, B, WBK, O);
|
||||||
|
|
||||||
|
#define X_ DenseTiled_Xcache64
|
||||||
|
#define W_ DenseTiled_Wcache64
|
||||||
|
|
||||||
|
uint tx = groupThreadID.x;
|
||||||
|
uint ty = groupThreadID.y;
|
||||||
|
uint x = groupID.x*TILE_WIDTH + tx;
|
||||||
|
uint y = groupID.y*TILE_WIDTH + ty;
|
||||||
|
|
||||||
|
float b0 = B.Get(x*4+0);
|
||||||
|
float b1 = B.Get(x*4+1);
|
||||||
|
float b2 = B.Get(x*4+2);
|
||||||
|
float b3 = B.Get(x*4+3);
|
||||||
|
|
||||||
|
float4 v0, v1, v2, v3;
|
||||||
|
v0 = v1 = v2 = v3 = float4(b0, b1, b2, b3);
|
||||||
|
|
||||||
|
for (uint m = 0; m < X.GetFlatWidth()/(TILE_WIDTH*4); ++m)
|
||||||
|
{
|
||||||
|
for (uint yy = 0; yy < 4; ++yy)
|
||||||
|
for (uint xx = 0; xx < 4; ++xx)
|
||||||
|
{
|
||||||
|
X_[yy*4+xx][ty*TILE_WIDTH+tx] = X.Get(y*4+yy, (m*TILE_WIDTH + tx)*4+xx);
|
||||||
|
W_[yy*4+xx][ty*TILE_WIDTH+tx] = W.Get((m*TILE_WIDTH + ty)*4+yy, x*4+xx);
|
||||||
|
}
|
||||||
|
|
||||||
|
GroupMemoryBarrierWithGroupSync();
|
||||||
|
|
||||||
|
for (uint i = 0; i < TILE_WIDTH; ++i)
|
||||||
|
{
|
||||||
|
[unroll]
|
||||||
|
for (uint q = 0; q < 4; ++q)
|
||||||
|
{
|
||||||
|
float x0 = X_[0*4+q][ty*TILE_WIDTH+i];
|
||||||
|
float x1 = X_[1*4+q][ty*TILE_WIDTH+i];
|
||||||
|
float x2 = X_[2*4+q][ty*TILE_WIDTH+i];
|
||||||
|
float x3 = X_[3*4+q][ty*TILE_WIDTH+i];
|
||||||
|
|
||||||
|
float w0 = W_[q*4+0][i*TILE_WIDTH+tx];
|
||||||
|
float w1 = W_[q*4+1][i*TILE_WIDTH+tx];
|
||||||
|
float w2 = W_[q*4+2][i*TILE_WIDTH+tx];
|
||||||
|
float w3 = W_[q*4+3][i*TILE_WIDTH+tx];
|
||||||
|
|
||||||
|
v0.x = fastfma(x0, w0, v0.x); //--
|
||||||
|
v1.x = fastfma(x1, w0, v1.x);
|
||||||
|
v2.x = fastfma(x2, w0, v2.x);
|
||||||
|
v3.x = fastfma(x3, w0, v3.x);
|
||||||
|
v0.y = fastfma(x0, w1, v0.y); //--
|
||||||
|
v1.y = fastfma(x1, w1, v1.y);
|
||||||
|
v2.y = fastfma(x2, w1, v2.y);
|
||||||
|
v3.y = fastfma(x3, w1, v3.y);
|
||||||
|
v0.z = fastfma(x0, w2, v0.z); //--
|
||||||
|
v1.z = fastfma(x1, w2, v1.z);
|
||||||
|
v2.z = fastfma(x2, w2, v2.z);
|
||||||
|
v3.z = fastfma(x3, w2, v3.z);
|
||||||
|
v0.w = fastfma(x0, w3, v0.w); //--
|
||||||
|
v1.w = fastfma(x1, w3, v1.w);
|
||||||
|
v2.w = fastfma(x2, w3, v2.w);
|
||||||
|
v3.w = fastfma(x3, w3, v3.w);
|
||||||
|
}
|
||||||
|
|
||||||
|
GroupMemoryBarrierWithGroupSync();
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
O.Set(y*4+0, x*4+0, v0.x);
|
||||||
|
O.Set(y*4+0, x*4+1, v0.y);
|
||||||
|
O.Set(y*4+0, x*4+2, v0.z);
|
||||||
|
O.Set(y*4+0, x*4+3, v0.w);
|
||||||
|
|
||||||
|
O.Set(y*4+1, x*4+0, v1.x);
|
||||||
|
O.Set(y*4+1, x*4+1, v1.y);
|
||||||
|
O.Set(y*4+1, x*4+2, v1.z);
|
||||||
|
O.Set(y*4+1, x*4+3, v1.w);
|
||||||
|
|
||||||
|
O.Set(y*4+2, x*4+0, v2.x);
|
||||||
|
O.Set(y*4+2, x*4+1, v2.y);
|
||||||
|
O.Set(y*4+2, x*4+2, v2.z);
|
||||||
|
O.Set(y*4+2, x*4+3, v2.w);
|
||||||
|
|
||||||
|
O.Set(y*4+3, x*4+0, v3.x);
|
||||||
|
O.Set(y*4+3, x*4+1, v3.y);
|
||||||
|
O.Set(y*4+3, x*4+2, v3.z);
|
||||||
|
O.Set(y*4+3, x*4+3, v3.w);
|
||||||
|
|
||||||
|
#undef X_
|
||||||
|
#undef W_
|
||||||
|
}
|
|
@ -0,0 +1,9 @@
|
||||||
|
fileFormatVersion: 2
|
||||||
|
guid: 6b08c0ac202ad41deb8881132b21894c
|
||||||
|
timeCreated: 1507457322
|
||||||
|
licenseType: Pro
|
||||||
|
ComputeShaderImporter:
|
||||||
|
currentAPIMask: 196608
|
||||||
|
userData:
|
||||||
|
assetBundleName:
|
||||||
|
assetBundleVariant:
|
|
@ -0,0 +1,72 @@
|
||||||
|
#pragma kernel DenseFP16Div2
|
||||||
|
|
||||||
|
#include "Tensor.cginc"
|
||||||
|
|
||||||
|
TENSOR_DECL(X)
|
||||||
|
TENSOR_DECL(W)
|
||||||
|
TENSOR_DECL(B)
|
||||||
|
TENSOR_DECL(WBK)
|
||||||
|
TENSOR_DECL_RW(O)
|
||||||
|
|
||||||
|
float f16tof32_(uint src)
|
||||||
|
{
|
||||||
|
// Based on Fabian Giesen's public domain half_to_float_fast3
|
||||||
|
const uint magic = 113 << 23;
|
||||||
|
const uint shiftedExp = 0x7c00 << 13; // exponent mask after shift
|
||||||
|
|
||||||
|
// Mask out sign bit
|
||||||
|
uint o = src & 0x7fff;
|
||||||
|
if (o)
|
||||||
|
{
|
||||||
|
// Move exponent + mantissa to correct bits
|
||||||
|
o <<= 13;
|
||||||
|
uint exponent = o & shiftedExp;
|
||||||
|
if (exponent == 0)
|
||||||
|
{
|
||||||
|
// Handle denormal
|
||||||
|
o = asuint(asfloat(o + magic) - asfloat(magic));
|
||||||
|
}
|
||||||
|
else if (exponent == shiftedExp) // Inf/NaN
|
||||||
|
o += (255 - 31) << 23;
|
||||||
|
else
|
||||||
|
o += (127 - 15) << 23;
|
||||||
|
}
|
||||||
|
|
||||||
|
// Copy sign bit
|
||||||
|
o |= (src & 0x8000) << 16;
|
||||||
|
|
||||||
|
return asfloat(o);
|
||||||
|
}
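// Sanity check: f16tof32_(0x3C00) == 1.0f and f16tof32_(0xBC00) == -1.0f
// (half-precision 1.0 with either sign bit), matching the hardware f16tof32 for normal values.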
|
||||||
|
|
||||||
|
float2 Unpack(SharedTensor t, uint y, uint x)
|
||||||
|
{
|
||||||
|
uint v = asuint(t.data[t.Index(y, x) >> 1]);
|
||||||
|
// TEMPORARY: f16tof32 is broken in GLSL/Metal compiler
|
||||||
|
// using custom conversion function for now
|
||||||
|
//return float2(f16tof32(v), f16tof32(v>>16));
|
||||||
|
return float2(f16tof32_(v), f16tof32_(v>>16));
|
||||||
|
}
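// Note: each 32-bit element of the buffer holds two packed fp16 values, so Unpack(t, y, x)
// reads the element at flat index Index(y, x) / 2 and returns the low 16 bits in .x and the
// high 16 bits in .y; this is why the kernel below writes two output columns per thread.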
|
||||||
|
|
||||||
|
// NOTE: usually this path is used for <16 batches
|
||||||
|
NUMTHREADS((256,1,1), (128,1,1), (64,1,1))
|
||||||
|
void DenseFP16Div2(uint3 dispatchThreadID : SV_DispatchThreadID)
|
||||||
|
{
|
||||||
|
DISPATCH_ARGS(O.flatWidth/2, O.flatHeight, 1);
|
||||||
|
TENSOR_SHARED2_ARGS4(X, W, B, WBK, O);
|
||||||
|
|
||||||
|
uint x = dispatchThreadID.x;
|
||||||
|
uint y = dispatchThreadID.y;
|
||||||
|
|
||||||
|
if (x*2 >= O.GetFlatWidth()) return;
|
||||||
|
if (y >= O.GetFlatHeight()) return;
|
||||||
|
|
||||||
|
float2 acc = Unpack(B, 0, x*2);
|
||||||
|
for (uint i = 0; i < X.width; ++i)
|
||||||
|
{
|
||||||
|
float2 w = Unpack(W, i, x*2);
|
||||||
|
acc += X.Get(y, i) * w;
|
||||||
|
}
|
||||||
|
|
||||||
|
O.Set(y, x*2+0, acc[0]);
|
||||||
|
O.Set(y, x*2+1, acc[1]);
|
||||||
|
}
|
|
@ -0,0 +1,9 @@
|
||||||
|
fileFormatVersion: 2
|
||||||
|
guid: cff3cb66e54744fa4888ef91a11ec90c
|
||||||
|
timeCreated: 1508334838
|
||||||
|
licenseType: Pro
|
||||||
|
ComputeShaderImporter:
|
||||||
|
currentAPIMask: 196608
|
||||||
|
userData:
|
||||||
|
assetBundleName:
|
||||||
|
assetBundleVariant:
|
Diff for this file is not shown because of its large size.
|
@ -0,0 +1,9 @@
|
||||||
|
fileFormatVersion: 2
|
||||||
|
guid: 299ca130202014274b506123e830c52d
|
||||||
|
timeCreated: 1506672486
|
||||||
|
licenseType: Pro
|
||||||
|
ComputeShaderImporter:
|
||||||
|
currentAPIMask: 196608
|
||||||
|
userData:
|
||||||
|
assetBundleName:
|
||||||
|
assetBundleVariant:
|
|
@ -0,0 +1,188 @@
|
||||||
|
//#pragma kernel Dense64
|
||||||
|
//#pragma kernel Conv2D_Kernel3x3_64
|
||||||
|
|
||||||
|
#include "Tensor.cginc"
|
||||||
|
|
||||||
|
TENSOR_DECL(X)
|
||||||
|
TENSOR_DECL(W)
|
||||||
|
TENSOR_DECL(K)
|
||||||
|
TENSOR_DECL(B)
|
||||||
|
TENSOR_DECL(WBK)
|
||||||
|
TENSOR_DECL_RW(O)
|
||||||
|
|
||||||
|
uint4 _Pad;
|
||||||
|
uint4 _Stride;
|
||||||
|
|
||||||
|
#undef THREAD_COUNT
|
||||||
|
#define THREAD_COUNT 64 // ATM support only 8x8
|
||||||
|
|
||||||
|
#undef BLOCK_WIDTH
|
||||||
|
#define BLOCK_WIDTH 8
|
||||||
|
|
||||||
|
#undef LOAD_WIDTH
|
||||||
|
#define LOAD_WIDTH THREAD_COUNT
|
||||||
|
|
||||||
|
#undef LOAD_DEPTH
|
||||||
|
#define LOAD_DEPTH BLOCK_WIDTH
|
||||||
|
|
||||||
|
groupshared float DenseTiled_XcacheR[LOAD_DEPTH][LOAD_WIDTH];
|
||||||
|
groupshared float DenseTiled_WcacheR[LOAD_DEPTH][LOAD_WIDTH];
|
||||||
|
|
||||||
|
[numthreads(THREAD_COUNT, 1, 1)]
|
||||||
|
void Dense64(uint3 groupID : SV_GroupID, uint3 groupThreadID : SV_GroupThreadID)
|
||||||
|
{
|
||||||
|
// @TODO: DISPATCH_ARGS(...)
|
||||||
|
TENSOR_SHARED2_ARGS4(X, W, B, WBK, O);
|
||||||
|
|
||||||
|
#define X_ DenseTiled_XcacheR
|
||||||
|
#define W_ DenseTiled_WcacheR
|
||||||
|
|
||||||
|
uint id = groupThreadID.x;
|
||||||
|
uint bx = groupID.x;
|
||||||
|
uint by = groupID.y;
|
||||||
|
|
||||||
|
uint bbx = id % BLOCK_WIDTH;
|
||||||
|
uint bby = id / BLOCK_WIDTH;
|
||||||
|
|
||||||
|
float v[BLOCK_WIDTH][BLOCK_WIDTH];
|
||||||
|
for (uint yy = 0; yy < BLOCK_WIDTH; ++yy)
|
||||||
|
for (uint xx = 0; xx < BLOCK_WIDTH; ++xx)
|
||||||
|
{
|
||||||
|
float bias = B.Get(bx*LOAD_WIDTH + bbx*BLOCK_WIDTH + xx);
|
||||||
|
v[yy][xx] = bias;
|
||||||
|
}
|
||||||
|
|
||||||
|
for (uint m = 0; m < X.GetFlatWidth()/LOAD_DEPTH; ++m)
|
||||||
|
{
|
||||||
|
for (uint q = 0; q < LOAD_DEPTH; ++q)
|
||||||
|
{
|
||||||
|
X_[q][id] = X.Get(by*LOAD_WIDTH + id, m*LOAD_DEPTH + q);
|
||||||
|
W_[q][id] = W.Get(m*LOAD_DEPTH + q, bx*LOAD_WIDTH + id);
|
||||||
|
}
|
||||||
|
|
||||||
|
GroupMemoryBarrierWithGroupSync();
|
||||||
|
|
||||||
|
for (uint yyy = 0; yyy < BLOCK_WIDTH; ++yyy)
|
||||||
|
[unroll] for (uint xxx = 0; xxx < BLOCK_WIDTH; ++xxx)
|
||||||
|
[unroll] for (uint i = 0; i < LOAD_DEPTH; ++i)
|
||||||
|
{
|
||||||
|
v[yyy][xxx] = mad(X_[i][bby*BLOCK_WIDTH + yyy], W_[i][bbx*BLOCK_WIDTH + xxx], v[yyy][xxx]);
|
||||||
|
}
|
||||||
|
|
||||||
|
GroupMemoryBarrierWithGroupSync();
|
||||||
|
}
|
||||||
|
|
||||||
|
for (uint yyy = 0; yyy < BLOCK_WIDTH; ++yyy)
|
||||||
|
for (uint xxx = 0; xxx < BLOCK_WIDTH; ++xxx)
|
||||||
|
O.Set(by*LOAD_WIDTH + bby*BLOCK_WIDTH + yyy, bx*LOAD_WIDTH + bbx*BLOCK_WIDTH + xxx, v[yyy][xxx]);
|
||||||
|
|
||||||
|
#undef X_
|
||||||
|
#undef W_
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
#undef THREAD_COUNT
|
||||||
|
#define THREAD_COUNT 64 // ATM support only 8x8
|
||||||
|
|
||||||
|
#undef BLOCK_WIDTH
|
||||||
|
#define BLOCK_WIDTH 8
|
||||||
|
|
||||||
|
#undef LOAD_WIDTH
|
||||||
|
#define LOAD_WIDTH THREAD_COUNT
|
||||||
|
|
||||||
|
#undef LOAD_DEPTH
|
||||||
|
#define LOAD_DEPTH BLOCK_WIDTH
|
||||||
|
|
||||||
|
groupshared float Conv_KcacheR[LOAD_DEPTH][LOAD_WIDTH];
|
||||||
|
groupshared float Conv_XcacheR[LOAD_DEPTH][LOAD_WIDTH];
|
||||||
|
[numthreads(THREAD_COUNT, 1, 1)]
|
||||||
|
void Conv2D_Kernel3x3_64(uint3 groupID : SV_GroupID, uint3 groupThreadID : SV_GroupThreadID)
|
||||||
|
{
|
||||||
|
// @TODO: DISPATCH_ARGS(...)
|
||||||
|
TENSOR_SHARED2_ARGS4(X, K, B, WBK, O);
|
||||||
|
|
||||||
|
#define X_ Conv_XcacheR
|
||||||
|
#define K_ Conv_KcacheR
|
||||||
|
|
||||||
|
uint id = groupThreadID.x;
|
||||||
|
uint bx = groupID.x;
|
||||||
|
uint by = groupID.y;
|
||||||
|
|
||||||
|
uint bbx = id % BLOCK_WIDTH;
|
||||||
|
uint bby = id / BLOCK_WIDTH;
|
||||||
|
|
||||||
|
uint width = O.width;
|
||||||
|
uint height = O.height;
|
||||||
|
|
||||||
|
// ASSERT(LOAD_WIDTH == THREAD_COUNT)
|
||||||
|
uint loadNYX = by*LOAD_WIDTH + id; // only works for 8x8
|
||||||
|
uint loadX = loadNYX % width;
|
||||||
|
uint loadNY = loadNYX / width;
|
||||||
|
uint loadY = loadNY % height;
|
||||||
|
uint loadN = loadNY / height;
|
||||||
|
|
||||||
|
// @TODO: validate that _Stride works, added the following 2 lines without testing
|
||||||
|
loadX *= _Stride.x;
|
||||||
|
loadY *= _Stride.y;
|
||||||
|
|
||||||
|
float v[BLOCK_WIDTH][BLOCK_WIDTH];
|
||||||
|
[unroll] for (uint yy = 0; yy < BLOCK_WIDTH; ++yy)
|
||||||
|
[unroll] for (uint xx = 0; xx < BLOCK_WIDTH; ++xx)
|
||||||
|
{
|
||||||
|
float bias = B.Get(bx*LOAD_WIDTH + bbx*BLOCK_WIDTH + xx);
|
||||||
|
v[yy][xx] = bias;
|
||||||
|
}
|
||||||
|
|
||||||
|
for (uint dy = 0; dy < 3; ++dy)
|
||||||
|
{
|
||||||
|
bool mask = true;
|
||||||
|
|
||||||
|
if (loadY+dy < _Pad.y) mask = false;
|
||||||
|
if (loadY+dy - _Pad.w >= X.height) mask = false;
|
||||||
|
|
||||||
|
for (uint dx = 0; dx < 3; ++dx)
|
||||||
|
{
|
||||||
|
if (loadX+dx < _Pad.x) mask = false;
|
||||||
|
if (loadX+dx - _Pad.z >= X.width) mask = false;
|
||||||
|
|
||||||
|
for (uint m = 0; m < X.channels/LOAD_DEPTH; ++m)
|
||||||
|
{
|
||||||
|
for (uint q = 0; q < LOAD_DEPTH; ++q)
|
||||||
|
{
|
||||||
|
if (mask)
|
||||||
|
X_[q][id] = X.Get(loadN, loadY+dy-_Pad.y, loadX+dx-_Pad.x, m*LOAD_DEPTH + q);
|
||||||
|
else
|
||||||
|
X_[q][id] = 0;
|
||||||
|
K_[q][id] = K.Get(dy, dx, m*LOAD_DEPTH + q, bx*LOAD_WIDTH + id);
|
||||||
|
}
|
||||||
|
|
||||||
|
GroupMemoryBarrierWithGroupSync();
|
||||||
|
|
||||||
|
for (uint yyy = 0; yyy < BLOCK_WIDTH; ++yyy)
|
||||||
|
[unroll] for (uint xxx = 0; xxx < BLOCK_WIDTH; ++xxx)
|
||||||
|
[unroll] for (uint i = 0; i < LOAD_DEPTH; ++i)
|
||||||
|
{
|
||||||
|
v[yyy][xxx] += X_[i][bby*BLOCK_WIDTH + yyy] * K_[i][bbx*BLOCK_WIDTH + xxx];
|
||||||
|
}
|
||||||
|
|
||||||
|
GroupMemoryBarrierWithGroupSync();
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
[unroll] for (uint yyy = 0; yyy < BLOCK_WIDTH; ++yyy)
|
||||||
|
[unroll] for (uint xxx = 0; xxx < BLOCK_WIDTH; ++xxx)
|
||||||
|
{
|
||||||
|
uint saveNYX = by*LOAD_WIDTH + bby*BLOCK_WIDTH + yyy;
|
||||||
|
uint saveX = saveNYX % width;
|
||||||
|
uint saveNY = saveNYX / width;
|
||||||
|
uint saveY = saveNY % height;
|
||||||
|
uint saveN = saveNY / height;
|
||||||
|
|
||||||
|
uint saveK = bx*LOAD_WIDTH + bbx*BLOCK_WIDTH + xxx;
|
||||||
|
O.Set(saveN, saveY, saveX, saveK, v[yyy][xxx]);
|
||||||
|
}
|
||||||
|
|
||||||
|
#undef X_
|
||||||
|
#undef K_
|
||||||
|
}
|
|
@ -0,0 +1,9 @@
|
||||||
|
fileFormatVersion: 2
|
||||||
|
guid: c7c673db45e6845d5abaed4ed5ef42e1
|
||||||
|
timeCreated: 1507294253
|
||||||
|
licenseType: Pro
|
||||||
|
ComputeShaderImporter:
|
||||||
|
currentAPIMask: 196608
|
||||||
|
userData:
|
||||||
|
assetBundleName:
|
||||||
|
assetBundleVariant:
|
|
@ -0,0 +1,339 @@
|
||||||
|
#pragma kernel ScaleBias
|
||||||
|
#pragma kernel ScaleBias_CNyx
|
||||||
|
#pragma kernel Upsample2D
|
||||||
|
#pragma kernel AvgPool2D
|
||||||
|
#pragma kernel MaxPool2D
|
||||||
|
#pragma kernel AvgPool2D_NoPads
|
||||||
|
#pragma kernel MaxPool2D_NoPads
|
||||||
|
//#pragma kernel MaxPool2D_Pool2x2_NoPads
|
||||||
|
#pragma kernel GlobalAvgPool2D
|
||||||
|
#pragma kernel InstanceNorm
|
||||||
|
#pragma kernel Copy
|
||||||
|
|
||||||
|
#include "Tensor.cginc"
|
||||||
|
|
||||||
|
TENSOR_DECL(X)
|
||||||
|
TENSOR_DECL(W)
|
||||||
|
TENSOR_DECL(B)
|
||||||
|
TENSOR_DECL(WBK)
|
||||||
|
TENSOR_DECL_RW(O)
|
||||||
|
|
||||||
|
uint4 _Pool;
|
||||||
|
uint4 _Stride;
|
||||||
|
uint4 _Pad;
|
||||||
|
float _Alpha;
|
||||||
|
|
||||||
|
NUMTHREADS((4,8,8), (4,8,4), (4,4,4))
|
||||||
|
void ScaleBias(uint3 dispatchThreadID : SV_DispatchThreadID)
|
||||||
|
{
|
||||||
|
DISPATCH_ARGS(O.channels, O.width, O.height);
|
||||||
|
TENSOR_SHARED2_ARGS4(X, W, B, WBK, O);
|
||||||
|
|
||||||
|
uint c = dispatchThreadID.x;
|
||||||
|
uint x = dispatchThreadID.y;
|
||||||
|
uint y = dispatchThreadID.z;
|
||||||
|
|
||||||
|
if (c >= O.channels) return;
|
||||||
|
if (x >= O.width) return;
|
||||||
|
if (y >= O.height) return;
|
||||||
|
|
||||||
|
float bias = B.Get(0, 0, 0, c);
|
||||||
|
float scale = W.Get(0, 0, 0, c);
|
||||||
|
|
||||||
|
for (uint n = 0; n < X.batch; ++n)
|
||||||
|
{
|
||||||
|
float v = X.Get(n, y, x, c);
|
||||||
|
v = v * scale + bias;
|
||||||
|
O.Set(n, y, x, c, v);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
NUMTHREADS((16,16,1), (16,8,1), (16,4,1))
|
||||||
|
void ScaleBias_CNyx(uint3 dispatchThreadID : SV_DispatchThreadID)
|
||||||
|
{
|
||||||
|
DISPATCH_ARGS(O.channels, O.batch * O.height * O.width, 1);
|
||||||
|
TENSOR_SHARED2_ARGS4(X, W, B, WBK, O);
|
||||||
|
|
||||||
|
uint c = dispatchThreadID.x;
|
||||||
|
uint nyx = dispatchThreadID.y;
|
||||||
|
|
||||||
|
uint x = nyx % X.width;
|
||||||
|
uint ny = nyx / X.width;
|
||||||
|
uint y = ny % X.height;
|
||||||
|
uint n = ny / X.height;
|
||||||
|
|
||||||
|
if (c >= X.channels) return;
|
||||||
|
if (n >= X.batch) return;
|
||||||
|
|
||||||
|
float bias = B.Get(0, 0, 0, c);
|
||||||
|
float scale = W.Get(0, 0, 0, c);
|
||||||
|
|
||||||
|
float v = X.Get(n, y, x, c);
|
||||||
|
v = v * scale + bias;
|
||||||
|
O.Set(n, y, x, c, v);
|
||||||
|
}
|
||||||
|
|
||||||
|
NUMTHREADS((4,8,8), (4,8,4), (4,4,4))
|
||||||
|
void Upsample2D(uint3 dispatchThreadID : SV_DispatchThreadID)
|
||||||
|
{
|
||||||
|
// NOTE: dispatched over X (not O)
|
||||||
|
DISPATCH_ARGS(X.channels, X.width, X.height);
|
||||||
|
TENSOR_ARGS2(X, O);
|
||||||
|
|
||||||
|
uint c = dispatchThreadID.x;
|
||||||
|
uint x = dispatchThreadID.y;
|
||||||
|
uint y = dispatchThreadID.z;
|
||||||
|
|
||||||
|
if (c >= X.channels) return;
|
||||||
|
if (x >= X.width) return;
|
||||||
|
if (y >= X.height) return;
|
||||||
|
|
||||||
|
for (uint n = 0; n < O.batch; ++n)
|
||||||
|
{
|
||||||
|
float v = X.Get(n, y, x, c);
|
||||||
|
|
||||||
|
for (uint dy = 0; dy < _Pool.y; ++dy)
|
||||||
|
for (uint dx = 0; dx < _Pool.x; ++dx)
|
||||||
|
{
|
||||||
|
uint oy = y * _Pool.y + dy;
|
||||||
|
uint ox = x * _Pool.x + dx;
|
||||||
|
O.Set(n, oy, ox, c, v);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
NUMTHREADS((4,8,8), (4,8,4), (4,4,4))
|
||||||
|
void MaxPool2D(uint3 dispatchThreadID : SV_DispatchThreadID)
|
||||||
|
{
|
||||||
|
DISPATCH_ARGS(O.channels, O.width, O.height);
|
||||||
|
TENSOR_ARGS2(X, O);
|
||||||
|
|
||||||
|
uint c = dispatchThreadID.x;
|
||||||
|
uint x = dispatchThreadID.y;
|
||||||
|
uint y = dispatchThreadID.z;
|
||||||
|
|
||||||
|
if (c >= O.channels) return;
|
||||||
|
if (x >= O.width) return;
|
||||||
|
if (y >= O.height) return;
|
||||||
|
|
||||||
|
for (uint n = 0; n < X.batch; ++n)
|
||||||
|
{
|
||||||
|
float maxV = -FLT_MAX;
|
||||||
|
for (uint dy = 0; dy < _Pool.y; ++dy)
|
||||||
|
for (uint dx = 0; dx < _Pool.x; ++dx)
|
||||||
|
{
|
||||||
|
uint oy = y * _Stride.y + dy;
|
||||||
|
uint ox = x * _Stride.x + dx;
|
||||||
|
|
||||||
|
bool mask = (oy >= _Pad.y) && (ox >= _Pad.x) && (oy - _Pad.w < X.height) && (ox - _Pad.z < X.width);
|
||||||
|
float v = (mask)? X.Get(n, oy - _Pad.y, ox - _Pad.x, c): 0;
|
||||||
|
maxV = max(v, maxV);
|
||||||
|
}
|
||||||
|
|
||||||
|
O.Set(n, y, x, c, maxV);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
NUMTHREADS((4,8,8), (4,8,4), (4,4,4))
|
||||||
|
void AvgPool2D(uint3 dispatchThreadID : SV_DispatchThreadID)
|
||||||
|
{
|
||||||
|
DISPATCH_ARGS(O.channels, O.width, O.height);
|
||||||
|
TENSOR_ARGS2(X, O);
|
||||||
|
|
||||||
|
uint c = dispatchThreadID.x;
|
||||||
|
uint x = dispatchThreadID.y;
|
||||||
|
uint y = dispatchThreadID.z;
|
||||||
|
|
||||||
|
if (c >= O.channels) return;
|
||||||
|
if (x >= O.width) return;
|
||||||
|
if (y >= O.height) return;
|
||||||
|
|
||||||
|
for (uint n = 0; n < X.batch; ++n)
|
||||||
|
{
|
||||||
|
float acc = 0;
|
||||||
|
float counter = 0;
|
||||||
|
for (uint dy = 0; dy < _Pool.y; ++dy)
|
||||||
|
for (uint dx = 0; dx < _Pool.x; ++dx)
|
||||||
|
{
|
||||||
|
uint oy = y * _Stride.y + dy;
|
||||||
|
uint ox = x * _Stride.x + dx;
|
||||||
|
|
||||||
|
bool mask = (oy >= _Pad.y) && (ox >= _Pad.x) && (oy - _Pad.w < X.height) && (ox - _Pad.z < X.width);
|
||||||
|
acc += (mask)? X.Get(n, oy - _Pad.y, ox - _Pad.x, c): 0;
|
||||||
|
counter += (mask)? 1: 0;
|
||||||
|
}
|
||||||
|
|
||||||
|
acc /= counter;
|
||||||
|
O.Set(n, y, x, c, acc);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
NUMTHREADS((4,8,8), (4,8,4), (4,4,4))
|
||||||
|
void MaxPool2D_NoPads(uint3 dispatchThreadID : SV_DispatchThreadID)
|
||||||
|
{
|
||||||
|
DISPATCH_ARGS(O.channels, O.width, O.height);
|
||||||
|
TENSOR_ARGS2(X, O);
|
||||||
|
|
||||||
|
uint c = dispatchThreadID.x;
|
||||||
|
uint x = dispatchThreadID.y;
|
||||||
|
uint y = dispatchThreadID.z;
|
||||||
|
|
||||||
|
if (c >= O.channels) return;
|
||||||
|
if (x >= O.width) return;
|
||||||
|
if (y >= O.height) return;
|
||||||
|
|
||||||
|
for (uint n = 0; n < X.batch; ++n)
|
||||||
|
{
|
||||||
|
float maxV = -FLT_MAX;
|
||||||
|
for (uint dy = 0; dy < _Pool[1]; ++dy)
|
||||||
|
for (uint dx = 0; dx < _Pool[0]; ++dx)
|
||||||
|
{
|
||||||
|
float v = X.Get(n, y * _Stride[1] + dy, x * _Stride[0] + dx, c);
|
||||||
|
maxV = max(v, maxV);
|
||||||
|
}
|
||||||
|
|
||||||
|
O.Set(n, y, x, c, maxV);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
NUMTHREADS((4,8,8), (4,8,4), (4,4,4))
|
||||||
|
void AvgPool2D_NoPads(uint3 dispatchThreadID : SV_DispatchThreadID)
|
||||||
|
{
|
||||||
|
DISPATCH_ARGS(O.channels, O.width, O.height);
|
||||||
|
TENSOR_ARGS2(X, O);
|
||||||
|
|
||||||
|
uint c = dispatchThreadID.x;
|
||||||
|
uint x = dispatchThreadID.y;
|
||||||
|
uint y = dispatchThreadID.z;
|
||||||
|
|
||||||
|
if (c >= O.channels) return;
|
||||||
|
if (x >= O.width) return;
|
||||||
|
if (y >= O.height) return;
|
||||||
|
|
||||||
|
float invPoolSize = 1.0f / (_Pool[0] * _Pool[1]);
|
||||||
|
for (uint n = 0; n < X.batch; ++n)
|
||||||
|
{
|
||||||
|
float v = 0;
|
||||||
|
for (uint dy = 0; dy < _Pool[1]; ++dy)
|
||||||
|
for (uint dx = 0; dx < _Pool[0]; ++dx)
|
||||||
|
v += X.Get(n, y * _Stride[1] + dy, x * _Stride[0] + dx, c) * invPoolSize;
|
||||||
|
|
||||||
|
O.Set(n, y, x, c, v);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
NUMTHREADS((4,8,8), (4,8,4), (4,4,4))
|
||||||
|
//NUMTHREADS((16,4,4), (16,4,2), (16,2,2))
|
||||||
|
void MaxPool2D_Pool2x2_NoPads(uint3 dispatchThreadID : SV_DispatchThreadID)
|
||||||
|
{
|
||||||
|
DISPATCH_ARGS(O.channels, O.width, O.height);
|
||||||
|
TENSOR_ARGS2(X, O);
|
||||||
|
|
||||||
|
uint c = dispatchThreadID.x;
|
||||||
|
uint x = dispatchThreadID.y;
|
||||||
|
uint y = dispatchThreadID.z;
|
||||||
|
|
||||||
|
if (c >= O.channels) return;
|
||||||
|
if (x >= O.width) return;
|
||||||
|
if (y >= O.height) return;
|
||||||
|
|
||||||
|
for (uint n = 0; n < X.batch; ++n)
|
||||||
|
{
|
||||||
|
float v0 = X.Get(n, y*2, x*2, c);
|
||||||
|
float v1 = X.Get(n, y*2+1, x*2, c);
|
||||||
|
float v2 = X.Get(n, y*2, x*2+1, c);
|
||||||
|
float v3 = X.Get(n, y*2+1, x*2+1, c);
|
||||||
|
float v = max(v0, max(v1, max(v2, v3)));
|
||||||
|
|
||||||
|
O.Set(n, y, x, c, v);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
[numthreads(32,1,1)]
|
||||||
|
void GlobalAvgPool2D(uint3 dispatchThreadID : SV_DispatchThreadID)
|
||||||
|
{
|
||||||
|
DISPATCH_ARGS(O.channels, 1, 1);
|
||||||
|
TENSOR_ARGS2(X, O);
|
||||||
|
|
||||||
|
uint c = dispatchThreadID.x;
|
||||||
|
if (c >= O.channels) return;
|
||||||
|
//ASSERT(X.batch == O.batch)
|
||||||
|
|
||||||
|
for (uint n = 0; n < X.batch; ++n)
|
||||||
|
{
|
||||||
|
float v = 0;
|
||||||
|
for (uint y = 0; y < X.height; ++y)
|
||||||
|
for (uint x = 0; x < X.width; ++x)
|
||||||
|
v += X.Get(n, y, x, c);
|
||||||
|
|
||||||
|
v /= (X.height * X.width);
|
||||||
|
O.Set(n, 0, 0, c, v);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
[numthreads(64,1,1)]
|
||||||
|
void InstanceNorm(uint3 dispatchThreadID : SV_DispatchThreadID)
|
||||||
|
{
|
||||||
|
DISPATCH_ARGS(O.channels, 1, 1);
|
||||||
|
TENSOR_SHARED2_ARGS4(X, W, B, WBK, O);
|
||||||
|
|
||||||
|
uint c = dispatchThreadID.x;
|
||||||
|
if (c >= O.channels) return;
|
||||||
|
//ASSERT(X.shape == O.shape)
|
||||||
|
|
||||||
|
float gamma = W.Get(0, 0, 0, c);
|
||||||
|
float beta = B.Get(0, 0, 0, c);
|
||||||
|
|
||||||
|
for (uint n = 0; n < O.batch; ++n)
|
||||||
|
{
|
||||||
|
uint x, y;
|
||||||
|
// calc mean
|
||||||
|
float acc = 0;
|
||||||
|
for (y = 0; y < O.height; ++y)
|
||||||
|
for (x = 0; x < O.width; ++x)
|
||||||
|
acc += X.Get(n, y, x, c);
|
||||||
|
float mean = acc / (O.width * O.height);
|
||||||
|
|
||||||
|
// calc variance
|
||||||
|
acc = 0;
|
||||||
|
for (y = 0; y < O.height; ++y)
|
||||||
|
for (x = 0; x < O.width; ++x)
|
||||||
|
{
|
||||||
|
float delta = X.Get(n, y, x, c) - mean;
|
||||||
|
acc += delta * delta;
|
||||||
|
}
|
||||||
|
float var = acc / (O.width * O.height);
|
||||||
|
|
||||||
|
// normalization factor
|
||||||
|
float invNormFactor = 1 / sqrt(var + FLT_EPSILON);
|
||||||
|
|
||||||
|
float scale = gamma * invNormFactor;
|
||||||
|
float bias = beta - gamma * mean * invNormFactor;
|
||||||
|
|
||||||
|
// apply normalization
|
||||||
|
for (y = 0; y < O.height; ++y)
|
||||||
|
for (x = 0; x < O.width; ++x)
|
||||||
|
{
|
||||||
|
float v = X.Get(n, y, x, c);
|
||||||
|
v = v * scale + bias;
|
||||||
|
O.Set(n, y, x, c, v);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
NUMTHREADS((4,8,8), (4,8,4), (4,4,4))
|
||||||
|
void Copy(uint3 dispatchThreadID : SV_DispatchThreadID)
|
||||||
|
{
|
||||||
|
// NOTE: dispatched over X (not O)
|
||||||
|
DISPATCH_ARGS(X.channels, X.width, X.height);
|
||||||
|
TENSOR_ARGS2(X, O);
|
||||||
|
|
||||||
|
uint c = dispatchThreadID.x; uint x = dispatchThreadID.y; uint y = dispatchThreadID.z;
|
||||||
|
if (c >= X.channels) return; if (x >= X.width) return; if (y >= X.height) return;
|
||||||
|
|
||||||
|
for (uint n = 0; n < X.batch; ++n)
|
||||||
|
{
|
||||||
|
float v = X.Get(n, y, x, c);
|
||||||
|
O.Set(n + _Pad[0], y + _Pad[1], x + _Pad[2], c + _Pad[3], v);
|
||||||
|
}
|
||||||
|
}
|
|
@ -0,0 +1,9 @@
|
||||||
|
fileFormatVersion: 2
|
||||||
|
guid: 62f5efacd43b24dd38ead3ce0d80cc34
|
||||||
|
timeCreated: 1495527718
|
||||||
|
licenseType: Pro
|
||||||
|
ComputeShaderImporter:
|
||||||
|
currentAPIMask: 196608
|
||||||
|
userData:
|
||||||
|
assetBundleName:
|
||||||
|
assetBundleVariant:
|
|
@ -0,0 +1,70 @@
|
||||||
|
|
||||||
|
// Based on: https://stackoverflow.com/questions/5149544/can-i-generate-a-random-number-inside-a-pixel-shader
|
||||||
|
// Output: Random number: [0,1), that is between 0.0 and 0.999999... inclusive.
|
||||||
|
// Author: Michael Pohoreski
|
||||||
|
// Copyright: Copyleft 2012 :-)
|
||||||
|
float RandomUsingCos(float4 seed)
|
||||||
|
{
|
||||||
|
float4 K1 = float4( // Transcendental numbers:
|
||||||
|
0.64341054629, // (Cahen's constant)
|
||||||
|
23.14069263277926, // e^pi (Gelfond's constant)
|
||||||
|
2.665144142690225, // 2^sqrt(2) (Gelfond-Schneider constant)
|
||||||
|
3.14159265359 // pi
|
||||||
|
);
|
||||||
|
return frac(cos(dot(seed, K1)) * 12345.6789);
|
||||||
|
}
|
||||||
|
|
||||||
|
// Based on: https://stackoverflow.com/questions/4200224/random-noise-functions-for-glsl
|
||||||
|
// Author: Spatial
|
||||||
|
// 05 July 2013
|
||||||
|
|
||||||
|
// A single iteration of Bob Jenkins' One-At-A-Time hashing algorithm.
|
||||||
|
uint hash(uint x)
|
||||||
|
{
|
||||||
|
x += ( x << 10u );
|
||||||
|
x ^= ( x >> 6u );
|
||||||
|
x += ( x << 3u );
|
||||||
|
x ^= ( x >> 11u );
|
||||||
|
x += ( x << 15u );
|
||||||
|
return x;
|
||||||
|
}
|
||||||
|
uint hash( uint2 v ) { return hash( v.x ^ hash(v.y) ); }
|
||||||
|
uint hash( uint3 v ) { return hash( v.x ^ hash(v.y) ^ hash(v.z) ); }
|
||||||
|
uint hash( uint4 v ) { return hash( v.x ^ hash(v.y) ^ hash(v.z) ^ hash(v.w) ); }
|
||||||
|
|
||||||
|
// Construct a float with half-open range [0:1] using low 23 bits.
|
||||||
|
// All zeroes yields 0.0, all ones yields the next smallest representable value below 1.0.
|
||||||
|
float floatConstruct(uint m)
|
||||||
|
{
|
||||||
|
const uint ieeeMantissa = 0x007FFFFFu; // binary32 mantissa bitmask
|
||||||
|
const uint ieeeOne = 0x3F800000u; // 1.0 in IEEE binary32
|
||||||
|
|
||||||
|
m &= ieeeMantissa; // Keep only mantissa bits (fractional part)
|
||||||
|
m |= ieeeOne; // Add fractional part to 1.0
|
||||||
|
|
||||||
|
float f = asfloat(m); // Range [1:2]
|
||||||
|
return f - 1.0; // Range [0:1]
|
||||||
|
}
|
||||||
|
|
||||||
|
// Pseudo-random value in half-open range [0:1].
|
||||||
|
float RandomUsingHash(float4 seed)
|
||||||
|
{
|
||||||
|
return floatConstruct(hash(asuint(seed)));
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
// More alternatives:
|
||||||
|
// https://github.com/ashima/webgl-noise
|
||||||
|
// https://www.shadertoy.com/view/4djSRW
|
||||||
|
|
||||||
|
// ------------------------------------------------------------------------------------------
|
||||||
|
|
||||||
|
float Random(float4 seed)
|
||||||
|
{
|
||||||
|
return RandomUsingCos(seed);
|
||||||
|
}
|
||||||
|
|
||||||
|
float Bernoulli(float4 seed, float p)
|
||||||
|
{
|
||||||
|
return Random(seed) <= p ? 1: 0;
|
||||||
|
}
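// Usage sketch (hypothetical names, for illustration only): a per-element dropout-style mask
// could be drawn as
//   float keep = Bernoulli(float4(n, y, x, c), 1.0 - dropProbability);
// where dropProbability would be a shader constant supplied by the caller.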
|
|
@ -0,0 +1,10 @@
|
||||||
|
fileFormatVersion: 2
|
||||||
|
guid: 5a17e0b3943a74564a02a8ed0a41228b
|
||||||
|
timeCreated: 1520855309
|
||||||
|
licenseType: Pro
|
||||||
|
ShaderImporter:
|
||||||
|
externalObjects: {}
|
||||||
|
defaultTextures: []
|
||||||
|
userData:
|
||||||
|
assetBundleName:
|
||||||
|
assetBundleVariant:
|
|
@ -0,0 +1,311 @@
|
||||||
|
#define BARRACUDA_MAX_THREAD_COUNT 64
|
||||||
|
#if (BARRACUDA_MAX_THREAD_COUNT>=256)
|
||||||
|
#define NUMTHREADS(t256,t128,t64) [numthreads t256]
|
||||||
|
#define NUMTHREAD(t256, t128, t64) t256
|
||||||
|
#elif (BARRACUDA_MAX_THREAD_COUNT>=128)
|
||||||
|
#define NUMTHREADS(t256,t128,t64) [numthreads t128]
|
||||||
|
#define NUMTHREAD(t256,t128,t64) t128
|
||||||
|
#elif (BARRACUDA_MAX_THREAD_COUNT>=64)
|
||||||
|
#define NUMTHREADS(t256,t128,t64) [numthreads t64]
|
||||||
|
#define NUMTHREAD(t256,t128,t64) t64
|
||||||
|
#endif
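// Example: with BARRACUDA_MAX_THREAD_COUNT == 64 as defined above, the third variant is
// selected, so NUMTHREADS((4,8,8), (4,8,4), (4,4,4)) expands to [numthreads (4,4,4)] and
// NUMTHREAD(16,8,8) evaluates to 8.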
|
||||||
|
|
||||||
|
struct Tensor
|
||||||
|
{
|
||||||
|
// @TODO: uint no longer seems like a good idea, consider switching to int
|
||||||
|
uint batch, height, width, channels;
|
||||||
|
|
||||||
|
void Init(uint4 nhwc)
|
||||||
|
{
|
||||||
|
batch = nhwc.x;
|
||||||
|
height = nhwc.y;
|
||||||
|
width = nhwc.z;
|
||||||
|
channels = nhwc.w;
|
||||||
|
}
|
||||||
|
|
||||||
|
uint4 Dims()
|
||||||
|
{
|
||||||
|
return uint4(batch, height, width, channels);
|
||||||
|
}
|
||||||
|
uint GetFlatHeight()
|
||||||
|
{
|
||||||
|
return batch;
|
||||||
|
}
|
||||||
|
uint GetFlatWidth()
|
||||||
|
{
|
||||||
|
return height * width * channels;
|
||||||
|
}
|
||||||
|
uint GetKernelHeight()
|
||||||
|
{
|
||||||
|
// kernels storage: {kernel_width * kernel_height * kernel_channels * kernel_count}
|
||||||
|
uint kernelHeight = batch;
|
||||||
|
return kernelHeight;
|
||||||
|
}
|
||||||
|
uint GetKernelWidth()
|
||||||
|
{
|
||||||
|
// kernels storage: {kernel_width * kernel_height * kernel_channels * kernel_count}
|
||||||
|
uint kernelWidth = height;
|
||||||
|
return kernelWidth;
|
||||||
|
}
|
||||||
|
|
||||||
|
uint Index(uint b, uint h, uint w, uint ch)
|
||||||
|
{
|
||||||
|
uint index =
|
||||||
|
b * height * width * channels +
|
||||||
|
h * width * channels +
|
||||||
|
w * channels +
|
||||||
|
ch;
|
||||||
|
return index;
|
||||||
|
}
|
||||||
|
|
||||||
|
uint Index(uint b, uint i)
|
||||||
|
{
|
||||||
|
uint index =
|
||||||
|
b * height * width * channels +
|
||||||
|
i;
|
||||||
|
return index;
|
||||||
|
}
|
||||||
|
};
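// Example: Index(b,h,w,ch) flattens NHWC (channels-last) coordinates as
// ((b*height + h)*width + w)*channels + ch; for a 1x2x2x3 tensor, Index(0,1,0,2) == 1*2*3 + 2 == 8.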
|
||||||
|
|
||||||
|
struct ReadonlyTensor : Tensor
|
||||||
|
{
|
||||||
|
StructuredBuffer<float> data;
|
||||||
|
|
||||||
|
void Init(uint4 nhwc, StructuredBuffer<float> data_)
|
||||||
|
{
|
||||||
|
Tensor::Init(nhwc);
|
||||||
|
data = data_;
|
||||||
|
}
|
||||||
|
|
||||||
|
float Get(uint b, uint h, uint w, uint ch)
|
||||||
|
{
|
||||||
|
return data[Index(b,h,w,ch)];
|
||||||
|
}
|
||||||
|
float Get(uint b, uint2 pos, uint ch)
|
||||||
|
{
|
||||||
|
return data[Index(b, pos.y, pos.x, ch)];
|
||||||
|
}
|
||||||
|
float Get(uint b, uint i)
|
||||||
|
{
|
||||||
|
return data[Index(b,i)];
|
||||||
|
}
|
||||||
|
float Get(uint i)
|
||||||
|
{
|
||||||
|
return data[i];
|
||||||
|
}
|
||||||
|
|
||||||
|
float BroadcastGet(uint b, uint h, uint w, uint ch)
|
||||||
|
{
|
||||||
|
return Get(b % batch, h % height, w % width, ch % channels);
|
||||||
|
}
|
||||||
|
float BroadcastGet(uint b, uint2 pos, uint ch)
|
||||||
|
{
|
||||||
|
return BroadcastGet(b, pos.y, pos.x, ch);
|
||||||
|
}
|
||||||
|
float BroadcastGet(uint b, uint i)
|
||||||
|
{
|
||||||
|
return Get(b % GetFlatHeight(), i % GetFlatWidth());
|
||||||
|
}
|
||||||
|
|
||||||
|
float SafeGet(uint b, uint2 pos, uint ch, uint2 pad)
|
||||||
|
{
|
||||||
|
if (b >= batch || ch >= channels) return 0;
|
||||||
|
|
||||||
|
if (any(pos < pad)) return 0;
|
||||||
|
if (any(pos >= uint2(width, height) + pad)) return 0;
|
||||||
|
pos -= pad;
|
||||||
|
|
||||||
|
return data[Index(b, pos.y, pos.x, ch)];
|
||||||
|
}
|
||||||
|
float SafeGet(uint b, uint h, uint w, uint ch, uint2 pad)
|
||||||
|
{
|
||||||
|
return SafeGet(b, uint2(w, h), ch, pad);
|
||||||
|
}
|
||||||
|
float SafeGet(uint b, uint i)
|
||||||
|
{
|
||||||
|
if (b >= batch || i >= height * width * channels) return 0;
|
||||||
|
return Get(b,i);
|
||||||
|
}
|
||||||
|
float SafeGet(uint i)
|
||||||
|
{
|
||||||
|
if (i >= batch * height * width * channels) return 0;
|
||||||
|
return Get(i);
|
||||||
|
}
|
||||||
|
};
|
||||||
|
|
||||||
|
struct ReadWriteTensor : Tensor
|
||||||
|
{
|
||||||
|
RWStructuredBuffer<float> data;
|
||||||
|
|
||||||
|
void Init(int4 nhwc, RWStructuredBuffer<float> data_)
|
||||||
|
{
|
||||||
|
Tensor::Init(nhwc);
|
||||||
|
data = data_;
|
||||||
|
}
|
||||||
|
|
||||||
|
float Get(uint b, uint h, uint w, uint ch)
|
||||||
|
{
|
||||||
|
return data[Index(b,h,w,ch)];
|
||||||
|
}
|
||||||
|
float Get(uint b, uint2 pos, uint ch)
|
||||||
|
{
|
||||||
|
return data[Index(b, pos.y, pos.x, ch)];
|
||||||
|
}
|
||||||
|
float Get(uint b, uint i)
|
||||||
|
{
|
||||||
|
return data[Index(b,i)];
|
||||||
|
}
|
||||||
|
float Get(uint i)
|
||||||
|
{
|
||||||
|
return data[i];
|
||||||
|
}
|
||||||
|
|
||||||
|
float BroadcastGet(uint b, uint h, uint w, uint ch)
|
||||||
|
{
|
||||||
|
return Get(b % batch, h % height, w % width, ch % channels);
|
||||||
|
}
|
||||||
|
float BroadcastGet(uint b, uint2 pos, uint ch)
|
||||||
|
{
|
||||||
|
return BroadcastGet(b, pos.y, pos.x, ch);
|
||||||
|
}
|
||||||
|
float BroadcastGet(uint b, uint i)
|
||||||
|
{
|
||||||
|
return Get(b % GetFlatHeight(), i % GetFlatWidth());
|
||||||
|
}
|
||||||
|
|
||||||
|
float SafeGet(uint b, uint2 pos, uint ch, uint2 pad)
|
||||||
|
{
|
||||||
|
if (b >= batch || ch >= channels) return 0;
|
||||||
|
|
||||||
|
if (any(pos < pad)) return 0;
|
||||||
|
if (any(pos >= uint2(width, height) + pad)) return 0;
|
||||||
|
pos -= pad;
|
||||||
|
|
||||||
|
return Get(b, pos.y, pos.x, ch);
|
||||||
|
}
|
||||||
|
float SafeGet(uint b, uint h, uint w, uint ch, uint2 pad)
|
||||||
|
{
|
||||||
|
return SafeGet(b, uint2(w, h), ch, pad);
|
||||||
|
}
|
||||||
|
float SafeGet(uint b, uint i)
|
||||||
|
{
|
||||||
|
if (b >= batch || i >= height * width * channels) return 0;
|
||||||
|
return Get(b,i);
|
||||||
|
}
|
||||||
|
float SafeGet(uint i)
|
||||||
|
{
|
||||||
|
if (i >= batch * height * width * channels) return 0;
|
||||||
|
return Get(i);
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
void Set(uint b, uint h, uint w, uint ch, float v)
|
||||||
|
{
|
||||||
|
data[Index(b,h,w,ch)] = v;
|
||||||
|
}
|
||||||
|
void Set(uint y, uint x, float v)
|
||||||
|
{
|
||||||
|
data[Index(y,x)] = v;
|
||||||
|
}
|
||||||
|
void Set(uint i, float v)
|
||||||
|
{
|
||||||
|
data[i] = v;
|
||||||
|
}
|
||||||
|
};
|
||||||
|
|
||||||
|
struct SharedTensor : Tensor
|
||||||
|
{
|
||||||
|
StructuredBuffer<float> data;
|
||||||
|
uint offset;
|
||||||
|
|
||||||
|
void Init(uint4 nhwc, uint4 info, StructuredBuffer<float> data_)
|
||||||
|
{
|
||||||
|
Tensor::Init(nhwc);
|
||||||
|
data = data_;
|
||||||
|
offset = info.x;
|
||||||
|
}
|
||||||
|
|
||||||
|
float Get(uint b, uint h, uint w, uint ch)
|
||||||
|
{
|
||||||
|
return data[Index(b,h,w,ch) + offset];
|
||||||
|
}
|
||||||
|
float Get(uint b, uint2 pos, uint ch)
|
||||||
|
{
|
||||||
|
return Get(b, pos.y, pos.x, ch);
|
||||||
|
}
|
||||||
|
float Get(uint b, uint i)
|
||||||
|
{
|
||||||
|
return data[Index(b,i) + offset];
|
||||||
|
}
|
||||||
|
float Get(uint i)
|
||||||
|
{
|
||||||
|
return data[i + offset];
|
||||||
|
}
|
||||||
|
|
||||||
|
float BroadcastGet(uint b, uint h, uint w, uint ch)
|
||||||
|
{
|
||||||
|
return Get(b % batch, h % height, w % width, ch % channels);
|
||||||
|
}
|
||||||
|
float BroadcastGet(uint b, uint2 pos, uint ch)
|
||||||
|
{
|
||||||
|
return BroadcastGet(b, pos.y, pos.x, ch);
|
||||||
|
}
|
||||||
|
float BroadcastGet(uint b, uint i)
|
||||||
|
{
|
||||||
|
return Get(b % GetFlatHeight(), i % GetFlatWidth());
|
||||||
|
}
|
||||||
|
|
||||||
|
float SafeGet(uint b, uint2 pos, uint ch, uint2 pad)
|
||||||
|
{
|
||||||
|
if (b >= batch || ch >= channels) return 0;
|
||||||
|
|
||||||
|
if (any(pos < pad)) return 0;
|
||||||
|
if (any(pos >= uint2(width, height) + pad)) return 0;
|
||||||
|
pos -= pad;
|
||||||
|
|
||||||
|
return Get(b, pos, ch);
|
||||||
|
}
|
||||||
|
float SafeGet(uint b, uint h, uint w, uint ch, uint2 pad)
|
||||||
|
{
|
||||||
|
return SafeGet(b, uint2(w, h), ch, pad);
|
||||||
|
}
|
||||||
|
float SafeGet(uint b, uint i)
|
||||||
|
{
|
||||||
|
if (b >= batch || i >= height * width * channels) return 0;
|
||||||
|
return Get(b,i);
|
||||||
|
}
|
||||||
|
float SafeGet(uint i)
|
||||||
|
{
|
||||||
|
if (i >= batch * height * width * channels) return 0;
|
||||||
|
return Get(i);
|
||||||
|
}
|
||||||
|
};
|
||||||
|
|
||||||
|
#define TENSOR_DECL(X) uint4 X##decl[2]; StructuredBuffer<float> X##data;
|
||||||
|
#define TENSOR_DECL_RW(X) uint4 X ## decl[2]; RWStructuredBuffer<float> X ## data;
|
||||||
|
|
||||||
|
#define TENSOR_ARG(X) ReadonlyTensor X; X##.Init(X##decl[0], X##data); // readonly
|
||||||
|
#define TENSOR_MODEL(X) SharedTensor X; X##.Init(X##decl[0], X##decl[1], X##data); // RO w offset
|
||||||
|
#define TENSOR_ARG_RW(X) ReadWriteTensor X; X##.Init(X##decl[0], X##data);
|
||||||
|
|
||||||
|
#define TENSOR_ARGS2(X, O) TENSOR_ARG(X); TENSOR_ARG_RW(O);
|
||||||
|
#define TENSOR_ARGS3(X, A, O) TENSOR_ARG(X); TENSOR_MODEL(A); TENSOR_ARG_RW(O);
|
||||||
|
#define TENSOR_ARGS4(X, A, B, O) TENSOR_ARG(X); TENSOR_MODEL(A); TENSOR_MODEL(B); TENSOR_ARG_RW(O);
|
||||||
|
|
||||||
|
// shared model tensors
|
||||||
|
#define TENSOR_SHARED_MODEL(X, S) SharedTensor X; X##.Init(X##decl[0], X##decl[1], S##data);
|
||||||
|
#define TENSOR_SHARED2_ARGS4(X, A, B, S, O) TENSOR_ARG(X); TENSOR_SHARED_MODEL(A, S); TENSOR_SHARED_MODEL(B, S); TENSOR_ARG_RW(O);
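// Typical usage, as in the compute kernels above: declare buffers at file scope with
// TENSOR_DECL(X) / TENSOR_DECL_RW(O), then call TENSOR_ARGS2(X, O) (or one of the wider
// variants) at the top of the kernel to build ReadonlyTensor/ReadWriteTensor views over them.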
|
||||||
|
|
||||||
|
|
||||||
|
// purely informational - declares contract between caller of Dispatch() and kernel
|
||||||
|
#define DISPATCH_ARGS(threadGroupsX, threadGroupsY, threadGroupsZ)
|
||||||
|
|
||||||
|
|
||||||
|
// @TODO: move into more appropriate file
|
||||||
|
#define FLT_MAX 3.402823466e+38F
|
||||||
|
#define FLT_EPSILON 1e-6
|
||||||
|
|
||||||
|
float fastfma(float a, float b, float c)
|
||||||
|
{
|
||||||
|
return dot(float2(a,c), float2(b, 1));
|
||||||
|
}
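// Note: dot(float2(a,c), float2(b,1)) evaluates to a*b + c, so fastfma(a, b, c) expresses a
// multiply-add as a single dot product; e.g. fastfma(2, 3, 1) == 7.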
|
|
@ -0,0 +1,9 @@
|
||||||
|
fileFormatVersion: 2
|
||||||
|
guid: 5761abd87a16940b2a81aaa755787fc9
|
||||||
|
timeCreated: 1506540305
|
||||||
|
licenseType: Pro
|
||||||
|
ShaderImporter:
|
||||||
|
defaultTextures: []
|
||||||
|
userData:
|
||||||
|
assetBundleName:
|
||||||
|
assetBundleVariant:
|
|
@ -0,0 +1,99 @@
|
||||||
|
#pragma kernel TexConv2D
|
||||||
|
|
||||||
|
#include "Tensor.cginc"
|
||||||
|
|
||||||
|
TENSOR_DECL(X)
|
||||||
|
TENSOR_DECL(K)
|
||||||
|
TENSOR_DECL(B)
|
||||||
|
TENSOR_DECL(WBK)
|
||||||
|
TENSOR_DECL_RW(O)
|
||||||
|
|
||||||
|
uint4 _Pad;
|
||||||
|
uint4 _Stride;
|
||||||
|
|
||||||
|
struct TextureAsTensor : Tensor
|
||||||
|
{
|
||||||
|
Texture2D<float4> tex;
|
||||||
|
SamplerState smp;
|
||||||
|
|
||||||
|
Texture2DArray<float4> texArray;
|
||||||
|
SamplerState smpArray;
|
||||||
|
|
||||||
|
void Init(uint4 nhwc, Texture2D<float4> tex_, SamplerState sampler_, Texture2DArray<float4> texArray_, SamplerState samplerArray_)
|
||||||
|
{
|
||||||
|
Tensor::Init(nhwc);
|
||||||
|
tex = tex_;
|
||||||
|
smp = sampler_;
|
||||||
|
texArray = texArray_;
|
||||||
|
smpArray = samplerArray_;
|
||||||
|
}
|
||||||
|
|
||||||
|
float4 Get(uint b, uint y, uint x)
|
||||||
|
{
|
||||||
|
float3 loc = float3((float)x / (float)width, (float)y / (float)height, b);
|
||||||
|
if (batch > 1)
|
||||||
|
return texArray.SampleLevel(smpArray, loc, 0);
|
||||||
|
else
|
||||||
|
return tex.SampleLevel(smp, loc.xy, 0);
|
||||||
|
}
|
||||||
|
};
|
||||||
|
|
||||||
|
#define TENSOR_SHARED2_ARGS3(A, B, S, O) TENSOR_SHARED_ARG(A, S); TENSOR_SHARED_ARG(B, S); TENSOR_ARG_RW(O);
|
||||||
|
Texture2DArray<float4> Xtex2DArray;
|
||||||
|
Texture2D<float4> Xtex2D;
|
||||||
|
SamplerState samplerXtex2D { Filter = MIN_MAG_LINEAR_MIP_POINT; AddressU = Clamp; AddressV = Clamp; };
|
||||||
|
SamplerState samplerXtex2DArray { Filter = MIN_MAG_LINEAR_MIP_POINT; AddressU = Clamp; AddressV = Clamp; };
|
||||||
|
|
||||||
|
#define MAX_CHANNELS 4
|
||||||
|
|
||||||
|
NUMTHREADS((16,4,4), (16,4,2), (16,2,2))
|
||||||
|
void TexConv2D(uint3 dispatchThreadID : SV_DispatchThreadID)
|
||||||
|
{
|
||||||
|
// @TODO: currently it fails to compile, needs to be investigated
|
||||||
|
#if 0
|
||||||
|
DISPATCH_ARGS(K.kernelCount, O.width, O.height);
|
||||||
|
TextureAsTensor X; X.Init(Xdecl[0], Xtex2D, samplerXtex2D, Xtex2DArray, samplerXtex2DArray);
|
||||||
|
|
||||||
|
TENSOR_SHARED_ARG(K, WBK);
|
||||||
|
TENSOR_SHARED_ARG(B, WBK);
|
||||||
|
TENSOR_ARG_RW(O);
|
||||||
|
|
||||||
|
// ASSERT(X.channels <= MAX_CHANNELS)
|
||||||
|
|
||||||
|
uint k = dispatchThreadID.x;
|
||||||
|
uint x = dispatchThreadID.y;
|
||||||
|
uint y = dispatchThreadID.z;
|
||||||
|
|
||||||
|
if (k >= K.channels) return;
|
||||||
|
if (x >= O.width) return;
|
||||||
|
if (y >= O.height) return;
|
||||||
|
|
||||||
|
for (uint n = 0; n < O.batch; ++n)
|
||||||
|
{
|
||||||
|
float acc = B.Get(k);
|
||||||
|
for (uint dy = 0; dy < K.GetKernelHeight(); ++dy)
|
||||||
|
{
|
||||||
|
for (uint dx = 0; dx < K.GetKernelWidth(); ++dx)
|
||||||
|
{
|
||||||
|
uint oy = y * _Stride.y + dy;
|
||||||
|
uint ox = x * _Stride.x + dx;
|
||||||
|
|
||||||
|
// @TODO: investigate
|
||||||
|
// WARNING: had to move both y checks into the loop (as opposed to checking y in the parent loop) - due to a potential bug in the Metal compiler
|
||||||
|
if (oy < _Pad.y) continue;
|
||||||
|
if (oy - _Pad.w >= X.height) continue;
|
||||||
|
if (ox < _Pad.x) continue;
|
||||||
|
if (ox - _Pad.z >= X.width) continue;
|
||||||
|
|
||||||
|
float4 in4channels = X.Get(n, oy - _Pad.y, ox - _Pad.x);
|
||||||
|
for (uint c = 0; c < X.channels && c < MAX_CHANNELS; ++c)
|
||||||
|
{
|
||||||
|
acc += in4channels[c] * K.Get(dy, dx, c, k);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
O.Set(n, y, x, k, acc);
|
||||||
|
}
|
||||||
|
#endif
|
||||||
|
}
|
|
@ -0,0 +1,9 @@
|
||||||
|
fileFormatVersion: 2
|
||||||
|
guid: 85d38d76f835143f797bca1481285596
|
||||||
|
timeCreated: 1507637303
|
||||||
|
licenseType: Pro
|
||||||
|
ComputeShaderImporter:
|
||||||
|
currentAPIMask: 196608
|
||||||
|
userData:
|
||||||
|
assetBundleName:
|
||||||
|
assetBundleVariant:
|
|
@ -0,0 +1,6 @@
|
||||||
|
Barracuda cross-platform Neural Net engine copyright © 2018 Unity Technologies ApS
|
||||||
|
|
||||||
|
Licensed under the Unity Companion License for Unity-dependent projects--see [Unity Companion License](http://www.unity3d.com/legal/licenses/Unity_Companion_License).
|
||||||
|
|
||||||
|
Unless expressly provided otherwise, the Software under this license is made available strictly on an “AS IS” BASIS WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED. Please review the license for details on these and other terms and conditions.
|
||||||
|
|
|
@ -0,0 +1,7 @@
|
||||||
|
fileFormatVersion: 2
|
||||||
|
guid: dcc5ce8caa7664f8090ef0103a208c6e
|
||||||
|
TextScriptImporter:
|
||||||
|
externalObjects: {}
|
||||||
|
userData:
|
||||||
|
assetBundleName:
|
||||||
|
assetBundleVariant:
|
|
@ -0,0 +1,82 @@
|
||||||
|
# Release notes
|
||||||
|
|
||||||
|
## 0.1.6
|
||||||
|
- Added printing of the activation type in verbose mode
|
||||||
|
- Added fast and parallel CPU implementation for Swish, Relu, Add, Sub, Div, Min, Max, Tanh, Exp
|
||||||
|
- Removed duplicate profiler blocks for ops
|
||||||
|
- Improved scheduling on CPU for small batches of data
|
||||||
|
- Fixed compatibility with Unity 2019.2.x
|
||||||
|
|
||||||
|
## 0.1.5
|
||||||
|
- Added Transpose, MatMul and Identity layer support for models exported from ONNX.
|
||||||
|
- Added BasicLSTM layer support for models exported from TF. A limited set of LSTM networks should work now.
|
||||||
|
- Added DepthwiseConv2D layer support. Most networks based on MobileNet should work now.
|
||||||
|
- Added OneHot layer support for models exported from TF.
|
||||||
|
- Added optimized path for Conv2D, Dense and Transpose layers with single batch executions. Performance gain up to 100%.
|
||||||
|
- Fixed FMA performance issue on Metal GFX platforms.
|
||||||
|
- Added fast optimized path for Sigmoid and Mul layers on CPU.
|
||||||
|
- Fixed issue when worker is executed with different batch sizes.
|
||||||
|
- Added a ``pip`` requirements file for Python dependencies, check ``Tools/requirements.txt``.
|
||||||
|
- Added proof-of-concept Docker wrappers for running model conversion inside a Docker container. Check ``Tools/docker-tensorflow-to-barracuda.sh`` and ``Tools/docker-onnx-to-barracuda.sh``. Currently this has been tested only on a Mac host.
|
||||||
|
- Refactored model importers for easier integration with ML Agents.
|
||||||
|
- Fixed input shape determination for Keras sequential model.
|
||||||
|
- Added metadata about input shapes to model. Look for ``Model.GetShapeByName()``.
|
||||||
|
- Added API to query constant Tensors embedded into network, look for ``Model.GetTensorByName()``.
|
||||||
|
- Added reference implementations for Selu, Abs, Neg, Ceil, Floor, Clip, Rcp, Log layers.
|
||||||
|
- Added support for Mean, Square, StridedSlice and Border2D layers.
|
||||||
|
- Added support for Swish activation, now it is automatically detected in models.
|
||||||
|
- Fixed Tanh NaN issue when large argument is passed.
|
||||||
|
- RandomNormal and RandomUniform now support either an embedded shape constant OR the previous tensor's shape as input.
|
||||||
|
- Fixed Keras/TF/ONNX FusedBatchNorm/BatchNorm import and now it takes ``epsilon`` into account.
|
||||||
|
- Barracuda will now fall back to CSharpFast if compute shaders are not supported on the current platform.
|
||||||
|
- Improved compute kernel interop on Android.
|
||||||
|
- Implemented Pix2Pix model (.pict) importer.
|
||||||
|
|
||||||
|
## 0.1.4
|
||||||
|
- Implemented fast Conv2DTrans. Useful for GAN type networks.
|
||||||
|
- Fixed few ComputeBuffer handling issues.
|
||||||
|
- Simplified way to pass texture via ``Tensor`` constructor.
|
||||||
|
- Documentation improvements.
|
||||||
|
- Added Unity Companion License as part of distribution.
|
||||||
|
- Fixed boundary checks for Compute Copy/Concat operations.
|
||||||
|
- Improved profiling experience, now each layer will be reported separately in Unity Profiler.
|
||||||
|
- Fixed Broadcast layer support in ``ModelAnalyzer``.
|
||||||
|
- Exp, Pow and other layers are now also implemented in Compute. Improves RL model inference performance on GPU.
|
||||||
|
- Added platform specific BLAS plugin support. Out of the box Barracuda ships with Apple Accelerate framework support for iOS and macOS.
|
||||||
|
- Added Burst BLAS plugin, which greatly improves performance in the Unity Editor where native OS BLAS is not available. It is packaged as a separate package and requires Burst to be enabled.
|
||||||
|
- Improved memory handling, now less GC allocations should be made per inference execution.
|
||||||
|
|
||||||
|
## 0.1.3
|
||||||
|
- Improved Barracuda support for Unity Profiler.
|
||||||
|
- Cleaned up Barracuda APIs.
|
||||||
|
- Added direct ``Texture`` input support. Look for ``TextureAsTensorData``. The following types of texture supported as input: ``Texture2D``, ``Texture2DArray``, ``Texture3D``, ``RenderTexture``.
|
||||||
|
- Added ``Tensor`` to ``RenderTexture`` conversion. Look for ``TensorToRenderTexture``.
|
||||||
|
- Autoencoder type networks can run completely on GPU now. Data roundtrip via CPU is not necessary anymore.
|
||||||
|
- Vertical flip is applied when converting between ``Texture`` and ``Tensor`` to match conventions. To override this behavior, look for the ``TextureAsTensorData.Flip`` enum.
|
||||||
|
- Removed direct reference to WebCamTexture, now Barracuda compiles for Console targets.
|
||||||
|
- Fixed _Conv2DTranspose_ layer support. Now GANs using _Conv2DTranspose_ work properly.
|
||||||
|
- Added automated test for pix2pix GAN.
|
||||||
|
|
||||||
|
## 0.1.2
|
||||||
|
- Barracuda now is also available as preview package. Look for ``com.unity.barracuda`` in https://staging-packages.unity.com registry.
|
||||||
|
- Conv2D layers are now *up to 30x faster* with ``CSharpFast`` backend (``ComputeFast`` remains best backend for convolutional networks).
|
||||||
|
- Added profiler sample for ``Fetch()``.
|
||||||
|
- Fixed compilation issues on Xbox One.
|
||||||
|
- TexConv2D support was temporary disabled.
|
||||||
|
- Barracuda logging can now be configured via static fields of the ``Barracuda.D`` class; it allows disabling specific logging levels or disabling stack trace collection entirely (helps with performance when profiling).
|
||||||
|
- The Compute Concat implementation will now fall back to the C# implementation instead of throwing an exception when an unsupported configuration is encountered.
|
||||||
|
- Fixed several ``ComputeBuffer`` release issues.
|
||||||
|
- Added a constructor for ``Tensor`` that allows passing in a data array.
|
||||||
|
- Improved Flatten handling in TensorFlow models.
|
||||||
|
- Added helper func ``ModelLoader.LoadFromStreamingAssets``.
|
||||||
|
- Fixed .meta file packaging.
|
||||||
|
- Small docs improvements.
|
||||||
|
- Fixed unnecessary patching of Activation layers in ``ModelLoader``.
|
||||||
|
- Added output trimming at run-time. See the Worker factory for the extra parameters.
|
||||||
|
|
||||||
|
## 0.1.1
|
||||||
|
- First internal release as a drop-in package
|
||||||
|
- Compatibility with ML Agents models: 3DBall, PushBlock, GridWorld, Soccer.
|
||||||
|
|
||||||
|
## 0.1.0
|
||||||
|
- First internal build. Not published due to bugs encountered.
|
|
@ -0,0 +1,7 @@
|
||||||
|
fileFormatVersion: 2
|
||||||
|
guid: a129912fffc9d4ab3b5ae110be67a669
|
||||||
|
TextScriptImporter:
|
||||||
|
externalObjects: {}
|
||||||
|
userData:
|
||||||
|
assetBundleName:
|
||||||
|
assetBundleVariant:
|
|
@ -0,0 +1,8 @@
|
||||||
|
{
|
||||||
|
"name": "com.unity.barracuda",
|
||||||
|
"displayName": "Barracuda",
|
||||||
|
"version": "0.1.6-preview",
|
||||||
|
"unity": "2017.4",
|
||||||
|
"description": "Barracuda is lightweight and cross-platform Neural Net inference library. Barracuda supports inference both on GPU and CPU.",
|
||||||
|
"dependencies": {}
|
||||||
|
}
|
|
@ -0,0 +1,7 @@
fileFormatVersion: 2
guid: 73ae2d877fd444b04b5b6ef591d3fa0e
TextScriptImporter:
  externalObjects: {}
  userData:
  assetBundleName:
  assetBundleVariant:

@ -0,0 +1,8 @@
fileFormatVersion: 2
guid: a69633ced4cc74b0d9a9af7e6f27e92d
folderAsset: yes
DefaultImporter:
  externalObjects: {}
  userData:
  assetBundleName:
  assetBundleVariant:

@ -0,0 +1,3 @@
fileFormatVersion: 2
guid: 7621aa5732574c9689c6603d4f50331b
timeCreated: 1548470002
@ -0,0 +1,19 @@
using System;
using Unity.Entities;
using Unity.Mathematics;

namespace ECS_MLAgents_v0.Core
{
    /*
     * This is the Agent Component; it contains information specific to the Agent such as the
     * reward signal and the done flag.
     */
    [Serializable]
    public struct Agent : IComponentData
    {
        // TODO : Add the Agent IComponentData to the appropriate Entities before the first
        // decision pass
        public float3 Reward;
        // public bool1 Done; // TODO : bool is not blittable
    }
}
@ -0,0 +1,3 @@
fileFormatVersion: 2
guid: f701588218d34109a497b0deab92af6b
timeCreated: 1548470131
@ -0,0 +1,10 @@
using Unity.Entities;

namespace ECS_MLAgents_v0.Core
{
    /*
     * This is the ComponentDataWrapper for the Agent Component. It allows attaching an Agent
     * Component to a GameObject in the Unity Editor.
     */
    public class AgentComponent : ComponentDataWrapper<Agent> { }
}
@ -0,0 +1,3 @@
fileFormatVersion: 2
guid: 4f2a8abc5e5549439b29a0f9cbb7776b
timeCreated: 1548382152
@ -0,0 +1,228 @@
using System.Linq;
using Unity.Collections;
using Unity.Collections.LowLevel.Unsafe;
using Unity.Entities;
using Unity.Jobs;
using UnityEngine;

namespace ECS_MLAgents_v0.Core
{

    /*
     * AgentSystem<Sensor, Actuator> is a JobComponentSystem that updates the Actuator based on
     * the data present in the Sensor for all of the compatible Entities. The user can create a
     * new AgentSystem by defining a class this way:
     *
     *     public class MyAgentSystem : AgentSystem<MySensor, MyActuator> { }
     *
     * The user can modify properties of MyAgentSystem to modify which Entities will be
     * affected by MyAgentSystem.
     *
     * To access the instance of MyAgentSystem, use:
     *
     *     World.Active.GetExistingManager<MyAgentSystem>();
     *
     * It is the responsibility of the user to create and populate
     * the MySensor of each Entity as well as create and use the data in the MyActuator of each
     * Entity. MySensor and MyActuator must be IComponentData structs that only contain blittable
     * float fields.
     * Note that an Agent IComponentData must be attached to an Entity for it to be affected by
     * MyAgentSystem.
     *
     * At each call to OnUpdate, the data from the sensors of compatible entities will be
     * aggregated into a single NativeArray<float>. The AgentSystem will then process this
     * data in batch and generate a new NativeArray<float> that will be used to populate the
     * Actuator data of all compatible Entities.
     */
    public abstract class AgentSystem<TS, TA> : JobComponentSystem, IAgentSystem
        where TS : struct, IComponentData
        where TA : struct, IComponentData
    {
        private const int INITIAL_MEMORY_SIZE = 1024;
        private const int SIZE_OF_FLOAT_IN_MEMORY = 4;

        private int _sensorMemorySize = INITIAL_MEMORY_SIZE;
        private int _actuatorMemorySize = INITIAL_MEMORY_SIZE;

        public int DecisionInterval { get; set; }
        private int _phase;

        public IAgentDecision Decision { get; set; }

        private ComponentGroup _componentGroup;
        private int _sensorSize;
        private int _actuatorSize;
        // TODO : Make sure there is not extra cost for memory allocation here and when copying
        private NativeArray<float> _sensorTensor =
            new NativeArray<float>(INITIAL_MEMORY_SIZE, Allocator.Persistent);
        private NativeArray<float> _actuatorTensor =
            new NativeArray<float>(INITIAL_MEMORY_SIZE, Allocator.Persistent);

        // TODO : Decide if we want to keep at all
        private Logger _logger;

        protected override void OnCreateManager()
        {
            _logger = new Logger(GetType().Name);
            _logger.Log("OnCreateManager");
            SetNewComponentGroup();
            _sensorSize = UnsafeUtility.SizeOf<TS>();
            _actuatorSize = UnsafeUtility.SizeOf<TA>();
        }

        protected override void OnDestroyManager()
        {
            _logger.Log("OnDestroyManager");
            _sensorTensor.Dispose();
            _actuatorTensor.Dispose();
        }

        public void SetNewComponentGroup(params ComponentType[] t)
        {
            _logger.Log("UpdateComponentGroup");
            var componentTypes = t.ToList();
            componentTypes.Add(ComponentType.ReadOnly(typeof(TS)));
            componentTypes.Add(typeof(TA));
            componentTypes.Add(typeof(Agent));
            _componentGroup = GetComponentGroup(componentTypes.ToArray());
        }

        public void SetFilter<T>(T filter) where T : struct, ISharedComponentData
        {
            _componentGroup.SetFilter<T>(filter);
        }

        public void SetFilter<T0, T1>(T0 filterA, T1 filterB)
            where T0 : struct, ISharedComponentData
            where T1 : struct, ISharedComponentData
        {
            _componentGroup.SetFilter<T0, T1>(filterA, filterB);
        }

        public void ResetFilter()
        {
            _componentGroup.ResetFilter();
        }

        protected override JobHandle OnUpdate(JobHandle inputDeps)
        {
            _logger.Log("OnUpdate");

            if (_phase > 0)
            {
                _phase--;
                return inputDeps;
            }
            _phase = DecisionInterval;

            var nAgents = _componentGroup.CalculateLength();

            /*
             * If the AgentSystem is not active, if there is no Decision component on the
             * AgentSystem, or if no Entities match the ComponentGroup's requirements, the update
             * of the Actuators returns immediately.
             */
            if (Decision == null || nAgents == 0)
            {
                return inputDeps;
            }

            /*
             * If there are more agents than allowed by the memory allocation of the sensor or
             * actuator, then the size is updated to the required size.
             */
            if (nAgents * _sensorSize / SIZE_OF_FLOAT_IN_MEMORY > _sensorMemorySize)
            {
                _sensorMemorySize = nAgents * _sensorSize / SIZE_OF_FLOAT_IN_MEMORY;
                _sensorTensor.Dispose();
                _sensorTensor = new NativeArray<float>(_sensorMemorySize, Allocator.Persistent);
            }
            if (nAgents * _actuatorSize / SIZE_OF_FLOAT_IN_MEMORY > _actuatorMemorySize)
            {
                _actuatorMemorySize = nAgents * _actuatorSize / SIZE_OF_FLOAT_IN_MEMORY;
                _actuatorTensor.Dispose();
                _actuatorTensor = new NativeArray<float>(_actuatorMemorySize, Allocator.Persistent);
            }

            /*
             * Collecting the DataArray necessary for the computation
             */
            _logger.Log("On update with " + _componentGroup.CalculateLength() + " entities");
            var sensors = _componentGroup.GetComponentDataArray<TS>();
            var actuators = _componentGroup.GetComponentDataArray<TA>();
            var agents = _componentGroup.GetComponentDataArray<Agent>();
            var handle = inputDeps;

            /*
             * Copy the data from the sensors to the sensor NativeArray<float> for batch processing.
             */
            var copySensorsJob = new CopySensorsJob
            {
                Sensors = sensors,
                SensorTensor = _sensorTensor,
                SensorSize = _sensorSize
            };
            handle = copySensorsJob.Schedule(nAgents, 64, handle);

            handle.Complete();

            /*
             * The Decision is called here to populate the NativeArray<float> of Actuators.
             */
            handle = Decision.DecideBatch(ref _sensorTensor,
                ref _actuatorTensor,
                _sensorSize / SIZE_OF_FLOAT_IN_MEMORY,
                _actuatorSize / SIZE_OF_FLOAT_IN_MEMORY,
                nAgents,
                handle);

            /*
             * Copy the data from the actuator NativeArray<float> to the actuators of each entity.
             */
            var copyActuatorsJob = new CopyActuatorsJob
            {
                ActuatorTensor = _actuatorTensor,
                Actuators = actuators,
                ActuatorSize = _actuatorSize
            };

            return copyActuatorsJob.Schedule(nAgents, 64, handle);
        }

        /*
         * This IJobParallelFor copies the Sensor data into a NativeArray<float>.
         */
        // [BurstCompile]
        private struct CopySensorsJob : IJobParallelFor
        {
            [ReadOnly] public ComponentDataArray<TS> Sensors;
            public NativeArray<float> SensorTensor;
            [ReadOnly] public int SensorSize;

            public void Execute(int i)
            {
                TensorUtility.CopyToNativeArray(Sensors[i], SensorTensor, i * SensorSize);
            }
        }

        /*
         * This IJobParallelFor copies the Actuator data to the appropriate IComponentData.
         */
        // [BurstCompile]
        private struct CopyActuatorsJob : IJobParallelFor
        {

            public ComponentDataArray<TA> Actuators;
            public NativeArray<float> ActuatorTensor;
            [ReadOnly] public int ActuatorSize;

            public void Execute(int i)
            {
                var tmp = Actuators[i];
                // TODO : Make sure there is no extra cost here
                TensorUtility.CopyFromNativeArray(ActuatorTensor, out tmp, i * ActuatorSize);
                Actuators[i] = tmp;
            }
        }
    }
}
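
To make the generic pattern described in the class comment above concrete, here is a minimal sketch of what a user of this system might declare. `MySensor`, `MyActuator` and `MyAgentSystem` are hypothetical names used throughout these notes; only the "blittable float fields" rule comes from the code above.

```C#
using System;
using Unity.Entities;
using Unity.Mathematics;
using ECS_MLAgents_v0.Core;

namespace ECS_MLAgents_v0.Example
{
    // Hypothetical sensor: only blittable float-based fields, as AgentSystem requires.
    [Serializable]
    public struct MySensor : IComponentData
    {
        public float3 Position;
        public float Speed;
    }

    // Hypothetical actuator written back by the decision.
    [Serializable]
    public struct MyActuator : IComponentData
    {
        public float3 Acceleration;
    }

    // The user-defined system; all batching logic lives in AgentSystem<TS, TA>.
    public class MyAgentSystem : AgentSystem<MySensor, MyActuator> { }
}
```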
@ -0,0 +1,11 @@
fileFormatVersion: 2
guid: 0e421bb6f29cc4f90ad195a589c8782c
MonoImporter:
  externalObjects: {}
  serializedVersion: 2
  defaultReferences: []
  executionOrder: 0
  icon: {instanceID: 0}
  userData:
  assetBundleName:
  assetBundleVariant:
@ -0,0 +1,37 @@
//#define DEBUG_AGENT
#if DEBUG_AGENT
using UnityEngine;
#endif

namespace ECS_MLAgents_v0.Core
{
    /*
     * A class for debugging. The messages will only be printed when the define symbol DEBUG_AGENT
     * is on.
     */
    public class Logger
    {
        private string _prefix;

        /// <summary>
        /// Constructor for the Logger object.
        /// </summary>
        /// <param name="prefix">The prefix that will be printed at the beginning of each message
        /// logged by the Logger instance</param>
        public Logger(string prefix)
        {
            _prefix = prefix;
        }

        /// <summary>
        /// Logs the message provided as input using the UnityEngine Debug.Log call.
        /// </summary>
        /// <param name="message"></param>
        public void Log(object message)
        {
#if DEBUG_AGENT
            Debug.Log(_prefix + " : " + message);
#endif
        }
    }
}
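
A tiny usage sketch; nothing is printed unless the `//#define DEBUG_AGENT` at the top of the file is uncommented, and the prefix string here is arbitrary.

```C#
// Typically constructed with the owning system's type name, as AgentSystem does.
var logger = new ECS_MLAgents_v0.Core.Logger("MyAgentSystem");
logger.Log("Scheduling a decision pass"); // no-op unless DEBUG_AGENT is defined
```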
@ -0,0 +1,3 @@
fileFormatVersion: 2
guid: 2381447268bd4cc18e8f60043e9016b2
timeCreated: 1548538551
@ -0,0 +1,34 @@
using Unity.Collections;
using Unity.Jobs;

namespace ECS_MLAgents_v0.Core
{
    /*
     * The interface defining a Decision process by which a batch of agents updates its actuators
     * based on the information present in the sensors.
     */
    public interface IAgentDecision
    {
        /// <summary>
        /// DecideBatch updates the aggregated actuators of the agents present in the batch from
        /// the aggregated sensors.
        /// </summary>
        /// <param name="sensor">The aggregated data for the sensor information present in the
        /// batch. The sensor data is linearly arranged.</param>
        /// <param name="actuator">The aggregated data for the actuator information present in the
        /// batch. The actuator data is linearly arranged.</param>
        /// <param name="sensorSize">The number of float values present in a sensor for one agent
        /// </param>
        /// <param name="actuatorSize">The number of float values present in an actuator
        /// for one agent</param>
        /// <param name="nAgents">The number of agents present in the batch</param>
        /// <param name="handle">The JobHandle for the input dependencies.</param>
        /// <returns>The JobHandle for the output dependencies.</returns>
        JobHandle DecideBatch(ref NativeArray<float> sensor,
            ref NativeArray<float> actuator,
            int sensorSize,
            int actuatorSize,
            int nAgents,
            JobHandle handle);
    }
}
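
For reference, a minimal hand-written implementation of this interface; the `ZeroDecision` name and its do-nothing policy are purely illustrative and not part of the project.

```C#
using Unity.Collections;
using Unity.Jobs;
using ECS_MLAgents_v0.Core;

// Hypothetical heuristic decision: writes zeros into every actuator slot.
public class ZeroDecision : IAgentDecision
{
    public JobHandle DecideBatch(ref NativeArray<float> sensor,
        ref NativeArray<float> actuator,
        int sensorSize,
        int actuatorSize,
        int nAgents,
        JobHandle handle)
    {
        // Runs synchronously on the main thread; a real decision could schedule
        // jobs here and return their handle instead.
        for (var agent = 0; agent < nAgents; agent++)
        {
            for (var i = 0; i < actuatorSize; i++)
            {
                actuator[agent * actuatorSize + i] = 0f;
            }
        }
        return handle;
    }
}
```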
@ -0,0 +1,3 @@
fileFormatVersion: 2
guid: 5ae010d1ac834febb1f3e5c038cab36e
timeCreated: 1548539963
@ -0,0 +1,51 @@
using Unity.Entities;

namespace ECS_MLAgents_v0.Core
{
    public interface IAgentSystem
    {
        /// <summary>
        /// If true, the AgentSystem will update the agents.
        /// </summary>
        bool Enabled { get; set; }

        /// <summary>
        /// The IAgentDecision that will be used to update the Actuators of compatible Entities.
        /// </summary>
        IAgentDecision Decision { get; set; }

        /// <summary>
        /// This method defines which ComponentTypes are required on an Entity for it to be
        /// affected by the AgentSystem. Note: this will reset any filter previously set.
        /// </summary>
        /// <param name="t"> The ComponentTypes that are required on the Entities.</param>
        void SetNewComponentGroup(params ComponentType[] t);

        /// <summary>
        /// Allows the creation of a filter on the Entities affected by the AgentSystem.
        /// </summary>
        /// <param name="filter"> A ISharedComponentData instance used for filtering</param>
        /// <typeparam name="T"> The type of the ISharedComponentData filter</typeparam>
        void SetFilter<T>(T filter) where T : struct, ISharedComponentData;

        /// <summary>
        /// Allows the creation of a filter on the Entities affected by the AgentSystem.
        /// </summary>
        /// <param name="filterA">The first ISharedComponentData instance used for filtering
        /// </param>
        /// <param name="filterB">The second ISharedComponentData instance used for filtering
        /// </param>
        /// <typeparam name="T0">The type of the first ISharedComponentData filter</typeparam>
        /// <typeparam name="T1">The type of the second ISharedComponentData filter</typeparam>
        void SetFilter<T0, T1>(T0 filterA, T1 filterB)
            where T0 : struct, ISharedComponentData
            where T1 : struct, ISharedComponentData;

        /// <summary>
        /// Resets the filter previously set on this AgentSystem
        /// </summary>
        void ResetFilter();

        int DecisionInterval { get; set; }
    }
}
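
A sketch of driving these members at runtime, reusing the hypothetical `MyAgentSystem` and `ZeroDecision` types from the earlier sketches; only the member names come from this interface.

```C#
using Unity.Entities;
using ECS_MLAgents_v0.Core;

public static class AgentSystemSetup
{
    // Hypothetical wiring code; MyAgentSystem and ZeroDecision come from the sketches above.
    public static void Configure()
    {
        // Retrieve the system instance, as suggested in the AgentSystem documentation.
        var system = World.Active.GetExistingManager<MyAgentSystem>();

        system.Decision = new ZeroDecision();
        system.DecisionInterval = 5; // the system counts down between decision passes

        // SetFilter / ResetFilter can additionally restrict which Entities are affected
        // (see the SphereGroup example further down).
    }
}
```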
@ -0,0 +1,3 @@
fileFormatVersion: 2
guid: 1353060c81d44091b517eeb1d0ae4597
timeCreated: 1549238797

@ -0,0 +1,8 @@
fileFormatVersion: 2
guid: 1c8575b3918494070a6f53c91c03941e
folderAsset: yes
DefaultImporter:
  externalObjects: {}
  userData:
  assetBundleName:
  assetBundleVariant:

@ -0,0 +1,8 @@
fileFormatVersion: 2
guid: fb8a49bbf26a244d8bcbd4b90fcc007f
folderAsset: yes
DefaultImporter:
  externalObjects: {}
  userData:
  assetBundleName:
  assetBundleVariant:
@ -0,0 +1,28 @@
using System.IO;
using UnityEditor;
using UnityEngine;
using UnityEditor.Experimental.AssetImporters;

namespace ECS_MLAgents_v0.Core.Inference.Editor
{
    /// <summary>
    /// Asset importer for Barracuda models.
    /// </summary>
    [ScriptedImporter(1, new[] {"nn"})]
    public class NNModelImporter : ScriptedImporter {
        private const string IconPath = "Assets/ML-Agents/Resources/NNModelIcon.png";

        public override void OnImportAsset(AssetImportContext ctx)
        {
            var model = File.ReadAllBytes(ctx.assetPath);
            var asset = ScriptableObject.CreateInstance<NNModel>();
            asset.Value = model;

            Texture2D texture = (Texture2D)
                AssetDatabase.LoadAssetAtPath(IconPath, typeof(Texture2D));

            ctx.AddObjectToAsset(ctx.assetPath, asset, texture);
            ctx.SetMainObject(asset);
        }
    }
}
@ -0,0 +1,11 @@
fileFormatVersion: 2
guid: 87cd9c69c75e6491c9d014a8b05de59c
MonoImporter:
  externalObjects: {}
  serializedVersion: 2
  defaultReferences: []
  executionOrder: 0
  icon: {instanceID: 0}
  userData:
  assetBundleName:
  assetBundleVariant:
@ -0,0 +1,9 @@
namespace ECS_MLAgents_v0.Core.Inference
{
    public enum InferenceDevice
    {
        CPU = 0,
        GPU = 1
    }

}
@ -0,0 +1,3 @@
fileFormatVersion: 2
guid: 991a669f5ba0473ab89633672994543f
timeCreated: 1549237699
@ -0,0 +1,10 @@
using UnityEngine;

namespace ECS_MLAgents_v0.Core.Inference
{
    public class NNModel : ScriptableObject
    {
        [HideInInspector]
        public byte[] Value;
    }
}
@ -0,0 +1,11 @@
fileFormatVersion: 2
guid: 1d92b4646016e4d20a97c85418644c9a
MonoImporter:
  externalObjects: {}
  serializedVersion: 2
  defaultReferences: []
  executionOrder: 0
  icon: {instanceID: 0}
  userData:
  assetBundleName:
  assetBundleVariant:
@ -0,0 +1,75 @@
using Barracuda;
using ECS_MLAgents_v0.Core.Inference;
using Unity.Collections;
using Unity.Jobs;

namespace ECS_MLAgents_v0.Core
{
    /// <summary>
    /// This class uses a pretrained Neural Network model to take the decisions for a batch of
    /// agents. As such, it implements the IAgentDecision interface and requires a Barracuda Neural
    /// Network model as input during construction.
    /// </summary>
    public class NNDecision : IAgentDecision
    {
        private NNModel _model;
        public InferenceDevice inferenceDevice = InferenceDevice.CPU;
        private Model _barracudaModel;
        private IWorker _engine;
        private const bool _verbose = false;

        private float[] sensorData = new float[0];
        /// <summary>
        /// Generates a new NNDecision object that uses the model input to take a decision for
        /// the agents present in the batches.
        /// </summary>
        /// <param name="model"> The Barracuda NNModel that will be used for the decision</param>
        public NNDecision(NNModel model)
        {
            _model = model;
            D.logEnabled = _verbose;
            _engine?.Dispose();

            _barracudaModel = ModelLoader.Load(model.Value);
            var executionDevice = inferenceDevice == InferenceDevice.GPU
                ? BarracudaWorkerFactory.Type.ComputeFast
                : BarracudaWorkerFactory.Type.CSharpFast;

            _engine = BarracudaWorkerFactory.CreateWorker(
                executionDevice, _barracudaModel, _verbose);

        }

        public JobHandle DecideBatch(ref NativeArray<float> sensor,
            ref NativeArray<float> actuator,
            int sensorSize,
            int actuatorSize,
            int nAgents,
            JobHandle handle)
        {
            if (sensorData.Length < sensor.Length)
            {
                sensorData = new float[sensor.Length];
            }

            sensor.CopyTo(sensorData);
            // TODO : This is additional allocation here... need to go FASTER !
            var sensorT = new Tensor(
                new TensorShape(nAgents, sensorSize),
                sensorData,
                "sensor");

            _engine.Execute(sensorT);
            sensorT.Dispose();
            var actuatorT = _engine.Fetch("actuator");

            actuator.Slice(
                0, actuatorSize * nAgents).CopyFrom(actuatorT.data.Download(actuator.Length));
            actuatorT.Dispose();

            return handle;
        }

    }
}
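
Putting the inference pieces together, a hedged sketch of assigning a trained model to a system. The MonoBehaviour wrapper and its field name are assumptions; `NNDecision`, `NNModel` and the expectation that the network exposes a "sensor" input and an "actuator" output follow the class above.

```C#
using UnityEngine;
using Unity.Entities;
using ECS_MLAgents_v0.Core;
using ECS_MLAgents_v0.Core.Inference;

// Hypothetical bootstrap MonoBehaviour: assigns a trained .nn asset to MyAgentSystem.
public class NNDecisionBootstrap : MonoBehaviour
{
    // Drag an imported .nn asset (see NNModelImporter) onto this field in the Inspector.
    public NNModel model;

    void Start()
    {
        var system = World.Active.GetExistingManager<MyAgentSystem>();
        // The network is expected to read a "sensor" input and write an "actuator" output,
        // matching the tensor names used inside NNDecision.
        system.Decision = new NNDecision(model);
    }
}
```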
@ -0,0 +1,11 @@
fileFormatVersion: 2
guid: b10ba7c68b628465e9f5b3706682ba6e
MonoImporter:
  externalObjects: {}
  serializedVersion: 2
  defaultReferences: []
  executionOrder: 0
  icon: {instanceID: 0}
  userData:
  assetBundleName:
  assetBundleVariant:
@ -0,0 +1,110 @@
using System;
using System.Collections.Generic;
using System.Linq;
using System.Reflection;
using Unity.Collections;
using Unity.Collections.LowLevel.Unsafe;
using Unity.Mathematics;

namespace ECS_MLAgents_v0.Core
{
    /*
     * A library that uses unsafe code to copy data between structs and NativeArrays.
     */
    public static class TensorUtility
    {
        // Replace this with a set
        private static readonly List<Type> SeenTypes = new List<Type>();

        /// <summary>
        /// Copies a blittable struct of float data into a NativeArray of floats at a specific
        /// location.
        /// </summary>
        /// <param name="src"> The source struct that contains the data to be copied</param>
        /// <param name="dst"> The destination NativeArray of floats that will receive the data
        /// </param>
        /// <param name="index"> The index in the NativeArray destination at which to copy the data
        /// </param>
        /// <typeparam name="T"> The Type of the struct that will be copied.</typeparam>
        public static void CopyToNativeArray<T>(T src, NativeArray<float> dst, int index)
            where T : struct
        {
            if (!SeenTypes.Contains(typeof(T)))
            {
                DebugCheckStructure(typeof(T));
            }
            unsafe
            {
                UnsafeUtility.CopyStructureToPtr<T>(ref src, (byte*) (dst.GetUnsafePtr()) + index);
            }
        }

        /// <summary>
        /// Copies the content of a NativeArray of floats at a specific location into a blittable
        /// struct of floats.
        /// </summary>
        /// <param name="src"> The source NativeArray that contains the data to be copied.</param>
        /// <param name="dst"> The destination struct that will receive the data</param>
        /// <param name="index"> The index in the NativeArray at which the data is located.</param>
        /// <typeparam name="T"> The Type of the struct that will receive the data</typeparam>
        public static void CopyFromNativeArray<T>(NativeArray<float> src, out T dst, int index)
            where T : struct
        {
            if (!SeenTypes.Contains(typeof(T)))
            {
                DebugCheckStructure(typeof(T));
            }
            unsafe
            {
                UnsafeUtility.CopyPtrToStructure((byte*) (src.GetUnsafePtr()) + index, out dst);
            }
        }

        /// <summary>
        /// A helper method that checks if the type of a struct is supported by the library. The
        /// struct must be blittable and only contain fields of float with a valid type.
        /// </summary>
        /// <param name="t"> The Type that will be checked</param>
        /// <exception cref="NotSupportedException"> NotSupportedException will be raised if the
        /// Type t is not valid for use by the library.</exception>
        private static void DebugCheckStructure(Type t)
        {
            SeenTypes.Add(t);
            if (t.GetFields(BindingFlags.Public | BindingFlags.Instance)
                .Any(f => !IsCompatibleObservationFieldType(f.FieldType)))
            {
                throw new NotSupportedException(
                    "You are trying to add a struct as observation data which contains an " +
                    "incompatible member type. Only float and vectors are supported for " +
                    "struct observations");
            }
        }

        /// <summary>
        /// Helper method that checks if the type of a field is a compatible blittable float.
        /// </summary>
        /// <param name="t"> The Type of the field.</param>
        /// <returns> True if the Type is compatible and false otherwise.</returns>
        private static bool IsCompatibleObservationFieldType(Type t)
        {
            if (t == typeof(float))
                return true;
            if (t == typeof(float2))
                return true;
            if (t == typeof(float3))
                return true;
            if (t == typeof(float4))
                return true;
            if (t == typeof(quaternion))
                return true;
            if (t == typeof(float2x2))
                return true;
            if (t == typeof(float3x3))
                return true;
            if (t == typeof(float4x4))
                return true;
            return false;
        }
    }

}
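
A small usage sketch of the two copy helpers; the `Observation` struct, array size and offsets are illustrative. Note that `index` is a byte offset, which is why the callers in `AgentSystem` multiply by `UnsafeUtility.SizeOf`.

```C#
using Unity.Collections;
using Unity.Collections.LowLevel.Unsafe;
using Unity.Mathematics;
using ECS_MLAgents_v0.Core;

public static class TensorUtilityExample
{
    // Illustrative struct: a single float3, i.e. 12 bytes per element.
    private struct Observation
    {
        public float3 Velocity;
    }

    public static void RoundTrip()
    {
        var stride = UnsafeUtility.SizeOf<Observation>(); // byte offset per agent
        var buffer = new NativeArray<float>(16, Allocator.Temp);

        var src = new Observation { Velocity = new float3(1f, 2f, 3f) };
        TensorUtility.CopyToNativeArray(src, buffer, 1 * stride);        // write slot #1

        TensorUtility.CopyFromNativeArray(buffer, out Observation dst, 1 * stride);
        // dst.Velocity is now (1, 2, 3)

        buffer.Dispose();
    }
}
```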
@ -0,0 +1,3 @@
fileFormatVersion: 2
guid: 331744d812e64f74910ee4d5727312cd
timeCreated: 1548540468

@ -0,0 +1,3 @@
fileFormatVersion: 2
guid: 8fc552fa6c4441fa8d84ba78bfbc22d7
timeCreated: 1548439524

@ -0,0 +1,3 @@
fileFormatVersion: 2
guid: 06f334ad4721424cb426ab1ee3b705c8
timeCreated: 1548624746

@ -0,0 +1,8 @@
fileFormatVersion: 2
guid: fee4441a574c64877834ac1c7c5abfc2
folderAsset: yes
DefaultImporter:
  externalObjects: {}
  userData:
  assetBundleName:
  assetBundleVariant:
@ -0,0 +1,136 @@
%YAML 1.1
%TAG !u! tag:unity3d.com,2011:
--- !u!1 &8480657802093770681
GameObject:
  m_ObjectHideFlags: 0
  m_CorrespondingSourceObject: {fileID: 0}
  m_PrefabInstance: {fileID: 0}
  m_PrefabAsset: {fileID: 0}
  serializedVersion: 6
  m_Component:
  - component: {fileID: 7453438336369544782}
  - component: {fileID: 7329952648618696081}
  - component: {fileID: 7175794993082824140}
  - component: {fileID: 4931620525958682511}
  - component: {fileID: 6329899614523502008}
  - component: {fileID: 9208578237387885132}
  - component: {fileID: 3511010848622024065}
  m_Layer: 0
  m_Name: Sphere
  m_TagString: Untagged
  m_Icon: {fileID: 0}
  m_NavMeshLayer: 0
  m_StaticEditorFlags: 0
  m_IsActive: 1
--- !u!4 &7453438336369544782
Transform:
  m_ObjectHideFlags: 0
  m_CorrespondingSourceObject: {fileID: 0}
  m_PrefabInstance: {fileID: 0}
  m_PrefabAsset: {fileID: 0}
  m_GameObject: {fileID: 8480657802093770681}
  m_LocalRotation: {x: 0, y: 0, z: 0, w: 1}
  m_LocalPosition: {x: 0, y: 0, z: 0}
  m_LocalScale: {x: 1, y: 1, z: 1}
  m_Children: []
  m_Father: {fileID: 0}
  m_RootOrder: 0
  m_LocalEulerAnglesHint: {x: 0, y: 0, z: 0}
--- !u!114 &7329952648618696081
MonoBehaviour:
  m_ObjectHideFlags: 0
  m_CorrespondingSourceObject: {fileID: 0}
  m_PrefabInstance: {fileID: 0}
  m_PrefabAsset: {fileID: 0}
  m_GameObject: {fileID: 8480657802093770681}
  m_Enabled: 1
  m_EditorHideFlags: 0
  m_Script: {fileID: 11500000, guid: 5bf10cdea1344482e91a4f2b58506b77, type: 3}
  m_Name:
  m_EditorClassIdentifier:
--- !u!114 &7175794993082824140
MonoBehaviour:
  m_ObjectHideFlags: 0
  m_CorrespondingSourceObject: {fileID: 0}
  m_PrefabInstance: {fileID: 0}
  m_PrefabAsset: {fileID: 0}
  m_GameObject: {fileID: 8480657802093770681}
  m_Enabled: 1
  m_EditorHideFlags: 0
  m_Script: {fileID: 11500000, guid: 9b0fd4427893a4a16ba0c267dfd00217, type: 3}
  m_Name:
  m_EditorClassIdentifier:
  m_SerializedData:
    mesh: {fileID: 10207, guid: 0000000000000000e000000000000000, type: 0}
    material: {fileID: 10302, guid: 0000000000000000f000000000000000, type: 0}
    subMesh: 0
    castShadows: 0
    receiveShadows: 0
--- !u!114 &4931620525958682511
MonoBehaviour:
  m_ObjectHideFlags: 0
  m_CorrespondingSourceObject: {fileID: 0}
  m_PrefabInstance: {fileID: 0}
  m_PrefabAsset: {fileID: 0}
  m_GameObject: {fileID: 8480657802093770681}
  m_Enabled: 1
  m_EditorHideFlags: 0
  m_Script: {fileID: 11500000, guid: 0af0db853e732453799566a0e597993c, type: 3}
  m_Name:
  m_EditorClassIdentifier:
  m_SerializedData:
    Value:
      x: 0
      y: 0
      z: 0
--- !u!114 &6329899614523502008
MonoBehaviour:
  m_ObjectHideFlags: 0
  m_CorrespondingSourceObject: {fileID: 0}
  m_PrefabInstance: {fileID: 0}
  m_PrefabAsset: {fileID: 0}
  m_GameObject: {fileID: 8480657802093770681}
  m_Enabled: 1
  m_EditorHideFlags: 0
  m_Script: {fileID: 11500000, guid: 4f2a8abc5e5549439b29a0f9cbb7776b, type: 3}
  m_Name:
  m_EditorClassIdentifier:
  m_SerializedData:
    Reward:
      x: 0
      y: 0
      z: 0
--- !u!114 &9208578237387885132
MonoBehaviour:
  m_ObjectHideFlags: 0
  m_CorrespondingSourceObject: {fileID: 0}
  m_PrefabInstance: {fileID: 0}
  m_PrefabAsset: {fileID: 0}
  m_GameObject: {fileID: 8480657802093770681}
  m_Enabled: 1
  m_EditorHideFlags: 0
  m_Script: {fileID: 11500000, guid: 950225b98b4a438b843c2442cab09add, type: 3}
  m_Name:
  m_EditorClassIdentifier:
  m_SerializedData:
    Value:
      x: 0
      y: 0
      z: 0
--- !u!114 &3511010848622024065
MonoBehaviour:
  m_ObjectHideFlags: 0
  m_CorrespondingSourceObject: {fileID: 0}
  m_PrefabInstance: {fileID: 0}
  m_PrefabAsset: {fileID: 0}
  m_GameObject: {fileID: 8480657802093770681}
  m_Enabled: 1
  m_EditorHideFlags: 0
  m_Script: {fileID: 11500000, guid: d4fe5e52c44e4e2097302ddf66e78272, type: 3}
  m_Name:
  m_EditorClassIdentifier:
  m_SerializedData:
    Value:
      x: 0
      y: 0
      z: 0
@ -0,0 +1,7 @@
fileFormatVersion: 2
guid: 9002bbfd4ae214a8fb609aeacaa0de4d
PrefabImporter:
  externalObjects: {}
  userData:
  assetBundleName:
  assetBundleVariant:

@ -0,0 +1,8 @@
fileFormatVersion: 2
guid: c4e16e919f6024426bf06364075adc0b
folderAsset: yes
DefaultImporter:
  externalObjects: {}
  userData:
  assetBundleName:
  assetBundleVariant:
@ -0,0 +1,313 @@
%YAML 1.1
%TAG !u! tag:unity3d.com,2011:
--- !u!29 &1
OcclusionCullingSettings:
  m_ObjectHideFlags: 0
  serializedVersion: 2
  m_OcclusionBakeSettings:
    smallestOccluder: 5
    smallestHole: 0.25
    backfaceThreshold: 100
  m_SceneGUID: 00000000000000000000000000000000
  m_OcclusionCullingData: {fileID: 0}
--- !u!104 &2
RenderSettings:
  m_ObjectHideFlags: 0
  serializedVersion: 9
  m_Fog: 0
  m_FogColor: {r: 0.5, g: 0.5, b: 0.5, a: 1}
  m_FogMode: 3
  m_FogDensity: 0.01
  m_LinearFogStart: 0
  m_LinearFogEnd: 300
  m_AmbientSkyColor: {r: 0.212, g: 0.227, b: 0.259, a: 1}
  m_AmbientEquatorColor: {r: 0.114, g: 0.125, b: 0.133, a: 1}
  m_AmbientGroundColor: {r: 0.047, g: 0.043, b: 0.035, a: 1}
  m_AmbientIntensity: 1
  m_AmbientMode: 0
  m_SubtractiveShadowColor: {r: 0.42, g: 0.478, b: 0.627, a: 1}
  m_SkyboxMaterial: {fileID: 10304, guid: 0000000000000000f000000000000000, type: 0}
  m_HaloStrength: 0.5
  m_FlareStrength: 1
  m_FlareFadeSpeed: 3
  m_HaloTexture: {fileID: 0}
  m_SpotCookie: {fileID: 10001, guid: 0000000000000000e000000000000000, type: 0}
  m_DefaultReflectionMode: 0
  m_DefaultReflectionResolution: 128
  m_ReflectionBounces: 1
  m_ReflectionIntensity: 1
  m_CustomReflection: {fileID: 0}
  m_Sun: {fileID: 0}
  m_IndirectSpecularColor: {r: 0.44657838, g: 0.49641234, b: 0.57481676, a: 1}
  m_UseRadianceAmbientProbe: 0
--- !u!157 &3
LightmapSettings:
  m_ObjectHideFlags: 0
  serializedVersion: 11
  m_GIWorkflowMode: 0
  m_GISettings:
    serializedVersion: 2
    m_BounceScale: 1
    m_IndirectOutputScale: 1
    m_AlbedoBoost: 1
    m_EnvironmentLightingMode: 0
    m_EnableBakedLightmaps: 1
    m_EnableRealtimeLightmaps: 1
  m_LightmapEditorSettings:
    serializedVersion: 10
    m_Resolution: 2
    m_BakeResolution: 40
    m_AtlasSize: 1024
    m_AO: 0
    m_AOMaxDistance: 1
    m_CompAOExponent: 1
    m_CompAOExponentDirect: 0
    m_Padding: 2
    m_LightmapParameters: {fileID: 0}
    m_LightmapsBakeMode: 1
    m_TextureCompression: 1
    m_FinalGather: 0
    m_FinalGatherFiltering: 1
    m_FinalGatherRayCount: 256
    m_ReflectionCompression: 2
    m_MixedBakeMode: 2
    m_BakeBackend: 1
    m_PVRSampling: 1
    m_PVRDirectSampleCount: 32
    m_PVRSampleCount: 500
    m_PVRBounces: 2
    m_PVRFilterTypeDirect: 0
    m_PVRFilterTypeIndirect: 0
    m_PVRFilterTypeAO: 0
    m_PVRFilteringMode: 1
    m_PVRCulling: 1
    m_PVRFilteringGaussRadiusDirect: 1
    m_PVRFilteringGaussRadiusIndirect: 5
    m_PVRFilteringGaussRadiusAO: 2
    m_PVRFilteringAtrousPositionSigmaDirect: 0.5
    m_PVRFilteringAtrousPositionSigmaIndirect: 2
    m_PVRFilteringAtrousPositionSigmaAO: 1
    m_ShowResolutionOverlay: 1
  m_LightingDataAsset: {fileID: 0}
  m_UseShadowmask: 1
--- !u!196 &4
NavMeshSettings:
  serializedVersion: 2
  m_ObjectHideFlags: 0
  m_BuildSettings:
    serializedVersion: 2
    agentTypeID: 0
    agentRadius: 0.5
    agentHeight: 2
    agentSlope: 45
    agentClimb: 0.4
    ledgeDropHeight: 0
    maxJumpAcrossDistance: 0
    minRegionArea: 2
    manualCellSize: 0
    cellSize: 0.16666667
    manualTileSize: 0
    tileSize: 256
    accuratePlacement: 0
    debug:
      m_Flags: 0
  m_NavMeshData: {fileID: 0}
--- !u!1 &883511624
GameObject:
  m_ObjectHideFlags: 0
  m_CorrespondingSourceObject: {fileID: 0}
  m_PrefabInstance: {fileID: 0}
  m_PrefabAsset: {fileID: 0}
  serializedVersion: 6
  m_Component:
  - component: {fileID: 883511626}
  - component: {fileID: 883511625}
  m_Layer: 0
  m_Name: Manager
  m_TagString: Untagged
  m_Icon: {fileID: 0}
  m_NavMeshLayer: 0
  m_StaticEditorFlags: 0
  m_IsActive: 1
--- !u!114 &883511625
MonoBehaviour:
  m_ObjectHideFlags: 0
  m_CorrespondingSourceObject: {fileID: 0}
  m_PrefabInstance: {fileID: 0}
  m_PrefabAsset: {fileID: 0}
  m_GameObject: {fileID: 883511624}
  m_Enabled: 1
  m_EditorHideFlags: 0
  m_Script: {fileID: 11500000, guid: 8bc729803b874b188632e9d135d5ddec, type: 3}
  m_Name:
  m_EditorClassIdentifier:
  maxDistance: 2
  prefab: {fileID: 8480657802093770681, guid: 9002bbfd4ae214a8fb609aeacaa0de4d, type: 3}
  modelA: {fileID: 11400002, guid: c16aa6693b8834a58855a5592bb4f5f8, type: 3}
  modelB: {fileID: 11400002, guid: a3e419de75dc44356b527bf0b06e3b81, type: 3}
  modelC: {fileID: 11400002, guid: 8533bf952c61d430e8765c2c16cde480, type: 3}
--- !u!4 &883511626
Transform:
  m_ObjectHideFlags: 0
  m_CorrespondingSourceObject: {fileID: 0}
  m_PrefabInstance: {fileID: 0}
  m_PrefabAsset: {fileID: 0}
  m_GameObject: {fileID: 883511624}
  m_LocalRotation: {x: 0, y: 0, z: 0, w: 1}
  m_LocalPosition: {x: 0, y: 0, z: 0}
  m_LocalScale: {x: 1, y: 1, z: 1}
  m_Children: []
  m_Father: {fileID: 0}
  m_RootOrder: 2
  m_LocalEulerAnglesHint: {x: 0, y: 0, z: 0}
--- !u!1 &1006088310
GameObject:
  m_ObjectHideFlags: 0
  m_CorrespondingSourceObject: {fileID: 0}
  m_PrefabInstance: {fileID: 0}
  m_PrefabAsset: {fileID: 0}
  serializedVersion: 6
  m_Component:
  - component: {fileID: 1006088313}
  - component: {fileID: 1006088312}
  - component: {fileID: 1006088311}
  m_Layer: 0
  m_Name: Main Camera
  m_TagString: MainCamera
  m_Icon: {fileID: 0}
  m_NavMeshLayer: 0
  m_StaticEditorFlags: 0
  m_IsActive: 1
--- !u!81 &1006088311
AudioListener:
  m_ObjectHideFlags: 0
  m_CorrespondingSourceObject: {fileID: 0}
  m_PrefabInstance: {fileID: 0}
  m_PrefabAsset: {fileID: 0}
  m_GameObject: {fileID: 1006088310}
  m_Enabled: 1
--- !u!20 &1006088312
Camera:
  m_ObjectHideFlags: 0
  m_CorrespondingSourceObject: {fileID: 0}
  m_PrefabInstance: {fileID: 0}
  m_PrefabAsset: {fileID: 0}
  m_GameObject: {fileID: 1006088310}
  m_Enabled: 1
  serializedVersion: 2
  m_ClearFlags: 2
  m_BackGroundColor: {r: 0.114142045, g: 0.16881007, b: 0.254717, a: 0}
  m_projectionMatrixMode: 1
  m_SensorSize: {x: 36, y: 24}
  m_LensShift: {x: 0, y: 0}
  m_GateFitMode: 2
  m_FocalLength: 50
  m_NormalizedViewPortRect:
    serializedVersion: 2
    x: 0
    y: 0
    width: 1
    height: 1
  near clip plane: 0.3
  far clip plane: 1000
  field of view: 60
  orthographic: 0
  orthographic size: 5
  m_Depth: -1
  m_CullingMask:
    serializedVersion: 2
    m_Bits: 4294967295
  m_RenderingPath: -1
  m_TargetTexture: {fileID: 0}
  m_TargetDisplay: 0
  m_TargetEye: 3
  m_HDR: 1
  m_AllowMSAA: 1
  m_AllowDynamicResolution: 0
  m_ForceIntoRT: 0
  m_OcclusionCulling: 1
  m_StereoConvergence: 10
  m_StereoSeparation: 0.022
--- !u!4 &1006088313
Transform:
  m_ObjectHideFlags: 0
  m_CorrespondingSourceObject: {fileID: 0}
  m_PrefabInstance: {fileID: 0}
  m_PrefabAsset: {fileID: 0}
  m_GameObject: {fileID: 1006088310}
  m_LocalRotation: {x: 0.2588191, y: 0, z: 0, w: 0.9659258}
  m_LocalPosition: {x: 0, y: 200, z: -153}
  m_LocalScale: {x: 1, y: 1, z: 1}
  m_Children: []
  m_Father: {fileID: 0}
  m_RootOrder: 0
  m_LocalEulerAnglesHint: {x: 30, y: 0, z: 0}
--- !u!1 &1246396923
GameObject:
  m_ObjectHideFlags: 0
  m_CorrespondingSourceObject: {fileID: 0}
  m_PrefabInstance: {fileID: 0}
  m_PrefabAsset: {fileID: 0}
  serializedVersion: 6
  m_Component:
  - component: {fileID: 1246396925}
  - component: {fileID: 1246396924}
  m_Layer: 0
  m_Name: Directional Light
  m_TagString: Untagged
  m_Icon: {fileID: 0}
  m_NavMeshLayer: 0
  m_StaticEditorFlags: 0
  m_IsActive: 1
--- !u!108 &1246396924
Light:
  m_ObjectHideFlags: 0
  m_CorrespondingSourceObject: {fileID: 0}
  m_PrefabInstance: {fileID: 0}
  m_PrefabAsset: {fileID: 0}
  m_GameObject: {fileID: 1246396923}
  m_Enabled: 1
  serializedVersion: 8
  m_Type: 1
  m_Color: {r: 1, g: 0.95686275, b: 0.8392157, a: 1}
  m_Intensity: 1
  m_Range: 10
  m_SpotAngle: 30
  m_CookieSize: 10
  m_Shadows:
    m_Type: 2
    m_Resolution: -1
    m_CustomResolution: -1
    m_Strength: 1
    m_Bias: 0.05
    m_NormalBias: 0.4
    m_NearPlane: 0.2
  m_Cookie: {fileID: 0}
  m_DrawHalo: 0
  m_Flare: {fileID: 0}
  m_RenderMode: 0
  m_CullingMask:
    serializedVersion: 2
    m_Bits: 4294967295
  m_Lightmapping: 4
  m_LightShadowCasterMode: 0
  m_AreaSize: {x: 1, y: 1}
  m_BounceIntensity: 1
  m_ColorTemperature: 6570
  m_UseColorTemperature: 0
  m_ShadowRadius: 0
  m_ShadowAngle: 0
--- !u!4 &1246396925
Transform:
  m_ObjectHideFlags: 0
  m_CorrespondingSourceObject: {fileID: 0}
  m_PrefabInstance: {fileID: 0}
  m_PrefabAsset: {fileID: 0}
  m_GameObject: {fileID: 1246396923}
  m_LocalRotation: {x: 0.40821788, y: -0.23456968, z: 0.10938163, w: 0.8754261}
  m_LocalPosition: {x: 0, y: 3, z: 0}
  m_LocalScale: {x: 1, y: 1, z: 1}
  m_Children: []
  m_Father: {fileID: 0}
  m_RootOrder: 1
  m_LocalEulerAnglesHint: {x: 50, y: -30, z: 0}
@ -0,0 +1,7 @@
fileFormatVersion: 2
guid: 09d90853262f045e48c12ea3ba572f70
DefaultImporter:
  externalObjects: {}
  userData:
  assetBundleName:
  assetBundleVariant:

@ -0,0 +1,8 @@
fileFormatVersion: 2
guid: b6feaae123a73448f81a42f015ea41b7
folderAsset: yes
DefaultImporter:
  externalObjects: {}
  userData:
  assetBundleName:
  assetBundleVariant:
@ -0,0 +1,20 @@
using System;
using Unity.Entities;
using Unity.Mathematics;

namespace ECS_MLAgents_v0.Example.SpaceMagic.Scripts
{
    /// <summary>
    /// This component will represent the acceleration of the spheres
    /// </summary>
    [Serializable]
    public struct Acceleration : IComponentData
    {
        public float3 Value;
    }

    /// <summary>
    /// This wrapper only allows us to add this IComponentData as a Component to the sphere prefab
    /// </summary>
    public class AccelerationComponent : ComponentDataWrapper<Acceleration> { }
}
@ -0,0 +1,3 @@
fileFormatVersion: 2
guid: d4fe5e52c44e4e2097302ddf66e78272
timeCreated: 1548624826
@ -0,0 +1,16 @@
using System;
using Unity.Entities;

namespace ECS_MLAgents_v0.Example.SpaceMagic.Scripts
{
    /// <summary>
    /// This ISharedComponentData is used to assign each sphere to a different group that will
    /// use a different IAgentSystem for its decision making.
    /// </summary>
    [Serializable]
    public struct SphereGroup : ISharedComponentData
    {
        public int Group;
    }

}
Some files were not shown because too many files changed in this diff.