Merge pull request #1 from skaarthik/master

initial commit from CSharpSpark repo with contributions from @skaarthik and @xiongrenyi
skaarthik 2015-10-29 15:40:06 -07:00
Parent 5366a4055f 77819657e4
Commit ec117b1cb9
83 changed files: 12193 additions and 1 deletion

37
.gitignore vendored Normal file

@@ -0,0 +1,37 @@
# User files #
###################
*.suo
*.user
*.csproj.user
*.iml
# Compiled source #
###################
*.class
*.dll
*.exe
# Packages #
############
# it's better to unpack these files and commit the raw source
# git has its own built in compression methods
*.gz
*.jar
*.tar
*.nupkg
# Folders #
############
*/target/**
*/bin/**
*/obj/**
csharp/packages
csharp/Adapter/Microsoft.Spark.CSharp/bin/**
csharp/Adapter/Microsoft.Spark.CSharp/obj/**
csharp/AdapterTest/bin/**
csharp/AdapterTest/obj/**
csharp/Samples/Microsoft.Spark.CSharp/bin/**
csharp/Samples/Microsoft.Spark.CSharp/obj/**
csharp/Worker/Microsoft.Spark.CSharp/bin/**
csharp/Worker/Microsoft.Spark.CSharp/obj/**
scala/.idea/**

21
LICENSE Normal file

@@ -0,0 +1,21 @@
The MIT License (MIT)
Copyright (c) 2015 Microsoft
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.


@@ -1,2 +1,94 @@
# SparkCLR
C# language binding and extensions to Apache Spark
SparkCLR adds a C# language binding to Apache Spark, enabling Spark driver code and data processing operations to be implemented in C#.
For example, the word count sample in Apache Spark can be implemented in C# as follows:
```
var lines = sparkContext.TextFile(@"hdfs://path/to/input.txt");
var words = lines.FlatMap(s => s.Split(new[] { " " }, StringSplitOptions.None));
var wordCounts = words.Map(w => new KeyValuePair<string, int>(w.Trim(), 1))
.ReduceByKey((x, y) => x + y);
var wordCountCollection = wordCounts.Collect();
wordCounts.SaveAsTextFile(@"hdfs://path/to/wordcount.txt");
```
A simple DataFrame application using TempTable may look like the following:
```
var requestsDataFrame = sqlContext.TextFile(@"hdfs://path/to/requests.csv");
var metricsDataFrame = sqlContext.TextFile(@"hdfs://path/to/metrics.csv");
requestsDataFrame.RegisterTempTable("requests");
metricsDataFrame.RegisterTempTable("metrics");
//C0 - guid in requests DF, C3 - guid in metrics DF
var join = sqlContext.Sql(
"SELECT joinedtable.datacenter, max(joinedtable.latency) maxlatency, avg(joinedtable.latency) avglatency " +
"FROM (SELECT a.C1 as datacenter, b.C6 as latency from requests a JOIN metrics b ON a.C0 = b.C3) joinedtable " +
"GROUP BY datacenter");
join.ShowSchema();
join.Show();
```
A simple DataFrame application using DataFrame DSL may look like the following:
```
//C0 - guid, C1 - datacenter
var requestsDataFrame = sqlContext.TextFile(@"hdfs://path/to/requests.csv")
.Select("C0", "C1");
//C3 - guid, C6 - latency
var metricsDataFrame = sqlContext.TextFile(@"hdfs://path/to/metrics.csv", ",", false, true)
.Select("C3", "C6"); //override delimiter, hasHeader & inferSchema
var joinDataFrame = requestsDataFrame.Join(metricsDataFrame, requestsDataFrame["C0"] == metricsDataFrame["C3"])
.GroupBy("C1");
var maxLatencyByDcDataFrame = joinDataFrame.Agg(new Dictionary<string, string> { { "C6", "max" } });
maxLatencyByDcDataFrame.ShowSchema();
maxLatencyByDcDataFrame.Show();
```
Refer to the SparkCLR\csharp\Samples folder for complete samples.
## Building SparkCLR
### Prerequisites
* Maven for the spark-clr project implemented in Scala
* MSBuild for C# projects
### Instructions
* Navigate to the SparkCLR\scala folder and run ```mvn package```. This builds spark-clr*.jar
* Navigate to the SparkCLR\csharp folder and run the following commands to build the rest of the .NET binaries
```set PATH=%PATH%;c:\Windows\Microsoft.NET\Framework\v4.0.30319``` (if MSBuild is not already in the path)
```msbuild SparkCLR.sln```
## Running Samples
### Prerequisites
Set the following environment variables:
* ```JAVA_HOME```
* ```SCALA_HOME```
* ```SPARKCLR_HOME```
* ```SPARKCSV_JARS``` (if the sqlContext.TextFile method is used to create DataFrames from CSV files)
The folder pointed to by SPARKCLR_HOME should contain the following folders and files:
* lib (spark-clr*.jar)
* bin (Microsoft.Spark.CSharp.Adapter.dll, CSharpWorker.exe)
* scripts (sparkclr-submit.cmd)
* samples (SparkCLRSamples.exe, Microsoft.Spark.CSharp.Adapter.dll, CSharpWorker.exe)
* data (all the data files used by samples)
### Running in Local mode
Set ```CSharpWorkerPath``` in SparkCLRSamples.exe.config and run the following:
```sparkclr-submit.cmd --verbose D:\SparkCLRHome\lib\spark-clr-1.4.1-SNAPSHOT.jar D:\SparkCLRHome\SparkCLRSamples.exe spark.local.dir D:\temp\SparkCLRTemp sparkclr.sampledata.loc D:\SparkCLRHome\data```
Setting the spark.local.dir parameter is optional. It is useful when the local Spark setup uses the Windows %TEMP% folder, where placing the SparkCLR driver executable may cause problems (antivirus programs might automatically delete executables placed in such folders).
### Running in Standalone cluster mode
```sparkclr-submit.cmd --verbose D:\SparkCLRHome\lib\spark-clr-1.4.1-SNAPSHOT.jar D:\SparkCLRHome\SparkCLRSamples.exe sparkclr.sampledata.loc hdfs://path/to/sparkclr/sampledata```
### Running in YARN mode
//TODO
## Running Unit Tests
## Debugging Tips
The CSharpBackend and the C# driver are launched separately when debugging the SparkCLR Adapter or driver code.
For example, to debug the SparkCLR samples:
* Launch CSharpBackend using ```sparkclr-submit.cmd debug``` and note the port number displayed in the console
* Navigate to csharp/Samples/Microsoft.Spark.CSharp and edit App.Config to set CSharpBackendPortNumber to the port number from the previous step, and also set the CSharpWorkerPath config
* Run SparkCLRSamples.exe in Visual Studio
## License
SparkCLR is licensed under the MIT license. See LICENSE file in the project root for full license information.


@@ -0,0 +1,117 @@
<?xml version="1.0" encoding="utf-8"?>
<Project ToolsVersion="12.0" DefaultTargets="Build" xmlns="http://schemas.microsoft.com/developer/msbuild/2003">
<Import Project="$(MSBuildExtensionsPath)\$(MSBuildToolsVersion)\Microsoft.Common.props" Condition="Exists('$(MSBuildExtensionsPath)\$(MSBuildToolsVersion)\Microsoft.Common.props')" />
<PropertyGroup>
<Configuration Condition=" '$(Configuration)' == '' ">Debug</Configuration>
<Platform Condition=" '$(Platform)' == '' ">AnyCPU</Platform>
<ProjectGuid>{CE999A96-F42B-4E80-B208-709D7F49A77C}</ProjectGuid>
<OutputType>Library</OutputType>
<AppDesignerFolder>Properties</AppDesignerFolder>
<RootNamespace>Microsoft.Spark.CSharp</RootNamespace>
<AssemblyName>Microsoft.Spark.CSharp.Adapter</AssemblyName>
<TargetFrameworkVersion>v4.5</TargetFrameworkVersion>
<FileAlignment>512</FileAlignment>
</PropertyGroup>
<PropertyGroup Condition=" '$(Configuration)|$(Platform)' == 'Debug|AnyCPU' ">
<PlatformTarget>AnyCPU</PlatformTarget>
<DebugSymbols>true</DebugSymbols>
<DebugType>full</DebugType>
<Optimize>false</Optimize>
<OutputPath>bin\Debug\</OutputPath>
<DefineConstants>DEBUG;TRACE</DefineConstants>
<ErrorReport>prompt</ErrorReport>
<WarningLevel>4</WarningLevel>
<Prefer32Bit>false</Prefer32Bit>
</PropertyGroup>
<PropertyGroup Condition=" '$(Configuration)|$(Platform)' == 'Release|AnyCPU' ">
<PlatformTarget>AnyCPU</PlatformTarget>
<DebugType>pdbonly</DebugType>
<Optimize>true</Optimize>
<OutputPath>bin\Release\</OutputPath>
<DefineConstants>TRACE</DefineConstants>
<ErrorReport>prompt</ErrorReport>
<WarningLevel>4</WarningLevel>
</PropertyGroup>
<PropertyGroup>
<StartupObject />
</PropertyGroup>
<ItemGroup>
<Reference Include="System" />
<Reference Include="System.Configuration" />
<Reference Include="System.Core" />
<Reference Include="System.Runtime.Serialization" />
<Reference Include="System.Xml.Linq" />
<Reference Include="System.Data.DataSetExtensions" />
<Reference Include="Microsoft.CSharp" />
<Reference Include="System.Data" />
<Reference Include="System.Xml" />
</ItemGroup>
<ItemGroup>
<Compile Include="Configuration\ConfigurationService.cs" />
<Compile Include="Configuration\IConfigurationService.cs" />
<Compile Include="Core\Accumulator.cs" />
<Compile Include="Core\Broadcast.cs" />
<Compile Include="Core\DoubleRDDFunctions.cs" />
<Compile Include="Core\OrderedRDDFunctions.cs" />
<Compile Include="Core\PairRDDFunctions.cs" />
<Compile Include="Core\PipelinedRDD.cs" />
<Compile Include="Core\Profiler.cs" />
<Compile Include="Core\RDD.cs" />
<Compile Include="Core\SparkConf.cs" />
<Compile Include="Core\SparkContext.cs" />
<Compile Include="Core\StatCounter.cs" />
<Compile Include="Core\StatusTracker.cs" />
<Compile Include="Core\StorageLevel.cs" />
<Compile Include="Interop\SparkCLREnvironment.cs" />
<Compile Include="Interop\Ipc\SparkCLRSocket.cs" />
<Compile Include="Interop\Ipc\ISparkCLRSocket.cs" />
<Compile Include="Interop\Ipc\IJvmBridge.cs" />
<Compile Include="Interop\Ipc\JvmBridge.cs" />
<Compile Include="Interop\Ipc\JvmObjectReference.cs" />
<Compile Include="Interop\Ipc\PayloadHelper.cs" />
<Compile Include="Interop\Ipc\SerDe.cs" />
<Compile Include="Proxy\IDataFrameProxy.cs" />
<Compile Include="Proxy\Ipc\DataFrameIpcProxy.cs" />
<Compile Include="Proxy\Ipc\RDDIpcProxy.cs" />
<Compile Include="Proxy\Ipc\SqlContextIpcProxy.cs" />
<Compile Include="Proxy\Ipc\StatusTrackerIpcProxy.cs" />
<Compile Include="Proxy\Ipc\StructIpcProxy.cs" />
<Compile Include="Proxy\IRDDProxy.cs" />
<Compile Include="Proxy\ISparkConfProxy.cs" />
<Compile Include="Proxy\ISparkContextProxy.cs" />
<Compile Include="Proxy\Ipc\SparkConfIpcProxy.cs" />
<Compile Include="Proxy\ISqlContextProxy.cs" />
<Compile Include="Proxy\IStatusTrackerProxy.cs" />
<Compile Include="Proxy\IStructProxy.cs" />
<Compile Include="Proxy\Ipc\SparkContextIpcProxy.cs" />
<Compile Include="Services\DefaultLoggerService.cs" />
<Compile Include="Services\ILoggerService.cs" />
<Compile Include="Services\Log4NetLoggerService.cs" />
<Compile Include="Services\LoggerServiceFactory.cs" />
<Compile Include="Sql\DataFrame.cs" />
<Compile Include="Sql\SqlContext.cs" />
<Compile Include="Sql\Struct.cs" />
<Compile Include="Streaming\DStream.cs" />
<Compile Include="Streaming\Kafka.cs" />
<Compile Include="Streaming\StreamingContext.cs" />
</ItemGroup>
<ItemGroup>
<Folder Include="Properties\" />
</ItemGroup>
<Import Project="$(MSBuildToolsPath)\Microsoft.CSharp.targets" />
<PropertyGroup>
<PostBuildEvent>
</PostBuildEvent>
</PropertyGroup>
<PropertyGroup>
<PreBuildEvent>
</PreBuildEvent>
</PropertyGroup>
<!-- To modify your build process, add your task inside one of the targets below and uncomment it.
Other similar extension points exist, see Microsoft.Common.targets.
<Target Name="BeforeBuild">
</Target>
<Target Name="AfterBuild">
</Target>
-->
</Project>


@@ -0,0 +1,187 @@
// Copyright (c) Microsoft. All rights reserved.
// Licensed under the MIT license. See LICENSE file in the project root for full license information.
using System;
using System.Collections.Generic;
using System.Configuration;
using System.IO;
using System.Linq;
using System.Reflection;
using System.Text;
using System.Threading.Tasks;
using Microsoft.Spark.CSharp.Core;
using Microsoft.Spark.CSharp.Services;
namespace Microsoft.Spark.CSharp.Configuration
{
/// <summary>
/// Implementation of the configuration service that provides the config settings
/// used in the SparkCLR runtime
/// </summary>
internal class ConfigurationService : IConfigurationService
{
private SparkCLRConfiguration configuration;
private RunMode runMode; //not used anywhere for now but may come handy in the future
public int BackendPortNumber
{
get
{
return configuration.GetPortNumber();
}
}
internal ConfigurationService()
{
var appConfig = ConfigurationManager.OpenExeConfiguration(Assembly.GetEntryAssembly().Location);
var sparkMaster = Environment.GetEnvironmentVariable("spark.master"); //set by CSharpRunner when launching driver process
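// Pick the configuration flavor based on spark.master: unset => debug, "local*" => local,
// "spark://" => standalone cluster; YARN and any other master values are not supported yet.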
if (sparkMaster == null)
{
configuration = new SparkCLRDebugConfiguration(appConfig);
runMode = RunMode.DEBUG;
}
else if (sparkMaster.StartsWith("local"))
{
configuration = new SparkCLRLocalConfiguration(appConfig);
runMode = RunMode.LOCAL;
}
else if (sparkMaster.StartsWith("spark://"))
{
configuration = new SparkCLRConfiguration(appConfig);
runMode = RunMode.CLUSTER;
}
else if (sparkMaster.StartsWith("yarn"))
{
throw new NotSupportedException("YARN is not currently supported");
}
else
{
throw new NotSupportedException(string.Format("Spark master value {0} not recognized", sparkMaster));
}
}
public string GetCSharpRDDExternalProcessName()
{
return configuration.GetCSharpRDDExternalProcessName();
}
public string GetCSharpWorkerPath()
{
return configuration.GetCSharpWorkerPath();
}
public IEnumerable<string> GetDriverFiles()
{
return configuration.GetDriverFiles();
}
/// <summary>
/// Default configuration for SparkCLR jobs.
/// Works with Standalone cluster mode
/// May work with YARN or Mesos - needs validation when adding support for YARN/Mesos
/// </summary>
private class SparkCLRConfiguration
{
protected AppSettingsSection appSettings;
private string sparkCLRHome = Environment.GetEnvironmentVariable("SPARKCLR_HOME"); //set by sparkclr-submit.cmd
protected ILoggerService logger = LoggerServiceFactory.GetLogger(typeof(SparkCLRConfiguration));
internal SparkCLRConfiguration(System.Configuration.Configuration configuration)
{
appSettings = configuration.AppSettings;
}
internal virtual int GetPortNumber()
{
int portNo;
if (!int.TryParse(Environment.GetEnvironmentVariable("CSHARPBACKEND_PORT"), out portNo))
{
throw new Exception("Environment variable CSHARPBACKEND_PORT not set");
}
logger.LogInfo("CSharpBackend successfully read from environment variable CSHARPBACKEND_PORT");
return portNo;
}
internal virtual string GetCSharpRDDExternalProcessName()
{
//SparkCLR jar and driver, worker & dependencies are shipped using the Spark file server. These files are available in the Spark execution directory on the executor
return "CSharpWorker.exe";
}
internal virtual string GetCSharpWorkerPath()
{
return new Uri(GetSparkCLRArtifactsPath("bin", "CSharpWorker.exe")).ToString();
}
//this works for Standalone cluster //TODO fix for YARN support
internal virtual IEnumerable<string> GetDriverFiles()
{
var driverFolder = Path.GetDirectoryName(Assembly.GetEntryAssembly().Location);
var files = Directory.EnumerateFiles(driverFolder);
return files.Select(s => new Uri(s).ToString());
}
private string GetSparkCLRArtifactsPath(string sparkCLRSubFolderName, string fileName)
{
var filePath = Path.Combine(sparkCLRHome, sparkCLRSubFolderName, fileName);
if (!File.Exists(filePath))
{
throw new Exception(string.Format("Path {0} not exists", filePath));
}
return filePath;
}
}
/// <summary>
/// Configuration for SparkCLR jobs in ** Local ** mode
/// Needs some investigation to find out why Local mode behaves
/// differently from standalone cluster mode for the configuration values
/// overridden here
/// </summary>
private class SparkCLRLocalConfiguration : SparkCLRConfiguration
{
internal SparkCLRLocalConfiguration(System.Configuration.Configuration configuration)
: base(configuration)
{}
internal override string GetCSharpRDDExternalProcessName()
{
return appSettings.Settings["CSharpWorkerPath"].Value;
}
internal override string GetCSharpWorkerPath()
{
return new Uri(appSettings.Settings["CSharpWorkerPath"].Value).ToString();
}
}
/// <summary>
/// Configuration for debug mode
/// This configuration exists only to make SparkCLR development & debugging easier
/// </summary>
private class SparkCLRDebugConfiguration : SparkCLRLocalConfiguration
{
internal SparkCLRDebugConfiguration(System.Configuration.Configuration configuration)
: base(configuration)
{}
internal override int GetPortNumber()
{
var cSharpBackendPortNumber = int.Parse(appSettings.Settings["CSharpBackendPortNumber"].Value);
logger.LogInfo(string.Format("CSharpBackend port number read from app config {0}", cSharpBackendPortNumber));
return cSharpBackendPortNumber;
}
}
}
public enum RunMode
{
DEBUG, //not a Spark mode but exists for dev debugging purpose
LOCAL,
CLUSTER,
//following are not currently supported
YARN,
MESOS
}
}


@@ -0,0 +1,22 @@
// Copyright (c) Microsoft. All rights reserved.
// Licensed under the MIT license. See LICENSE file in the project root for full license information.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
namespace Microsoft.Spark.CSharp.Configuration
{
/// <summary>
/// Helps getting config settings to be used in SparkCLR runtime
/// </summary>
internal interface IConfigurationService
{
int BackendPortNumber { get; }
string GetCSharpRDDExternalProcessName();
string GetCSharpWorkerPath();
IEnumerable<string> GetDriverFiles();
}
}


@@ -0,0 +1,201 @@
// Copyright (c) Microsoft. All rights reserved.
// Licensed under the MIT license. See LICENSE file in the project root for full license information.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading;
using System.Threading.Tasks;
using System.Net;
using System.Net.Sockets;
using System.IO;
using System.Runtime.Serialization;
using System.Runtime.Serialization.Formatters.Binary;
using System.Reflection;
using Microsoft.Spark.CSharp.Interop.Ipc;
using Microsoft.Spark.CSharp.Services;
namespace Microsoft.Spark.CSharp.Core
{
/// <summary>
/// A shared variable that can be accumulated, i.e., has a commutative and associative "add"
/// operation. Worker tasks on a Spark cluster can add values to an Accumulator with the +=
/// operator, but only the driver program is allowed to access its value, using Value.
/// Updates from the workers get propagated automatically to the driver program.
///
/// While SparkContext supports accumulators for primitive data types like int and
/// float, users can also define accumulators for custom types by providing a custom
/// AccumulatorParam object.
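///
/// A hypothetical driver-side sketch (it assumes a SparkContext.Accumulator factory and an
/// RDD.Foreach action similar to PySpark's; neither is defined in this file):
///
/// var accum = sparkContext.Accumulator(0); // assumed factory method
/// rdd.Foreach(x => accum += 1); // workers add via the += operator defined below
/// var total = accum.Value; // Value may only be read on the driver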
/// </summary>
[Serializable]
public class Accumulator
{
public static Dictionary<int, Accumulator> accumulatorRegistry = new Dictionary<int, Accumulator>();
protected int accumulatorId;
[NonSerialized]
protected bool deserialized = true;
}
[Serializable]
public class Accumulator<T> : Accumulator
{
[NonSerialized]
private T value;
private AccumulatorParam<T> accumulatorParam = new AccumulatorParam<T>();
internal Accumulator(int accumulatorId, T value)
{
this.value = value;
deserialized = false;
accumulatorRegistry[accumulatorId] = this;
}
public T Value
{
// Get the accumulator's value; only usable in driver program
get
{
if (deserialized)
{
throw new ArgumentException("Accumulator.value cannot be accessed inside tasks");
}
return value;
}
// Sets the accumulator's value; only usable in driver program
set
{
if (deserialized)
{
throw new ArgumentException("Accumulator.value cannot be accessed inside tasks");
}
this.value = value;
}
}
/// <summary>
/// Adds a term to this accumulator's value
/// </summary>
/// <param name="term"></param>
/// <returns></returns>
public void Add(T term)
{
value = accumulatorParam.AddInPlace(value, term);
}
/// <summary>
/// The += operator; adds a term to this accumulator's value
/// </summary>
/// <param name="self"></param>
/// <param name="term"></param>
/// <returns></returns>
public static Accumulator<T> operator +(Accumulator<T> self, T term)
{
if (!accumulatorRegistry.ContainsKey(self.accumulatorId))
{
accumulatorRegistry[self.accumulatorId] = self;
}
self.Add(term);
return self;
}
public override string ToString()
{
return string.Format("Accumulator<id={0}, value={1}>", accumulatorId, value);
}
}
/// <summary>
/// An AccumulatorParam that uses the + operator to add values. Designed for simple types
/// such as integers, floats, and lists. Requires the zero value for the underlying type
/// as a parameter.
/// </summary>
/// <typeparam name="T"></typeparam>
[Serializable]
internal class AccumulatorParam<T>
{
/// <summary>
/// Provide a "zero value" for the type
/// </summary>
/// <param name="value"></param>
/// <returns></returns>
internal T Zero(T value)
{
return default(T);
}
/// <summary>
/// Add two values of the accumulator's data type, returning a new value;
/// </summary>
/// <param name="value1"></param>
/// <param name="value2"></param>
/// <returns></returns>
internal T AddInPlace(T value1, T value2)
{
dynamic d1 = value1, d2 = value2;
d1 += d2;
return d1;
}
}
/// <summary>
/// A simple TCP server that intercepts shutdown() in order to interrupt
/// our continuous polling on the handler.
/// </summary>
internal class AccumulatorServer : System.Net.Sockets.TcpListener
{
private ILoggerService logger = LoggerServiceFactory.GetLogger(typeof(AccumulatorServer));
private bool serverShutdown;
internal AccumulatorServer(string host)
: base(Dns.GetHostAddresses(host).First(a => a.AddressFamily == AddressFamily.InterNetwork), 0)
{
}
internal void Shutdown()
{
serverShutdown = true;
base.Stop();
}
internal int StartUpdateServer()
{
base.Start();
Task.Run(() =>
{
try
{
IFormatter formatter = new BinaryFormatter();
using (Socket s = AcceptSocket())
using (var ns = new NetworkStream(s))
using (var br = new BinaryReader(ns))
using (var bw = new BinaryWriter(ns))
{
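// Wire format handled below: read a 4-byte count of accumulator updates, then for each update
// read a 4-byte length followed by a binary-serialized KeyValuePair<int, dynamic> of
// (accumulatorId, value); apply it to the registered accumulator via reflection and
// acknowledge the batch with a single byte.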
while (!serverShutdown)
{
int numUpdates = SerDe.Convert(br.ReadInt32());
for (int i = 0; i < numUpdates; i++)
{
var ms = new MemoryStream(br.ReadBytes(SerDe.Convert(br.ReadInt32())));
KeyValuePair<int, dynamic> update = (KeyValuePair<int, dynamic>)formatter.Deserialize(ms);
Accumulator accumulator = Accumulator.accumulatorRegistry[update.Key];
accumulator.GetType().GetMethod("Add").Invoke(accumulator, new object[] { update.Value });
}
bw.Write((byte)1); // acknowledge byte other than -1
bw.Flush();
Thread.Sleep(1000);
}
}
}
catch (Exception e)
{
logger.LogError(e.ToString());
throw;
}
});
return (base.LocalEndpoint as IPEndPoint).Port;
}
}
}


@@ -0,0 +1,110 @@
// Copyright (c) Microsoft. All rights reserved.
// Licensed under the MIT license. See LICENSE file in the project root for full license information.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.IO;
using System.Runtime.Serialization.Formatters.Binary;
using Microsoft.Spark.CSharp.Proxy;
namespace Microsoft.Spark.CSharp.Core
{
/// <summary>
/// A broadcast variable created with SparkContext.Broadcast().
/// Access its value through Value.
///
/// var b = sc.Broadcast(new int[] {1, 2, 3, 4, 5})
/// b.Value
/// [1, 2, 3, 4, 5]
/// sc.Parallelize(new int[] {0, 0}).FlatMap(x => b.Value).Collect()
/// [1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
/// b.Unpersist()
///
/// </summary>
[Serializable]
public class Broadcast
{
[NonSerialized]
public static Dictionary<long, Broadcast> broadcastRegistry = new Dictionary<long, Broadcast>();
[NonSerialized]
internal string broadcastObjId;
[NonSerialized]
internal string path;
internal long broadcastId;
internal Broadcast() { }
public Broadcast(string path)
{
this.path = path;
}
internal static void DumpBroadcast<T>(T value, string path)
{
var formatter = new BinaryFormatter();
using (FileStream fs = new FileStream(path, FileMode.Open, FileAccess.Write))
{
formatter.Serialize(fs, value);
}
}
internal static T LoadBroadcast<T>(string path)
{
var formatter = new BinaryFormatter();
using (FileStream fs = new FileStream(path, FileMode.Open, FileAccess.Read))
{
return (T)formatter.Deserialize(fs);
}
}
}
[Serializable]
public class Broadcast<T> : Broadcast
{
[NonSerialized]
internal SparkContext sparkContext;
[NonSerialized]
private T value;
internal Broadcast(SparkContext sparkContext, T value)
{
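// Serialize the value to a temp file and register that file with the JVM side, which
// creates the Spark broadcast object; the numeric broadcast id comes back via the out parameter.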
this.sparkContext = sparkContext;
this.value = value;
path = Path.GetTempFileName();
DumpBroadcast<T>(value, path);
broadcastObjId = sparkContext.SparkContextProxy.ReadBroadcastFromFile(path, out broadcastId);
}
/// <summary>
/// Return the broadcasted value
/// </summary>
public T Value
{
get
{
if (value == null)
{
if (broadcastRegistry.ContainsKey(broadcastId))
value = LoadBroadcast<T>(broadcastRegistry[broadcastId].path);
else
throw new ArgumentException(string.Format("Attempted to use broadcast id {0} after it was destroyed.", broadcastId));
}
return value;
}
}
/// <summary>
/// Delete cached copies of this broadcast on the executors.
/// </summary>
/// <param name="blocking"></param>
public void Unpersist(bool blocking = false)
{
if (broadcastObjId == null)
throw new ArgumentException("Broadcast can only be unpersisted in driver");
sparkContext.SparkContextProxy.UnpersistBroadcast(broadcastObjId, blocking);
sparkContext.broadcastVars.Remove(this);
File.Delete(path);
}
}
}


@@ -0,0 +1,149 @@
// Copyright (c) Microsoft. All rights reserved.
// Licensed under the MIT license. See LICENSE file in the project root for full license information.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
namespace Microsoft.Spark.CSharp.Core
{
public static class DoubleRDDFunctions
{
/// <summary>
/// Add up the elements in this RDD.
///
/// sc.Parallelize(new double[] {1.0, 2.0, 3.0}).Sum()
/// 6.0
///
/// </summary>
/// <param name="self"></param>
/// <returns></returns>
public static double Sum(this RDD<double> self)
{
return self.Fold(0.0, (x, y) => x + y);
}
/// <summary>
/// Return a StatCounter object that captures the mean, variance
/// and count of the RDD's elements in one operation.
/// </summary>
/// <param name="self"></param>
/// <returns></returns>
public static StatCounter Stats(this RDD<double> self)
{
return self.MapPartitions(iter => new List<StatCounter> { new StatCounter(iter) }).Reduce((l, r) => l.Merge(r));
}
/// <summary>
/// Compute a histogram using the provided buckets. The buckets
/// are all open to the right except for the last which is closed.
/// e.g. [1,10,20,50] means the buckets are [1,10) [10,20) [20,50],
/// which means 1<=x<10, 10<=x<20, 20<=x<=50. And on the input of 1
/// and 50 we would have a histogram of 1,0,1.
///
/// If your histogram is evenly spaced (e.g. [0, 10, 20, 30]),
/// this can be switched from an O(log n) insertion to O(1) per
/// element (where n = # buckets).
///
/// Buckets must be sorted, must not contain any duplicates, and must
/// have at least two elements.
///
/// If `buckets` is a number, it will generate buckets which are
/// evenly spaced between the minimum and maximum of the RDD. For
/// example, if the min value is 0 and the max is 100, given buckets
/// as 2, the resulting buckets will be [0,50) [50,100]. buckets must
/// be at least 1. If the RDD contains infinity or NaN, an exception is thrown.
/// If the elements in the RDD do not vary (max == min), a single bucket
/// is always returned.
///
/// It will return a tuple of buckets and histogram.
///
/// >>> rdd = sc.parallelize(range(51))
/// >>> rdd.histogram(2)
/// ([0, 25, 50], [25, 26])
/// >>> rdd.histogram([0, 5, 25, 50])
/// ([0, 5, 25, 50], [5, 20, 26])
/// >>> rdd.histogram([0, 15, 30, 45, 60]) # evenly spaced buckets
/// ([0, 15, 30, 45, 60], [15, 15, 15, 6])
/// >>> rdd = sc.parallelize(["ab", "ac", "b", "bd", "ef"])
/// >>> rdd.histogram(("a", "b", "c"))
/// (('a', 'b', 'c'), [2, 2])
///
/// </summary>
/// <param name="self"></param>
/// <returns></returns>
public static Tuple<double[], long[]> Histogram(this RDD<double> self, int bucketCount)
{
throw new NotImplementedException();
}
/// <summary>
/// Compute the mean of this RDD's elements.
/// sc.Parallelize(new double[]{1, 2, 3}).Mean()
/// 2.0
/// </summary>
/// <param name="self"></param>
/// <returns></returns>
public static double Mean(this RDD<double> self)
{
return self.Stats().Mean;
}
/// <summary>
/// Compute the variance of this RDD's elements.
/// sc.Parallelize(new double[]{1, 2, 3}).Variance()
/// 0.666...
/// </summary>
/// <param name="self"></param>
/// <returns></returns>
public static double Variance(this RDD<double> self)
{
return self.Stats().Variance;
}
/// <summary>
/// Compute the standard deviation of this RDD's elements.
/// sc.Parallelize(new double[]{1, 2, 3}).Stdev()
/// 0.816...
/// </summary>
/// <param name="self"></param>
/// <returns></returns>
public static double Stdev(this RDD<double> self)
{
return self.Stats().Stdev;
}
/// <summary>
/// Compute the sample standard deviation of this RDD's elements (which
/// corrects for bias in estimating the standard deviation by dividing by
/// N-1 instead of N).
///
/// sc.Parallelize(new double[]{1, 2, 3}).SampleStdev()
/// 1.0
///
/// </summary>
/// <param name="self"></param>
/// <returns></returns>
public static double SampleStdev(this RDD<double> self)
{
return self.Stats().SampleStdev;
}
/// <summary>
/// Compute the sample variance of this RDD's elements (which corrects
/// for bias in estimating the variance by dividing by N-1 instead of N).
///
/// sc.Parallelize(new double[]{1, 2, 3}).SampleVariance()
/// 1.0
///
/// </summary>
/// <param name="self"></param>
/// <returns></returns>
public static double SampleVariance(this RDD<double> self)
{
return self.Stats().SampleVariance;
}
}
}


@@ -0,0 +1,76 @@
// Copyright (c) Microsoft. All rights reserved.
// Licensed under the MIT license. See LICENSE file in the project root for full license information.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
namespace Microsoft.Spark.CSharp.Core
{
public static class OrderedRDDFunctions
{
/// <summary>
/// Sort the RDD by key, so that each partition contains a sorted range of the elements. Calling
/// `collect` or `save` on the resulting RDD will return or output an ordered list of records
/// (in the `save` case, they will be written to multiple `part-X` files in the filesystem, in
/// order of the keys).
///
/// >>> tmp = [('a', 1), ('b', 2), ('1', 3), ('d', 4), ('2', 5)]
/// >>> sc.parallelize(tmp).sortByKey().first()
/// ('1', 3)
/// >>> sc.parallelize(tmp).sortByKey(True, 1).collect()
/// [('1', 3), ('2', 5), ('a', 1), ('b', 2), ('d', 4)]
/// >>> sc.parallelize(tmp).sortByKey(True, 2).collect()
/// [('1', 3), ('2', 5), ('a', 1), ('b', 2), ('d', 4)]
/// >>> tmp2 = [('Mary', 1), ('had', 2), ('a', 3), ('little', 4), ('lamb', 5)]
/// >>> tmp2.extend([('whose', 6), ('fleece', 7), ('was', 8), ('white', 9)])
/// >>> sc.parallelize(tmp2).sortByKey(True, 3, keyfunc=lambda k: k.lower()).collect()
/// [('a', 3), ('fleece', 7), ('had', 2), ('lamb', 5),...('white', 9), ('whose', 6)]
///
/// </summary>
/// <typeparam name="K"></typeparam>
/// <typeparam name="V"></typeparam>
/// <param name="self"></param>
/// <param name="ascending"></param>
/// <param name="numPartitions"></param>
/// <returns></returns>
public static RDD<KeyValuePair<K, V>> SortByKey<K, V>(
this RDD<KeyValuePair<K, V>> self,
bool ascending = true,
int? numPartitions = null)
{
throw new NotImplementedException();
}
/// <summary>
/// Repartition the RDD according to the given partitioner and, within each resulting partition,
/// sort records by their keys.
///
/// This is more efficient than calling `repartition` and then sorting within each partition
/// because it can push the sorting down into the shuffle machinery.
///
/// >>> rdd = sc.parallelize([(0, 5), (3, 8), (2, 6), (0, 8), (3, 8), (1, 3)])
/// >>> rdd2 = rdd.repartitionAndSortWithinPartitions(2, lambda x: x % 2, 2)
/// >>> rdd2.glom().collect()
/// [[(0, 5), (0, 8), (2, 6)], [(1, 3), (3, 8), (3, 8)]]
///
/// </summary>
/// <typeparam name="K"></typeparam>
/// <typeparam name="V"></typeparam>
/// <param name="self"></param>
/// <param name="numPartitions"></param>
/// <param name="partitionFunc"></param>
/// <param name="ascending"></param>
/// <returns></returns>
public static RDD<KeyValuePair<K, V>> repartitionAndSortWithinPartitions<K, V>(
this RDD<KeyValuePair<K, V>> self,
int? numPartitions = null,
Func<K, int> partitionFunc = null,
bool ascending = true)
{
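// Note: this initial implementation only sorts within the existing partitions;
// numPartitions and partitionFunc are not applied yet.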
return self.MapPartitions<KeyValuePair<K, V>>(iter => ascending ? iter.OrderBy(kv => kv.Key) : iter.OrderByDescending(kv => kv.Key));
}
}
}


@@ -0,0 +1,911 @@
// Copyright (c) Microsoft. All rights reserved.
// Licensed under the MIT license. See LICENSE file in the project root for full license information.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Runtime.Serialization;
using System.Runtime.Serialization.Formatters.Binary;
using System.Reflection;
using System.IO;
using System.Security.Cryptography;
using Microsoft.Spark.CSharp.Interop;
namespace Microsoft.Spark.CSharp.Core
{
/// <summary>
/// operations only available to KeyValuePair RDD
///
/// See also http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions
/// </summary>
public static class PairRDDFunctions
{
/// <summary>
/// Return the key-value pairs in this RDD to the master as a dictionary.
///
/// var m = sc.Parallelize(new[] { new KeyValuePair<int, int>(1, 2), new KeyValuePair<int, int>(3, 4) }, 1).CollectAsMap()
/// m[1]
/// 2
/// m[3]
/// 4
///
/// </summary>
/// <typeparam name="K"></typeparam>
/// <typeparam name="V"></typeparam>
/// <param name="self"></param>
/// <returns></returns>
public static Dictionary<K, V> CollectAsMap<K, V>(this RDD<KeyValuePair<K, V>> self)
{
return self.Collect().ToDictionary(kv => kv.Key, kv => kv.Value);
}
/// <summary>
/// Return an RDD with the keys of each tuple.
///
/// >>> m = sc.Parallelize(new[] { new KeyValuePair<int, int>(1, 2), new KeyValuePair<int, int>(3, 4) }, 1).Keys().Collect()
/// [1, 3]
/// </summary>
/// <typeparam name="K"></typeparam>
/// <typeparam name="V"></typeparam>
/// <param name="self"></param>
/// <returns></returns>
public static RDD<K> Keys<K, V>(this RDD<KeyValuePair<K, V>> self)
{
return self.Map<K>(kv => kv.Key);
}
/// <summary>
/// Return an RDD with the values of each tuple.
///
/// >>> m = sc.Parallelize(new[] { new KeyValuePair<int, int>(1, 2), new KeyValuePair<int, int>(3, 4) }, 1).Values().Collect()
/// [2, 4]
///
/// </summary>
/// <typeparam name="K"></typeparam>
/// <typeparam name="V"></typeparam>
/// <param name="self"></param>
/// <returns></returns>
public static RDD<V> Values<K, V>(this RDD<KeyValuePair<K, V>> self)
{
return self.Map<V>(kv => kv.Value);
}
/// <summary>
/// Merge the values for each key using an associative reduce function.
///
/// This will also perform the merging locally on each mapper before
/// sending results to a reducer, similarly to a "combiner" in MapReduce.
///
/// Output will be hash-partitioned with C{numPartitions} partitions, or
/// the default parallelism level if C{numPartitions} is not specified.
///
/// sc.Parallelize(new[]
/// {
/// new KeyValuePair<string, int>("a", 1),
/// new KeyValuePair<string, int>("b", 1),
/// new KeyValuePair<string, int>("a", 1)
/// }, 2)
/// .ReduceByKey((x, y) => x + y).Collect()
///
/// [('a', 2), ('b', 1)]
///
/// </summary>
/// <typeparam name="K"></typeparam>
/// <typeparam name="V"></typeparam>
/// <param name="self"></param>
/// <param name="reduceFunc"></param>
/// <returns></returns>
public static RDD<KeyValuePair<K, V>> ReduceByKey<K, V>(this RDD<KeyValuePair<K, V>> self, Func<V, V, V> reduceFunc, int numPartitions = 0)
{
return CombineByKey(self, () => default(V), reduceFunc, reduceFunc, numPartitions);
}
/// <summary>
/// Merge the values for each key using an associative reduce function, but
/// return the results immediately to the master as a dictionary.
///
/// This will also perform the merging locally on each mapper before
/// sending results to a reducer, similarly to a "combiner" in MapReduce.
///
/// sc.Parallelize(new[]
/// {
/// new KeyValuePair<string, int>("a", 1),
/// new KeyValuePair<string, int>("b", 1),
/// new KeyValuePair<string, int>("a", 1)
/// }, 2)
/// .ReduceByKeyLocally((x, y) => x + y)
///
/// [('a', 2), ('b', 1)]
///
/// </summary>
/// <typeparam name="K"></typeparam>
/// <typeparam name="V"></typeparam>
/// <param name="self"></param>
/// <param name="reduceFunc"></param>
/// <returns></returns>
public static Dictionary<K, V> ReduceByKeyLocally<K, V>(this RDD<KeyValuePair<K, V>> self, Func<V, V, V> reduceFunc)
{
return ReduceByKey(self, reduceFunc).CollectAsMap();
}
/// <summary>
/// Count the number of elements for each key, and return the result to the master as a dictionary.
///
/// sc.Parallelize(new[]
/// {
/// new KeyValuePair<string, int>("a", 1),
/// new KeyValuePair<string, int>("b", 1),
/// new KeyValuePair<string, int>("a", 1)
/// }, 2)
/// .CountByKey()
///
/// [('a', 2), ('b', 1)]
///
/// </summary>
/// <typeparam name="K"></typeparam>
/// <typeparam name="V"></typeparam>
/// <param name="self"></param>
/// <returns></returns>
public static Dictionary<K, long> CountByKey<K, V>(this RDD<KeyValuePair<K, V>> self)
{
return self.MapValues(v => 1L).ReduceByKey((a, b) => a + b).CollectAsMap();
}
/// <summary>
/// Return an RDD containing all pairs of elements with matching keys in C{self} and C{other}.
///
/// Each pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in C{self} and (k, v2) is in C{other}.
///
/// Performs a hash join across the cluster.
///
/// var l = sc.Parallelize(
/// new[] { new KeyValuePair<string, int>("a", 1), new KeyValuePair<string, int>("b", 4) }, 1);
/// var r = sc.Parallelize(
/// new[] { new KeyValuePair<string, int>("a", 2), new KeyValuePair<string, int>("a", 3) }, 1);
/// var m = l.Join(r, 2).Collect();
///
/// [('a', (1, 2)), ('a', (1, 3))]
///
/// </summary>
/// <typeparam name="K"></typeparam>
/// <typeparam name="V"></typeparam>
/// <typeparam name="W"></typeparam>
/// <param name="self"></param>
/// <param name="other"></param>
/// <param name="numPartitions"></param>
/// <returns></returns>
public static RDD<KeyValuePair<K, Tuple<V, W>>> Join<K, V, W>(
this RDD<KeyValuePair<K, V>> self,
RDD<KeyValuePair<K, W>> other,
int numPartitions = 0)
{
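// Implemented as a cogroup (GroupWith) followed by a per-key cross product of the two value lists.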
return self.GroupWith(other, numPartitions).FlatMapValues(
input => input.Item1.SelectMany(v => input.Item2.Select(w => new Tuple<V, W>(v, w)))
);
}
/// <summary>
/// Perform a left outer join of C{self} and C{other}.
///
/// For each element (k, v) in C{self}, the resulting RDD will either
/// contain all pairs (k, (v, w)) for w in C{other}, or the pair
/// (k, (v, None)) if no elements in C{other} have key k.
///
/// Hash-partitions the resulting RDD into the given number of partitions.
///
/// var l = sc.Parallelize(
/// new[] { new KeyValuePair<string, int>("a", 1), new KeyValuePair<string, int>("b", 4) }, 1);
/// var r = sc.Parallelize(
/// new[] { new KeyValuePair<string, int>("a", 2) }, 1);
/// var m = l.LeftOuterJoin(r).Collect();
///
/// [('a', (1, 2)), ('b', (4, None))]
/// </summary>
/// <typeparam name="K"></typeparam>
/// <typeparam name="V"></typeparam>
/// <typeparam name="W"></typeparam>
/// <param name="self"></param>
/// <param name="other"></param>
/// <param name="numPartitions"></param>
/// <returns></returns>
public static RDD<KeyValuePair<K, Tuple<V, W>>> LeftOuterJoin<K, V, W>(
this RDD<KeyValuePair<K, V>> self,
RDD<KeyValuePair<K, W>> other,
int numPartitions = 0)
{
return self.GroupWith(other, numPartitions).FlatMapValues(
input => input.Item1.SelectMany(v => input.Item2.DefaultIfEmpty().Select(w => new Tuple<V, W>(v, w)))
);
}
/// <summary>
/// Perform a right outer join of C{self} and C{other}.
///
/// For each element (k, w) in C{other}, the resulting RDD will either
/// contain all pairs (k, (v, w)) for v in this, or the pair (k, (None, w))
/// if no elements in C{self} have key k.
///
/// Hash-partitions the resulting RDD into the given number of partitions.
///
/// var l = sc.Parallelize(
/// new[] { new KeyValuePair<string, int>("a", 2) }, 1);
/// var r = sc.Parallelize(
/// new[] { new KeyValuePair<string, int>("a", 1), new KeyValuePair<string, int>("b", 4) }, 1);
/// var m = l.RightOuterJoin(r).Collect();
///
/// [('a', (2, 1)), ('b', (None, 4))]
/// </summary>
/// <typeparam name="K"></typeparam>
/// <typeparam name="V"></typeparam>
/// <typeparam name="W"></typeparam>
/// <param name="self"></param>
/// <param name="other"></param>
/// <param name="numPartitions"></param>
/// <returns></returns>
public static RDD<KeyValuePair<K, Tuple<V, W>>> RightOuterJoin<K, V, W>(
this RDD<KeyValuePair<K, V>> self,
RDD<KeyValuePair<K, W>> other,
int numPartitions = 0)
{
return self.GroupWith(other, numPartitions).FlatMapValues(
input => input.Item1.DefaultIfEmpty().SelectMany(v => input.Item2.Select(w => new Tuple<V, W>(v, w)))
);
}
/// <summary>
/// Perform a full outer join of C{self} and C{other}.
///
/// For each element (k, v) in C{self}, the resulting RDD will either
/// contain all pairs (k, (v, w)) for w in C{other}, or the pair
/// (k, (v, None)) if no elements in C{other} have key k.
///
/// Similarly, for each element (k, w) in C{other}, the resulting RDD will
/// either contain all pairs (k, (v, w)) for v in C{self}, or the pair
/// (k, (None, w)) if no elements in C{self} have key k.
///
/// Hash-partitions the resulting RDD into the given number of partitions.
///
/// var l = sc.Parallelize(
/// new[] { new KeyValuePair<string, int>("a", 1), new KeyValuePair<string, int>("b", 4) }, 1);
/// var r = sc.Parallelize(
/// new[] { new KeyValuePair<string, int>("a", 2), new KeyValuePair<string, int>("c", 8) }, 1);
/// var m = l.FullOuterJoin(r).Collect();
///
/// [('a', (1, 2)), ('b', (4, None)), ('c', (None, 8))]
///
/// </summary>
/// <typeparam name="K"></typeparam>
/// <typeparam name="V"></typeparam>
/// <typeparam name="W"></typeparam>
/// <param name="self"></param>
/// <param name="other"></param>
/// <param name="numPartitions"></param>
/// <returns></returns>
public static RDD<KeyValuePair<K, Tuple<V, W>>> FullOuterJoin<K, V, W>(
this RDD<KeyValuePair<K, V>> self,
RDD<KeyValuePair<K, W>> other,
int numPartitions = 0)
{
return self.GroupWith(other, numPartitions).FlatMapValues(
input => input.Item1.DefaultIfEmpty().SelectMany(v => input.Item2.DefaultIfEmpty().Select(w => new Tuple<V, W>(v, w)))
);
}
/// <summary>
/// Return a copy of the RDD partitioned using the specified partitioner.
///
/// sc.Parallelize(new[] { 1, 2, 3, 4, 2, 4, 1 }, 1).Map(x => new KeyValuePair<int, int>(x, x)).PartitionBy(3).Glom().Collect()
/// </summary>
/// <param name="numPartitions"></param>
/// <returns></returns>
public static RDD<KeyValuePair<K, V>> PartitionBy<K, V>(this RDD<KeyValuePair<K, V>> self, int numPartitions = 0)
{
if (numPartitions == 0)
{
numPartitions = SparkCLREnvironment.SparkConfProxy.GetInt("spark.default.parallelism", 0);
if (numPartitions == 0)
numPartitions = self.previousRddProxy.PartitionLength();
}
var keyed = self.MapPartitionsWithIndex(new AddShuffleKeyHelper<K, V>().Execute, true);
keyed.bypassSerializer = true;
// convert shuffling version of RDD[(Long, Array[Byte])] back to normal RDD[Array[Byte]]
// invoking property keyed.RddProxy marks the end of current pipeline RDD after shuffling
// and potentially starts next pipeline RDD with default SerializedMode.Byte
var rdd = self.SparkContext.SparkContextProxy.CreatePairwiseRDD<K, V>(keyed.RddProxy, numPartitions);
//rdd.partitioner = partitioner
return new RDD<KeyValuePair<K, V>>(rdd, self.SparkContext);
}
/// <summary>
/// # TODO: add control over map-side aggregation
/// Generic function to combine the elements for each key using a custom
/// set of aggregation functions.
///
/// Turns an RDD[(K, V)] into a result of type RDD[(K, C)], for a "combined
/// type" C. Note that V and C can be different -- for example, one might
/// group an RDD of type (Int, Int) into an RDD of type (Int, List[Int]).
///
/// Users provide three functions:
///
/// - createCombiner, which creates an empty combiner (e.g., an empty list)
/// - mergeValue, to merge a V into a C (e.g., adds it to the end of
/// a list)
/// - mergeCombiners, to combine two C's into a single one.
///
/// In addition, users can control the partitioning of the output RDD.
///
/// sc.Parallelize(
/// new[]
/// {
/// new KeyValuePair<string, int>("a", 1),
/// new KeyValuePair<string, int>("b", 1),
/// new KeyValuePair<string, int>("a", 1)
/// }, 2)
/// .CombineByKey(() => string.Empty, (x, y) => x + y.ToString(), (x, y) => x + y).Collect()
///
/// [('a', '11'), ('b', '1')]
/// </summary>
/// <typeparam name="K"></typeparam>
/// <typeparam name="V"></typeparam>
/// <param name="self"></param>
/// <param name="numPartitions"></param>
/// <returns></returns>
public static RDD<KeyValuePair<K, C>> CombineByKey<K, V, C>(
this RDD<KeyValuePair<K, V>> self,
Func<C> createCombiner,
Func<C, V, C> mergeValue,
Func<C, C, C> mergeCombiners,
int numPartitions = 0)
{
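// Combine locally within each partition first, shuffle by key (PartitionBy),
// then merge the per-partition combiners on the reduce side.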
if (numPartitions == 0)
{
numPartitions = SparkCLREnvironment.SparkConfProxy.GetInt("spark.default.parallelism", 0);
if (numPartitions == 0 && self.previousRddProxy != null)
numPartitions = self.previousRddProxy.PartitionLength();
}
var locallyCombined = self.MapPartitions(new GroupByCombineHelper<K, V, C>(createCombiner, mergeValue).Execute, true);
var shuffled = locallyCombined.PartitionBy(numPartitions);
return shuffled.MapPartitions(new GroupByMergeHelper<K, C>(mergeCombiners).Execute, true);
}
/// <summary>
/// Aggregate the values of each key, using given combine functions and a neutral
/// "zero value". This function can return a different result type, U, than the type
/// of the values in this RDD, V. Thus, we need one operation for merging a V into
/// a U and one operation for merging two U's. The former operation is used for merging
/// values within a partition, and the latter is used for merging values between
/// partitions. To avoid memory allocation, both of these functions are
/// allowed to modify and return their first argument instead of creating a new U.
///
/// sc.Parallelize(
/// new[]
/// {
/// new KeyValuePair<string, int>("a", 1),
/// new KeyValuePair<string, int>("b", 1),
/// new KeyValuePair<string, int>("a", 1)
/// }, 2)
/// .AggregateByKey(() => 0, (x, y) => x + y, (x, y) => x + y).Collect()
///
/// [('a', 2), ('b', 1)]
/// </summary>
/// <typeparam name="K"></typeparam>
/// <typeparam name="V"></typeparam>
/// <typeparam name="U"></typeparam>
/// <param name="self"></param>
/// <param name="zeroValue"></param>
/// <param name="seqOp"></param>
/// <param name="combOp"></param>
/// <param name="numPartitions"></param>
/// <returns></returns>
public static RDD<KeyValuePair<K, U>> AggregateByKey<K, V, U>(
this RDD<KeyValuePair<K, V>> self,
Func<U> zeroValue,
Func<U, V, U> seqOp,
Func<U, U, U> combOp,
int numPartitions = 0)
{
return self.CombineByKey(zeroValue, seqOp, combOp, numPartitions);
}
/// <summary>
/// Merge the values for each key using an associative function "func"
/// and a neutral "zeroValue" which may be added to the result an
/// arbitrary number of times, and must not change the result
/// (e.g., 0 for addition, or 1 for multiplication.).
///
/// sc.Parallelize(
/// new[]
/// {
/// new KeyValuePair<string, int>("a", 1),
/// new KeyValuePair<string, int>("b", 1),
/// new KeyValuePair<string, int>("a", 1)
/// }, 2)
/// .FoldByKey(() => 0, (x, y) => x + y).Collect()
///
/// [('a', 2), ('b', 1)]
/// </summary>
/// <typeparam name="K"></typeparam>
/// <typeparam name="V"></typeparam>
/// <param name="self"></param>
/// <param name="zeroValue"></param>
/// <param name="func"></param>
/// <param name="numPartitions"></param>
/// <returns></returns>
public static RDD<KeyValuePair<K, V>> FoldByKey<K, V>(
this RDD<KeyValuePair<K, V>> self,
Func<V> zeroValue,
Func<V, V, V> func,
int numPartitions = 0)
{
return self.CombineByKey(zeroValue, func, func, numPartitions);
}
/// <summary>
/// Group the values for each key in the RDD into a single sequence.
/// Hash-partitions the resulting RDD with numPartitions partitions.
///
/// Note: If you are grouping in order to perform an aggregation (such as a
/// sum or average) over each key, using reduceByKey or aggregateByKey will
/// provide much better performance.
///
/// sc.Parallelize(
/// new[]
/// {
/// new KeyValuePair<string, int>("a", 1),
/// new KeyValuePair<string, int>("b", 1),
/// new KeyValuePair<string, int>("a", 1)
/// }, 2)
/// .GroupByKey().MapValues(l => string.Join(" ", l)).Collect()
///
/// [('a', [1, 1]), ('b', [1])]
///
/// </summary>
/// <typeparam name="K"></typeparam>
/// <typeparam name="V"></typeparam>
/// <param name="self"></param>
/// <param name="numPartitions"></param>
/// <returns></returns>
public static RDD<KeyValuePair<K, List<V>>> GroupByKey<K, V>(this RDD<KeyValuePair<K, V>> self, int numPartitions = 0)
{
return CombineByKey(self,
() => new List<V>(),
(c, v) => { c.Add(v); return c; },
(c1, c2) => { c1.AddRange(c2); return c1; },
numPartitions);
}
/// <summary>
/// Pass each value in the key-value pair RDD through a map function
/// without changing the keys; this also retains the original RDD's partitioning.
///
/// sc.Parallelize(
/// new[]
/// {
/// new KeyValuePair<string, string[]>("a", new[]{"apple", "banana", "lemon"}),
/// new KeyValuePair<string, string[]>("b", new[]{"grapes"})
/// }, 2)
/// .MapValues(x => x.Length).Collect()
///
/// [('a', 3), ('b', 1)]
///
/// </summary>
/// <typeparam name="K"></typeparam>
/// <typeparam name="V"></typeparam>
/// <typeparam name="U"></typeparam>
/// <param name="self"></param>
/// <param name="func"></param>
/// <returns></returns>
public static RDD<KeyValuePair<K, U>> MapValues<K, V, U>(this RDD<KeyValuePair<K, V>> self, Func<V, U> func)
{
return self.Map(new MapValuesHelper<K, V, U>(func).Execute, true);
}
/// <summary>
/// Pass each value in the key-value pair RDD through a flatMap function
/// without changing the keys; this also retains the original RDD's partitioning.
///
/// x = sc.Parallelize(
/// new[]
/// {
/// new KeyValuePair<string, string[]>("a", new[]{"x", "y", "z"}),
/// new KeyValuePair<string, string[]>("b", new[]{"p", "r"})
/// }, 2)
/// .FlatMapValues(x => x).Collect()
///
/// [('a', 'x'), ('a', 'y'), ('a', 'z'), ('b', 'p'), ('b', 'r')]
///
/// </summary>
/// <typeparam name="K"></typeparam>
/// <typeparam name="V"></typeparam>
/// <typeparam name="U"></typeparam>
/// <param name="self"></param>
/// <param name="func"></param>
/// <returns></returns>
public static RDD<KeyValuePair<K, U>> FlatMapValues<K, V, U>(this RDD<KeyValuePair<K, V>> self, Func<V, IEnumerable<U>> func)
{
return self.FlatMap(new FlatMapValuesHelper<K, V, U>(func).Execute, true);
}
/// <summary>
/// For each key k in C{self} or C{other}, return a resulting RDD that
/// contains a tuple with the list of values for that key in C{self} as well as C{other}.
///
/// var x = sc.Parallelize(new[] { new KeyValuePair<string, int>("a", 1), new KeyValuePair<string, int>("b", 4) }, 2);
/// var y = sc.Parallelize(new[] { new KeyValuePair<string, int>("a", 2) }, 1);
/// x.GroupWith(y).Collect();
///
/// [('a', ([1], [2])), ('b', ([4], []))]
///
/// </summary>
/// <typeparam name="K"></typeparam>
/// <typeparam name="V"></typeparam>
/// <typeparam name="W"></typeparam>
/// <param name="self"></param>
/// <param name="other"></param>
/// <returns></returns>
public static RDD<KeyValuePair<K, Tuple<List<V>, List<W>>>> GroupWith<K, V, W>(
this RDD<KeyValuePair<K, V>> self,
RDD<KeyValuePair<K, W>> other,
int numPartitions = 0)
{
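// Tag each value with the index of its source RDD (0 for self, 1 for other), union the
// tagged RDDs, and then combine by key into separate per-source value lists.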
return self.MapValues(v => new Tuple<int, dynamic>(0, v))
.Union(other.MapValues(w => new Tuple<int, dynamic>(1, w)))
.CombineByKey(
() => new Tuple<List<V>, List<W>>(new List<V>(), new List<W>()),
(c, v) => { if (v.Item1 == 0) c.Item1.Add((V)v.Item2); else c.Item2.Add((W)v.Item2); return c; },
(c1, c2) => { c1.Item1.AddRange(c2.Item1); c1.Item2.AddRange(c2.Item2); return c1; },
numPartitions);
}
/// <summary>
/// var x = sc.Parallelize(new[] { new KeyValuePair<string, int>("a", 5), new KeyValuePair<string, int>("b", 6) }, 2);
/// var y = sc.Parallelize(new[] { new KeyValuePair<string, int>("a", 1), new KeyValuePair<string, int>("b", 4) }, 2);
/// var z = sc.Parallelize(new[] { new KeyValuePair<string, int>("a", 2) }, 1);
/// x.GroupWith(y, z).Collect();
/// </summary>
/// <typeparam name="K"></typeparam>
/// <typeparam name="V"></typeparam>
/// <typeparam name="W1"></typeparam>
/// <typeparam name="W2"></typeparam>
/// <param name="self"></param>
/// <param name="other1"></param>
/// <param name="other2"></param>
/// <param name="numPartitions"></param>
/// <returns></returns>
public static RDD<KeyValuePair<K, Tuple<List<V>, List<W1>, List<W2>>>> GroupWith<K, V, W1, W2>(
this RDD<KeyValuePair<K, V>> self,
RDD<KeyValuePair<K, W1>> other1,
RDD<KeyValuePair<K, W2>> other2,
int numPartitions = 0)
{
return self.MapValues(v => new Tuple<int, dynamic>(0, v))
.Union(other1.MapValues(w1 => new Tuple<int, dynamic>(1, w1)))
.Union(other2.MapValues(w2 => new Tuple<int, dynamic>(2, w2)))
.CombineByKey(
() => new Tuple<List<V>, List<W1>, List<W2>>(new List<V>(), new List<W1>(), new List<W2>()),
(c, v) => { if (v.Item1 == 0) c.Item1.Add((V)v.Item2); else if (v.Item1 == 1) c.Item2.Add((W1)v.Item2); else c.Item3.Add((W2)v.Item2); return c; },
(c1, c2) => { c1.Item1.AddRange(c2.Item1); c1.Item2.AddRange(c2.Item2); c1.Item3.AddRange(c2.Item3); return c1; },
numPartitions);
}
/// <summary>
/// var x = sc.Parallelize(new[] { new KeyValuePair<string, int>("a", 5), new KeyValuePair<string, int>("b", 6) }, 2);
/// var y = sc.Parallelize(new[] { new KeyValuePair<string, int>("a", 1), new KeyValuePair<string, int>("b", 4) }, 2);
/// var z = sc.Parallelize(new[] { new KeyValuePair<string, int>("a", 2) }, 1);
/// var w = sc.Parallelize(new[] { new KeyValuePair<string, int>("b", 42) }, 1);
/// var m = x.GroupWith(y, z, w).MapValues(l => string.Join(" ", l.Item1) + " : " + string.Join(" ", l.Item2) + " : " + string.Join(" ", l.Item3) + " : " + string.Join(" ", l.Item4)).Collect();
/// </summary>
/// <typeparam name="K"></typeparam>
/// <typeparam name="V"></typeparam>
/// <typeparam name="W1"></typeparam>
/// <typeparam name="W2"></typeparam>
/// <typeparam name="W3"></typeparam>
/// <param name="self"></param>
/// <param name="other1"></param>
/// <param name="other2"></param>
/// <param name="other3"></param>
/// <param name="numPartitions"></param>
/// <returns></returns>
public static RDD<KeyValuePair<K, Tuple<List<V>, List<W1>, List<W2>, List<W3>>>> GroupWith<K, V, W1, W2, W3>(
this RDD<KeyValuePair<K, V>> self,
RDD<KeyValuePair<K, W1>> other1,
RDD<KeyValuePair<K, W2>> other2,
RDD<KeyValuePair<K, W3>> other3,
int numPartitions = 0)
{
return self.MapValues(v => new Tuple<int, dynamic>(0, v))
.Union(other1.MapValues(w1 => new Tuple<int, dynamic>(1, w1)))
.Union(other2.MapValues(w2 => new Tuple<int, dynamic>(2, w2)))
.Union(other3.MapValues(w3 => new Tuple<int, dynamic>(3, w3)))
.CombineByKey(
() => new Tuple<List<V>, List<W1>, List<W2>, List<W3>>(new List<V>(), new List<W1>(), new List<W2>(), new List<W3>()),
(c, v) => { if (v.Item1 == 0) c.Item1.Add((V)v.Item2); else if (v.Item1 == 1) c.Item2.Add((W1)v.Item2); else if (v.Item1 == 2) c.Item3.Add((W2)v.Item2); else c.Item4.Add((W3)v.Item2); return c; },
(c1, c2) => { c1.Item1.AddRange(c2.Item1); c1.Item2.AddRange(c2.Item2); c1.Item3.AddRange(c2.Item3); c1.Item4.AddRange(c2.Item4); return c1; },
numPartitions);
}
/// <summary>
/// Return a subset of this RDD sampled by key (via stratified sampling).
/// Create a sample of this RDD using variable sampling rates for
/// different keys as specified by fractions, a key to sampling rate map.
///
/// var fractions = new Dictionary<string, double> { { "a", 0.2 }, { "b", 0.1 } };
/// var rdd = sc.Parallelize(fractions.Keys.ToArray(), 2).Cartesian(sc.Parallelize(Enumerable.Range(0, 1000), 2));
/// var sample = rdd.Map(t => new KeyValuePair<string, int>(t.Item1, t.Item2)).SampleByKey(false, fractions, 2).GroupByKey().Collect();
///
/// 100 < sample["a"].Length < 300 and 50 < sample["b"].Length < 150
/// true
/// max(sample["a"]) <= 999 and min(sample["a"]) >= 0
/// true
/// max(sample["b"]) <= 999 and min(sample["b"]) >= 0
/// true
///
/// </summary>
/// <typeparam name="K"></typeparam>
/// <typeparam name="V"></typeparam>
/// <param name="self"></param>
/// <param name="withReplacement"></param>
/// <param name="fractions"></param>
/// <param name="seed"></param>
/// <returns></returns>
public static RDD<KeyValuePair<string, V>> SampleByKey<V>(
this RDD<KeyValuePair<string, V>> self,
bool withReplacement,
Dictionary<string, double> fractions,
long seed)
{
return new RDD<KeyValuePair<string, V>>(self.RddProxy.SampleByKey(withReplacement, fractions, seed), self.SparkContext);
}
/// <summary>
/// Return each (key, value) pair in C{self} that has no pair with matching key in C{other}.
///
/// var x = sc.Parallelize(new[] { new KeyValuePair<string, int?>("a", 1), new KeyValuePair<string, int?>("b", 4), new KeyValuePair<string, int?>("b", 5), new KeyValuePair<string, int?>("a", 2) }, 2);
/// var y = sc.Parallelize(new[] { new KeyValuePair<string, int?>("a", 3), new KeyValuePair<string, int?>("c", null) }, 2);
/// x.SubtractByKey(y).Collect();
///
/// [('b', 4), ('b', 5)]
///
/// </summary>
/// <typeparam name="K"></typeparam>
/// <typeparam name="V"></typeparam>
/// <typeparam name="W"></typeparam>
/// <param name="self"></param>
/// <param name="other"></param>
/// <param name="numPartitions"></param>
/// <returns></returns>
public static RDD<KeyValuePair<K, V>> SubtractByKey<K, V, W>(this RDD<KeyValuePair<K, V>> self, RDD<KeyValuePair<K, W>> other, int numPartitions = 0)
{
return self.GroupWith(other, numPartitions).FlatMapValues(t => t.Item1.Where(v => t.Item2.Count == 0));
}
/// <summary>
/// Return the list of values in the RDD for key `key`. This operation
/// is done efficiently if the RDD has a known partitioner by only
/// searching the partition that the key maps to.
///
/// var rdd = sc.Parallelize(Enumerable.Range(0, 1000).Zip(Enumerable.Range(0, 1000), (x, y) => new KeyValuePair<int, int>(x, y)), 10);
/// rdd.Lookup(42)
/// [42]
///
/// </summary>
/// <typeparam name="K"></typeparam>
/// <typeparam name="V"></typeparam>
/// <param name="self"></param>
/// <param name="key"></param>
/// <returns></returns>
public static V[] Lookup<K, V>(this RDD<KeyValuePair<K, V>> self, K key)
{
return self.Filter(new LookupHelper<K, V>(key).Execute).Values().Collect();
}
/// <summary>
/// Output an RDD of key-value pairs (of form C{RDD[(K, V)]}) to any Hadoop file
/// system, using the new Hadoop OutputFormat API (mapreduce package). Keys/values are
/// converted for output using either user specified converters or, by default,
/// L{org.apache.spark.api.python.JavaToWritableConverter}.
/// </summary>
/// <typeparam name="K"></typeparam>
/// <typeparam name="V"></typeparam>
/// <param name="self"></param>
/// <param name="conf">Hadoop job configuration, passed in as a dict</param>
public static void SaveAsNewAPIHadoopDataset<K, V>(this RDD<KeyValuePair<K, V>> self, IEnumerable<KeyValuePair<string, string>> conf)
{
self.RddProxy.SaveAsNewAPIHadoopDataset(conf);
}
/// <summary>
/// Output an RDD of key-value pairs (of form C{RDD[(K, V)]}) to any Hadoop file
/// system, using the new Hadoop OutputFormat API (mapreduce package).
/// </summary>
/// <typeparam name="K"></typeparam>
/// <typeparam name="V"></typeparam>
/// <param name="self"></param>
/// <param name="path">path to Hadoop file</param>
/// <param name="outputFormatClass">fully qualified classname of Hadoop OutputFormat (e.g. "org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat")</param>
/// <param name="keyClass">fully qualified classname of key Writable class (e.g. "org.apache.hadoop.io.IntWritable", None by default)</param>
/// <param name="valueClass">fully qualified classname of value Writable class (e.g. "org.apache.hadoop.io.Text", None by default)</param>
/// <param name="conf">Hadoop job configuration, passed in as a dict (None by default)</param>
public static void SaveAsNewAPIHadoopFile<K, V>(this RDD<KeyValuePair<K, V>> self, string path, string outputFormatClass, string keyClass, string valueClass, IEnumerable<KeyValuePair<string, string>> conf)
{
self.RddProxy.SaveAsNewAPIHadoopFile(path, outputFormatClass, keyClass, valueClass, conf);
}
/// <summary>
/// Output an RDD of key-value pairs (of form C{RDD[(K, V)]}) to any Hadoop file
/// system, using the old Hadoop OutputFormat API (mapred package). Keys/values are
/// converted for output using either user specified converters or, by default,
/// L{org.apache.spark.api.python.JavaToWritableConverter}.
/// </summary>
/// <typeparam name="K"></typeparam>
/// <typeparam name="V"></typeparam>
/// <param name="self"></param>
/// <param name="conf">Hadoop job configuration, passed in as a dict</param>
public static void SaveAsHadoopDataset<K, V>(this RDD<KeyValuePair<K, V>> self, IEnumerable<KeyValuePair<string, string>> conf)
{
self.RddProxy.SaveAsHadoopDataset(conf);
}
/// <summary>
/// Output an RDD of key-value pairs (of form C{RDD[(K, V)]}) to any Hadoop file
/// system, using the old Hadoop OutputFormat API (mapred package). Key and value types
/// will be inferred if not specified. Keys and values are converted for output using either
/// user specified converters or L{org.apache.spark.api.python.JavaToWritableConverter}. The
/// C{conf} is applied on top of the base Hadoop conf associated with the SparkContext
/// of this RDD to create a merged Hadoop MapReduce job configuration for saving the data.
/// </summary>
/// <typeparam name="K"></typeparam>
/// <typeparam name="V"></typeparam>
/// <param name="self"></param>
/// <param name="path">path to Hadoop file</param>
/// <param name="outputFormatClass">fully qualified classname of Hadoop OutputFormat (e.g. "org.apache.hadoop.mapred.SequenceFileOutputFormat")</param>
/// <param name="keyClass">fully qualified classname of key Writable class (e.g. "org.apache.hadoop.io.IntWritable", None by default)</param>
/// <param name="valueClass">fully qualified classname of value Writable class (e.g. "org.apache.hadoop.io.Text", None by default)</param>
/// <param name="conf">(None by default)</param>
/// <param name="compressionCodecClass">(None by default)</param>
public static void saveAsHadoopFile<K, V>(this RDD<KeyValuePair<K, V>> self, string path, string outputFormatClass, string keyClass, string valueClass, IEnumerable<KeyValuePair<string, string>> conf, string compressionCodecClass)
{
self.RddProxy.saveAsHadoopFile(path, outputFormatClass, keyClass, valueClass, conf, compressionCodecClass);
}
/// <summary>
/// Output an RDD of key-value pairs (of form C{RDD[(K, V)]}) to any Hadoop file
/// system, using the L{org.apache.hadoop.io.Writable} types that we convert from the
/// RDD's key and value types. The mechanism is as follows:
///
/// 1. Pyrolite is used to convert pickled Python RDD into RDD of Java objects.
/// 2. Keys and values of this Java RDD are converted to Writables and written out.
///
/// </summary>
/// <typeparam name="K"></typeparam>
/// <typeparam name="V"></typeparam>
/// <param name="self"></param>
/// <param name="path">path to sequence file</param>
/// <param name="compressionCodecClass">(None by default)</param>
public static void SaveAsSequenceFile<K, V>(this RDD<KeyValuePair<K, V>> self, string path, string compressionCodecClass)
{
self.RddProxy.SaveAsSequenceFile(path, compressionCodecClass);
}
/// <summary>
/// These classes are defined explicitly and marked as [Serializable] instead of using anonymous methods as delegates to
/// prevent the C# compiler from generating private anonymous types that are not serializable. Since the delegate has to be
/// serialized and sent to the Spark workers for execution, it is necessary to have the type marked [Serializable].
/// These classes work around the limitation on the serializability of compiler-generated types.
/// </summary>
[Serializable]
private class GroupByMergeHelper<K, C>
{
private readonly Func<C, C, C> mergeCombiners;
public GroupByMergeHelper(Func<C, C, C> mc)
{
mergeCombiners = mc;
}
public IEnumerable<KeyValuePair<K, C>> Execute(IEnumerable<KeyValuePair<K, C>> input)
{
return input.GroupBy(
kvp => kvp.Key,
kvp => kvp.Value,
(k, v) => new KeyValuePair<K, C>(k, v.Aggregate(mergeCombiners))
);
}
}
[Serializable]
private class GroupByCombineHelper<K, V, C>
{
private readonly Func<C> createCombiner;
private readonly Func<C, V, C> mergeValue;
public GroupByCombineHelper(Func<C> createCombiner, Func<C, V, C> mergeValue)
{
this.createCombiner = createCombiner;
this.mergeValue = mergeValue;
}
public IEnumerable<KeyValuePair<K, C>> Execute(IEnumerable<KeyValuePair<K, V>> input)
{
return input.GroupBy(
kvp => kvp.Key,
kvp => kvp.Value,
(k, v) => new KeyValuePair<K, C>(k, v.Aggregate(createCombiner(), mergeValue))
);
}
}
[Serializable]
private class AddShuffleKeyHelper<K1, V1>
{
[NonSerialized]
private static MD5 md5 = MD5.Create();
public IEnumerable<byte[]> Execute(int split, IEnumerable<KeyValuePair<K1, V1>> input)
{
IFormatter formatter = new BinaryFormatter();
foreach (var kvp in input)
{
var ms = new MemoryStream();
formatter.Serialize(ms, kvp.Key);
// hash the buffered key bytes; hashing the stream directly would read from its current
// position (the end, after Serialize) and produce a digest of empty input
yield return md5.ComputeHash(ms.ToArray()).Take(8).ToArray();
ms = new MemoryStream();
formatter.Serialize(ms, kvp);
yield return ms.ToArray();
}
}
}
[Serializable]
private class MapValuesHelper<K, V, U>
{
private readonly Func<V, U> func;
public MapValuesHelper(Func<V, U> f)
{
func = f;
}
public KeyValuePair<K, U> Execute(KeyValuePair<K, V> kvp)
{
return new KeyValuePair<K, U>
(
kvp.Key,
func(kvp.Value)
);
}
}
[Serializable]
private class FlatMapValuesHelper<K, V, U>
{
private readonly Func<V, IEnumerable<U>> func;
public FlatMapValuesHelper(Func<V, IEnumerable<U>> f)
{
func = f;
}
public IEnumerable<KeyValuePair<K, U>> Execute(KeyValuePair<K, V> kvp)
{
return func(kvp.Value).Select(v => new KeyValuePair<K, U>(kvp.Key, v));
}
}
[Serializable]
internal class LookupHelper<K, V>
{
private readonly K key;
internal LookupHelper(K key)
{
this.key = key;
}
internal bool Execute(KeyValuePair<K, V> input)
{
return input.Key.ToString() == key.ToString();
}
}
}
}
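As a usage illustration for the Hadoop output extension methods above, the following sketch (not part of this commit) assumes an already-initialized SparkContext `sc`, `using System.Collections.Generic;`, and that the named Hadoop output format and Writable classes are available on the cluster classpath; the paths are placeholders.
```
// illustrative sketch only: saving a pair RDD through the Hadoop APIs defined above
var pairs = sc.Parallelize(new[]
{
    new KeyValuePair<int, string>(1, "one"),
    new KeyValuePair<int, string>(2, "two")
}, 2);

var hadoopConf = new Dictionary<string, string>(); // no extra Hadoop settings in this sketch

pairs.SaveAsNewAPIHadoopFile(
    @"hdfs://path/to/output",                                          // placeholder path
    "org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat", // class names taken from the XML docs above
    "org.apache.hadoop.io.IntWritable",
    "org.apache.hadoop.io.Text",
    hadoopConf);

// or write a sequence file directly; null means no compression codec
pairs.SaveAsSequenceFile(@"hdfs://path/to/seqfile", null);
```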

View file

@ -0,0 +1,108 @@
// Copyright (c) Microsoft. All rights reserved.
// Licensed under the MIT license. See LICENSE file in the project root for full license information.
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Runtime.Serialization;
using System.Runtime.Serialization.Formatters.Binary;
using System.Text;
using System.Threading.Tasks;
using Microsoft.Spark.CSharp.Interop;
using Microsoft.Spark.CSharp.Proxy;
namespace Microsoft.Spark.CSharp.Core
{
/// <summary>
/// Wraps C#-based transformations that can be executed within a single stage. Pipelining the C# transformations
/// avoids unnecessary Ser/De of data between JVM & CLR for each individual transformation
/// </summary>
/// <typeparam name="U"></typeparam>
[Serializable]
public class PipelinedRDD<U> : RDD<U>
{
internal Func<int, IEnumerable<dynamic>, IEnumerable<dynamic>> func; //using dynamic types to keep deserialization simple on the worker side
internal bool preservesPartitioning;
//TODO - give generic types a better id
public override RDD<U1> MapPartitionsWithIndex<U1>(Func<int, IEnumerable<U>, IEnumerable<U1>> newFunc, bool preservesPartitioningParam = false)
{
if (IsPipelinable())
{
var pipelinedRDD = new PipelinedRDD<U1>
{
func = new MapPartitionsWithIndexHelper(new NewFuncWrapper<U, U1>(newFunc).Execute, func).Execute,
preservesPartitioning = preservesPartitioning && preservesPartitioningParam,
previousRddProxy = previousRddProxy,
prevSerializedMode = prevSerializedMode,
sparkContext = sparkContext,
rddProxy = null,
serializedMode = SerializedMode.Byte
};
return pipelinedRDD;
}
return base.MapPartitionsWithIndex(newFunc, preservesPartitioningParam);
}
[Serializable]
private class NewFuncWrapper<I, O>
{
private Func<int, IEnumerable<I>, IEnumerable<O>> func;
internal NewFuncWrapper(Func<int, IEnumerable<I>, IEnumerable<O>> f)
{
func = f;
}
internal IEnumerable<dynamic> Execute(int val, IEnumerable<dynamic> input)
{
return func(val, input.Cast<I>()).Cast<dynamic>();
}
}
/// <summary>
/// This class is defined explicitly instead of using anonymous method as delegate to prevent C# compiler from generating
/// private anonymous type that is not serializable. Since the delegate has to be serialized and sent to the Spark workers
/// for execution, it is necessary to have the type marked [Serializable]. This class is to work around the limitation
/// on the serializability of compiler generated types
/// </summary>
[Serializable]
private class MapPartitionsWithIndexHelper
{
private readonly Func<int, IEnumerable<dynamic>, IEnumerable<dynamic>> newFunc;
private readonly Func<int, IEnumerable<dynamic>, IEnumerable<dynamic>> prevFunc;
internal MapPartitionsWithIndexHelper(Func<int, IEnumerable<dynamic>, IEnumerable<dynamic>> nFunc, Func<int, IEnumerable<dynamic>, IEnumerable<dynamic>> pFunc)
{
prevFunc = pFunc;
newFunc = nFunc;
}
internal IEnumerable<dynamic> Execute(int split, IEnumerable<dynamic> input)
{
return newFunc(split, prevFunc(split, input));
}
}
private bool IsPipelinable()
{
return !(isCached || isCheckpointed);
}
internal override IRDDProxy RddProxy
{
get
{
if (rddProxy == null)
{
rddProxy = sparkContext.SparkContextProxy.CreateCSharpRdd(previousRddProxy,
SparkContext.BuildCommand(func, prevSerializedMode, bypassSerializer ? SerializedMode.None : serializedMode),
null, null, preservesPartitioning, sparkContext.broadcastVars, null);
}
return rddProxy;
}
}
}
}
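To illustrate the pipelining described in the class summary, here is a hedged sketch (not part of this commit). It assumes `sc` is an existing SparkContext, `using System.Linq;`, and that the base RDD.MapPartitionsWithIndex (defined in RDD.cs, whose diff is not shown) produces a PipelinedRDD, as the `base` call in the override suggests.
```
// sketch only: two C# transformations that stay in one pipelined stage
var numbers = sc.Parallelize(Enumerable.Range(0, 1000), 4);

var doubled = numbers.MapPartitionsWithIndex<int>(
    (split, items) => items.Select(i => i * 2));       // first C# transformation

var filtered = doubled.MapPartitionsWithIndex<int>(
    (split, items) => items.Where(i => i % 3 == 0));   // as long as `doubled` is neither cached nor
                                                       // checkpointed, this composes the two delegates
                                                       // via MapPartitionsWithIndexHelper instead of
                                                       // creating another JVM/CLR round trip

var result = filtered.Collect();                       // one worker invocation per partition
```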

View file

@ -0,0 +1,16 @@
// Copyright (c) Microsoft. All rights reserved.
// Licensed under the MIT license. See LICENSE file in the project root for full license information.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
namespace Microsoft.Spark.CSharp.Core
{
//TODO - complete the impl
public class Profiler
{
}
}

The diff for this file is not shown because it is too large.

View file

@ -0,0 +1,121 @@
// Copyright (c) Microsoft. All rights reserved.
// Licensed under the MIT license. See LICENSE file in the project root for full license information.
using System;
using Microsoft.Spark.CSharp.Configuration;
using Microsoft.Spark.CSharp.Interop;
using Microsoft.Spark.CSharp.Proxy;
using Microsoft.Spark.CSharp.Services;
namespace Microsoft.Spark.CSharp.Core
{
/// <summary>
/// Configuration for a Spark application. Used to set various Spark parameters as key-value pairs.
///
/// Note that once a SparkConf object is passed to Spark, it is cloned and can no longer be modified
/// by the user. Spark does not support modifying the configuration at runtime.
///
/// See also http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkConf
/// </summary>
public class SparkConf
{
private ILoggerService logger = LoggerServiceFactory.GetLogger(typeof(SparkConf));
internal ISparkConfProxy sparkConfProxy;
/// <summary>
/// Create SparkConf
/// </summary>
/// <param name="loadDefaults">indicates whether to also load values from Java system properties</param>
public SparkConf(bool loadDefaults = true)
{
SetSparkConfProxy();
sparkConfProxy.CreateSparkConf(loadDefaults);
//special handling for debug mode because
//spark.master and spark.app.name will not be set in debug mode
//driver code may override these values if SetMaster or SetAppName methods are used
if (string.IsNullOrWhiteSpace(Get("spark.master", "")))
{
logger.LogInfo("spark.master not set. Assuming debug mode.");
SetMaster("local");
}
if (string.IsNullOrWhiteSpace(Get("spark.app.name", "")))
{
logger.LogInfo("spark.app.name not set. Assuming debug mode");
SetAppName("debug app");
}
}
private void SetSparkConfProxy()
{
sparkConfProxy = SparkCLREnvironment.SparkConfProxy;
}
/// <summary>
/// The master URL to connect to, such as "local" to run locally with one thread, "local[4]" to
/// run locally with 4 cores, or "spark://master:7077" to run on a Spark standalone cluster.
/// </summary>
/// <param name="master">Spark master</param>
public SparkConf SetMaster(string master)
{
sparkConfProxy.SetMaster(master);
return this;
}
/// <summary>
/// Set a name for your application. Shown in the Spark web UI.
/// </summary>
/// <param name="appName">Name of the app</param>
public SparkConf SetAppName(string appName)
{
sparkConfProxy.SetAppName(appName);
return this;
}
/// <summary>
/// Set the location where Spark is installed on worker nodes.
/// </summary>
/// <param name="sparkHome"></param>
/// <returns></returns>
public SparkConf SetSparkHome(string sparkHome)
{
sparkConfProxy.SetSparkHome(sparkHome);
return this;
}
/// <summary>
/// Set the value of a string config
/// </summary>
/// <param name="key">Config name</param>
/// <param name="value">Config value</param>
public SparkConf Set(string key, string value)
{
sparkConfProxy.Set(key, value);
return this;
}
/// <summary>
/// Get a int parameter value, falling back to a default if not set
/// </summary>
/// <param name="key">Key to use</param>
/// <param name="defaultValue">Default value to use</param>
public int GetInt(string key, int defaultValue)
{
return sparkConfProxy.GetInt(key, defaultValue);
}
/// <summary>
/// Get a string parameter value, falling back to a default if not set
/// </summary>
/// <param name="key">Key to use</param>
/// <param name="defaultValue">Default value to use</param>
public string Get(string key, string defaultValue)
{
return sparkConfProxy.Get(key, defaultValue);
}
}
}
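A brief usage sketch (not part of this commit) of the fluent configuration API above; it assumes the SparkCLR environment is initialized, and the path and config values are placeholders.
```
// sketch only: the setters return the SparkConf instance, so calls can be chained
var conf = new SparkConf()
    .SetMaster("spark://master:7077")
    .SetAppName("SparkCLR sample app")
    .SetSparkHome(@"D:\spark")            // placeholder install location on the workers
    .Set("spark.executor.memory", "2g");  // arbitrary string settings go through Set

// reads fall back to the supplied default when the key is not set
var appName = conf.Get("spark.app.name", "unknown");
var cores = conf.GetInt("spark.executor.cores", 1);
```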

View file

@ -0,0 +1,525 @@
// Copyright (c) Microsoft. All rights reserved.
// Licensed under the MIT license. See LICENSE file in the project root for full license information.
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Reflection;
using System.Runtime.Serialization.Formatters.Binary;
using System.Text;
using System.Threading.Tasks;
using System.Net;
using System.Net.Sockets;
using Microsoft.Spark.CSharp.Interop;
using Microsoft.Spark.CSharp.Proxy;
namespace Microsoft.Spark.CSharp.Core
{
public class SparkContext
{
internal ISparkContextProxy SparkContextProxy { get; private set; }
internal List<Broadcast> broadcastVars = new List<Broadcast>();
private AccumulatorServer accumulatorServer;
private int nextAccumulatorId;
/// <summary>
/// The version of Spark on which this application is running.
/// </summary>
public string Version
{
get { return SparkContextProxy.Version; }
}
/// <summary>
/// Return the epoch time when the Spark Context was started.
/// </summary>
public long StartTime
{
get { return SparkContextProxy.StartTime; }
}
/// <summary>
/// Default level of parallelism to use when not given by user (e.g. for reduce tasks)
/// </summary>
public int DefaultParallelism
{
get { return SparkContextProxy.DefaultParallelism; }
}
/// <summary>
/// Default min number of partitions for Hadoop RDDs when not given by user
/// </summary>
public int DefaultMinPartitions
{
get { return SparkContextProxy.DefaultMinPartitions; }
}
/// <summary>
/// Get SPARK_USER for user who is running SparkContext.
/// </summary>
public string SparkUser { get { return SparkContextProxy.SparkUser; } }
/// <summary>
/// Return :class:`StatusTracker` object
/// </summary>
public StatusTracker StatusTracker { get { return new StatusTracker(SparkContextProxy.StatusTracker); } }
public SparkContext(string master, string appName, string sparkHome)
: this(master, appName, sparkHome, null)
{
}
public SparkContext(string master, string appName)
: this(master, appName, null, null)
{
}
public SparkContext(SparkConf conf)
: this(null, null, null, conf)
{
}
private SparkContext(string master, string appName, string sparkHome, SparkConf conf)
{
SparkContextProxy = SparkCLREnvironment.SparkContextProxy;
SparkContextProxy.CreateSparkContext(master, appName, sparkHome, conf == null ? null : conf.sparkConfProxy); //conf is null when the (master, appName) constructors are used
// AddDriverFilesToSparkContext and AddWorkerToSparkContext
foreach (var file in SparkCLREnvironment.ConfigurationService.GetDriverFiles())
{
AddFile(file);
}
AddFile(SparkCLREnvironment.ConfigurationService.GetCSharpWorkerPath());
string host = "localhost";
accumulatorServer = new AccumulatorServer(host);
int port = accumulatorServer.StartUpdateServer();
SparkContextProxy.Accumulator(host, port);
}
public RDD<string> TextFile(string filePath, int minPartitions = 0)
{
return new RDD<string>(SparkContextProxy.TextFile(filePath, minPartitions), this, SerializedMode.String);
}
/// <summary>
/// Distribute a local collection to form an RDD.
///
/// sc.Parallelize(new int[] {0, 2, 3, 4, 6}, 5).Glom().Collect()
/// [[0], [2], [3], [4], [6]]
///
/// </summary>
/// <typeparam name="T"></typeparam>
/// <param name="serializableObjects"></param>
/// <param name="numSlices"></param>
/// <returns></returns>
public RDD<T> Parallelize<T>(IEnumerable<T> serializableObjects, int? numSlices)
{
List<byte[]> collectionOfByteRepresentationOfObjects = new List<byte[]>();
foreach (T obj in serializableObjects)
{
var memoryStream = new MemoryStream();
var formatter = new BinaryFormatter();
formatter.Serialize(memoryStream, obj);
collectionOfByteRepresentationOfObjects.Add(memoryStream.ToArray());
}
return new RDD<T>(SparkContextProxy.Parallelize(collectionOfByteRepresentationOfObjects, numSlices), this);
}
/// <summary>
/// Create an RDD that has no partitions or elements.
/// </summary>
/// <typeparam name="T"></typeparam>
/// <returns></returns>
public RDD<T> EmptyRDD<T>()
{
return new RDD<T>(SparkContextProxy.EmptyRDD<T>(), this, SerializedMode.None);
}
/// <summary>
/// Read a directory of text files from HDFS, a local file system (available on all nodes), or any
/// Hadoop-supported file system URI. Each file is read as a single record and returned in a
/// key-value pair, where the key is the path of each file, the value is the content of each file.
///
/// <p> For example, if you have the following files:
/// {{{
/// hdfs://a-hdfs-path/part-00000
/// hdfs://a-hdfs-path/part-00001
/// ...
/// hdfs://a-hdfs-path/part-nnnnn
/// }}}
///
/// Do
/// {{{
/// JavaPairRDD<String, String> rdd = sparkContext.wholeTextFiles("hdfs://a-hdfs-path")
/// }}}
///
/// <p> then `rdd` contains
/// {{{
/// (a-hdfs-path/part-00000, its content)
/// (a-hdfs-path/part-00001, its content)
/// ...
/// (a-hdfs-path/part-nnnnn, its content)
/// }}}
///
/// @note Small files are preferred; large files are also allowable, but may cause bad performance.
///
/// @param minPartitions A suggestion value of the minimal splitting number for input data.
/// </summary>
/// <param name="filePath"></param>
/// <param name="minPartitions"></param>
/// <returns></returns>
public RDD<KeyValuePair<byte[], byte[]>> WholeTextFiles(string filePath, int? minPartitions = null)
{
return new RDD<KeyValuePair<byte[], byte[]>>(SparkContextProxy.WholeTextFiles(filePath, minPartitions ?? DefaultMinPartitions), this, SerializedMode.Pair);
}
/// <summary>
/// Read a directory of binary files from HDFS, a local file system (available on all nodes),
/// or any Hadoop-supported file system URI as a byte array. Each file is read as a single
/// record and returned in a key-value pair, where the key is the path of each file,
/// the value is the content of each file.
///
/// For example, if you have the following files:
/// {{{
/// hdfs://a-hdfs-path/part-00000
/// hdfs://a-hdfs-path/part-00001
/// ...
/// hdfs://a-hdfs-path/part-nnnnn
/// }}}
///
/// Do
/// `JavaPairRDD<String, byte[]> rdd = sparkContext.dataStreamFiles("hdfs://a-hdfs-path")`,
///
/// then `rdd` contains
/// {{{
/// (a-hdfs-path/part-00000, its content)
/// (a-hdfs-path/part-00001, its content)
/// ...
/// (a-hdfs-path/part-nnnnn, its content)
/// }}}
///
/// @note Small files are preferred; very large files are also allowable, but may cause bad performance.
///
/// @param minPartitions A suggestion value of the minimal splitting number for input data.
/// </summary>
/// <param name="filePath"></param>
/// <param name="minPartitions"></param>
/// <returns></returns>
public RDD<KeyValuePair<byte[], byte[]>> BinaryFiles(string filePath, int? minPartitions)
{
return new RDD<KeyValuePair<byte[], byte[]>>(SparkContextProxy.BinaryFiles(filePath, minPartitions ?? DefaultMinPartitions), this, SerializedMode.Pair);
}
/// <summary>
/// Read a Hadoop SequenceFile with arbitrary key and value Writable class from HDFS,
/// a local file system (available on all nodes), or any Hadoop-supported file system URI.
/// The mechanism is as follows:
///
/// 1. A Java RDD is created from the SequenceFile or other InputFormat, and the key
/// and value Writable classes
/// 2. Serialization is attempted via Pyrolite pickling
/// 3. If this fails, the fallback is to call 'toString' on each key and value
/// 4. C{PickleSerializer} is used to deserialize pickled objects on the Python side
///
/// </summary>
/// <param name="filePath">path to sequncefile</param>
/// <param name="keyClass">fully qualified classname of key Writable class (e.g. "org.apache.hadoop.io.Text")</param>
/// <param name="valueClass">fully qualified classname of value Writable class (e.g. "org.apache.hadoop.io.LongWritable")</param>
/// <param name="keyConverterClass"></param>
/// <param name="valueConverterClass"></param>
/// <param name="minSplits">minimum splits in dataset (default min(2, sc.defaultParallelism))</param>
/// <returns></returns>
public RDD<byte[]> SequenceFile(string filePath, string keyClass, string valueClass, string keyConverterClass, string valueConverterClass, int? minSplits)
{
return new RDD<byte[]>(SparkContextProxy.SequenceFile(filePath, keyClass, valueClass, keyConverterClass, valueConverterClass, minSplits ?? Math.Min(DefaultParallelism, 2), 1), this, SerializedMode.None);
}
/// <summary>
/// Read a 'new API' Hadoop InputFormat with arbitrary key and value class from HDFS,
/// a local file system (available on all nodes), or any Hadoop-supported file system URI.
/// The mechanism is the same as for sc.sequenceFile.
///
/// A Hadoop configuration can be passed in as a Python dict. This will be converted into a Configuration in Java
///
/// </summary>
/// <param name="filePath">path to Hadoop file</param>
/// <param name="inputFormatClass">fully qualified classname of Hadoop InputFormat (e.g. "org.apache.hadoop.mapreduce.lib.input.TextInputFormat")</param>
/// <param name="keyClass">fully qualified classname of key Writable class (e.g. "org.apache.hadoop.io.Text")</param>
/// <param name="valueClass">fully qualified classname of value Writable class (e.g. "org.apache.hadoop.io.LongWritable")</param>
/// <param name="keyConverterClass">(None by default)</param>
/// <param name="valueConverterClass">(None by default)</param>
/// <param name="conf"> Hadoop configuration, passed in as a dict (None by default)</param>
/// <param name="batchSize"></param>
/// <returns></returns>
public RDD<byte[]> NewAPIHadoopFile(string filePath, string inputFormatClass, string keyClass, string valueClass, string keyConverterClass = null, string valueConverterClass = null, IEnumerable<KeyValuePair<string, string>> conf = null)
{
return new RDD<byte[]>(SparkContextProxy.NewAPIHadoopFile(filePath, inputFormatClass, keyClass, valueClass, keyConverterClass, valueConverterClass, conf, 1), this, SerializedMode.None);
}
/// <summary>
/// Read a 'new API' Hadoop InputFormat with arbitrary key and value class, from an arbitrary
/// Hadoop configuration, which is passed in as a Python dict.
/// This will be converted into a Configuration in Java.
/// The mechanism is the same as for sc.sequenceFile.
///
/// </summary>
/// <param name="inputFormatClass">fully qualified classname of Hadoop InputFormat (e.g. "org.apache.hadoop.mapreduce.lib.input.TextInputFormat")</param>
/// <param name="keyClass">fully qualified classname of key Writable class (e.g. "org.apache.hadoop.io.Text")</param>
/// <param name="valueClass">fully qualified classname of value Writable class (e.g. "org.apache.hadoop.io.LongWritable")</param>
/// <param name="keyConverterClass">(None by default)</param>
/// <param name="valueConverterClass">(None by default)</param>
/// <param name="conf">Hadoop configuration, passed in as a dict (None by default)</param>
/// <returns></returns>
public RDD<byte[]> NewAPIHadoopRDD(string inputFormatClass, string keyClass, string valueClass, string keyConverterClass = null, string valueConverterClass = null, IEnumerable<KeyValuePair<string, string>> conf = null)
{
return new RDD<byte[]>(SparkContextProxy.NewAPIHadoopRDD(inputFormatClass, keyClass, valueClass, keyConverterClass, valueConverterClass, conf, 1), this, SerializedMode.None);
}
/// <summary>
/// Read an 'old' Hadoop InputFormat with arbitrary key and value class from HDFS,
/// a local file system (available on all nodes), or any Hadoop-supported file system URI.
/// The mechanism is the same as for sc.sequenceFile.
///
/// A Hadoop configuration can be passed in as a Python dict. This will be converted into a Configuration in Java.
///
/// </summary>
/// <param name="filePath">path to Hadoop file</param>
/// <param name="inputFormatClass">fully qualified classname of Hadoop InputFormat (e.g. "org.apache.hadoop.mapred.TextInputFormat")</param>
/// <param name="keyClass">fully qualified classname of key Writable class (e.g. "org.apache.hadoop.io.Text")</param>
/// <param name="valueClass">fully qualified classname of value Writable class (e.g. "org.apache.hadoop.io.LongWritable")</param>
/// <param name="keyConverterClass">(None by default)</param>
/// <param name="valueConverterClass">(None by default)</param>
/// <param name="conf">Hadoop configuration, passed in as a dict (None by default)</param>
/// <returns></returns>
public RDD<byte[]> HadoopFile(string filePath, string inputFormatClass, string keyClass, string valueClass, string keyConverterClass = null, string valueConverterClass = null, IEnumerable<KeyValuePair<string, string>> conf = null)
{
return new RDD<byte[]>(SparkContextProxy.HadoopFile(filePath, inputFormatClass, keyClass, valueClass, keyConverterClass, valueConverterClass, conf, 1), this, SerializedMode.None);
}
/// <summary>
/// Read an 'old' Hadoop InputFormat with arbitrary key and value class, from an arbitrary
/// Hadoop configuration, which is passed in as a Python dict.
/// This will be converted into a Configuration in Java.
/// The mechanism is the same as for sc.sequenceFile.
///
/// </summary>
/// <param name="inputFormatClass">fully qualified classname of Hadoop InputFormat (e.g. "org.apache.hadoop.mapred.TextInputFormat")</param>
/// <param name="keyClass">fully qualified classname of key Writable class (e.g. "org.apache.hadoop.io.Text")</param>
/// <param name="valueClass">fully qualified classname of value Writable class (e.g. "org.apache.hadoop.io.LongWritable")</param>
/// <param name="keyConverterClass">(None by default)</param>
/// <param name="valueConverterClass">(None by default)</param>
/// <param name="conf">Hadoop configuration, passed in as a dict (None by default)</param>
/// <returns></returns>
public RDD<byte[]> HadoopRDD(string inputFormatClass, string keyClass, string valueClass, string keyConverterClass = null, string valueConverterClass = null, IEnumerable<KeyValuePair<string, string>> conf = null)
{
return new RDD<byte[]>(SparkContextProxy.HadoopRDD(inputFormatClass, keyClass, valueClass, keyConverterClass, valueConverterClass, conf, 1), this, SerializedMode.None);
}
internal RDD<T> CheckpointFile<T>(string filePath, SerializedMode serializedMode)
{
return new RDD<T>(SparkContextProxy.CheckpointFile(filePath), this, serializedMode);
}
/// <summary>
/// Build the union of a list of RDDs.
///
/// This supports unions() of RDDs with different serialized formats,
/// although this forces them to be reserialized using the default serializer:
///
/// >>> path = os.path.join(tempdir, "union-text.txt")
/// >>> with open(path, "w") as testFile:
/// ... _ = testFile.write("Hello")
/// >>> textFile = sc.textFile(path)
/// >>> textFile.collect()
/// [u'Hello']
/// >>> parallelized = sc.parallelize(["World!"])
/// >>> sorted(sc.union([textFile, parallelized]).collect())
/// [u'Hello', 'World!']
/// </summary>
/// <typeparam name="T"></typeparam>
/// <param name="rdds"></param>
/// <returns></returns>
public RDD<T> Union<T>(IEnumerable<RDD<T>> rdds)
{
return new RDD<T>(SparkContextProxy.Union(rdds), this, rdds.FirstOrDefault().serializedMode);
}
/// <summary>
/// Broadcast a read-only variable to the cluster, returning a Broadcast
/// object for reading it in distributed functions. The variable will
/// be sent to each cluster only once.
/// </summary>
/// <typeparam name="T"></typeparam>
/// <param name="value"></param>
/// <returns></returns>
public Broadcast<T> Broadcast<T>(T value)
{
var broadcast = new Broadcast<T>(this, value);
broadcastVars.Add(broadcast);
return broadcast;
}
/// <summary>
/// Create an L{Accumulator} with the given initial value, using a given
/// L{AccumulatorParam} helper object to define how to add values of the
/// data type if provided. Default AccumulatorParams are used for integers
/// and floating-point numbers if you do not provide one. For other types,
/// a custom AccumulatorParam can be used.
/// </summary>
/// <typeparam name="T"></typeparam>
/// <param name="value"></param>
/// <returns></returns>
public Accumulator<T> Accumulator<T>(T value)
{
return new Accumulator<T>(nextAccumulatorId++, value);
}
/// <summary>
/// Shut down the SparkContext.
/// </summary>
public void Stop()
{
SparkContextProxy.Stop();
accumulatorServer.Shutdown();
}
/// <summary>
/// Add a file to be downloaded with this Spark job on every node.
/// The `path` passed can be either a local file, a file in HDFS (or other Hadoop-supported
/// filesystems), or an HTTP, HTTPS or FTP URI. To access the file in Spark jobs,
/// use `SparkFiles.get(fileName)` to find its download location.
/// </summary>
/// <param name="path"></param>
public void AddFile(string path)
{
SparkContextProxy.AddFile(path);
}
/// <summary>
/// Set the directory under which RDDs are going to be checkpointed. The directory must
/// be a HDFS path if running on a cluster.
/// </summary>
/// <param name="directory"></param>
public void SetCheckpointDir(string directory)
{
SparkContextProxy.SetCheckpointDir(directory);
}
/// <summary>
/// Assigns a group ID to all the jobs started by this thread until the group ID is set to a
/// different value or cleared.
///
/// Often, a unit of execution in an application consists of multiple Spark actions or jobs.
/// Application programmers can use this method to group all those jobs together and give a
/// group description. Once set, the Spark web UI will associate such jobs with this group.
///
/// The application can also use [[org.apache.spark.api.java.JavaSparkContext.cancelJobGroup]]
/// to cancel all running jobs in this group. For example,
/// {{{
/// // In the main thread:
/// sc.setJobGroup("some_job_to_cancel", "some job description");
/// rdd.map(...).count();
///
/// // In a separate thread:
/// sc.cancelJobGroup("some_job_to_cancel");
/// }}}
///
/// If interruptOnCancel is set to true for the job group, then job cancellation will result
/// in Thread.interrupt() being called on the job's executor threads. This is useful to help ensure
/// that the tasks are actually stopped in a timely manner, but is off by default due to HDFS-1208,
/// where HDFS may respond to Thread.interrupt() by marking nodes as dead.
/// </summary>
/// <param name="groupId"></param>
/// <param name="description"></param>
/// <param name="interruptOnCancel"></param>
public void SetJobGroup(string groupId, string description, bool interruptOnCancel = false)
{
SparkContextProxy.SetJobGroup(groupId, description, interruptOnCancel);
}
/// <summary>
/// Set a local property that affects jobs submitted from this thread, such as the
/// Spark fair scheduler pool.
/// </summary>
/// <param name="key"></param>
/// <param name="value"></param>
public void SetLocalProperty(string key, string value)
{
SparkContextProxy.SetLocalProperty(key, value);
}
/// <summary>
/// Get a local property set in this thread, or null if it is missing. See
/// [[org.apache.spark.api.java.JavaSparkContext.setLocalProperty]].
/// </summary>
/// <param name="key"></param>
/// <returns></returns>
public string GetLocalProperty(string key)
{
return SparkContextProxy.GetLocalProperty(key);
}
/// <summary>
/// Control our logLevel. This overrides any user-defined log settings.
/// @param logLevel The desired log level as a string.
/// Valid log levels include: ALL, DEBUG, ERROR, FATAL, INFO, OFF, TRACE, WARN
/// </summary>
/// <param name="logLevel"></param>
public void SetLogLevel(string logLevel)
{
SparkContextProxy.SetLogLevel(logLevel);
}
/// <summary>
/// Cancel active jobs for the specified group. See L{SparkContext.setJobGroup} for more information.
/// </summary>
/// <param name="groupId"></param>
public void CancelJobGroup(string groupId)
{
SparkContextProxy.CancelJobGroup(groupId);
}
/// <summary>
/// Cancel all jobs that have been scheduled or are running.
/// </summary>
public void CancelAllJobs()
{
SparkContextProxy.CancelAllJobs();
}
internal static byte[] BuildCommand(object func, SerializedMode deserializerMode = SerializedMode.Byte, SerializedMode serializerMode = SerializedMode.Byte)
{
var formatter = new BinaryFormatter();
var stream = new MemoryStream();
formatter.Serialize(stream, func);
List<byte[]> commandPayloadBytesList = new List<byte[]>();
// add deserializer mode
var modeBytes = Encoding.UTF8.GetBytes(deserializerMode.ToString());
var length = modeBytes.Length;
var lengthAsBytes = BitConverter.GetBytes(length);
Array.Reverse(lengthAsBytes);
commandPayloadBytesList.Add(lengthAsBytes);
commandPayloadBytesList.Add(modeBytes);
// add serializer mode
modeBytes = Encoding.UTF8.GetBytes(serializerMode.ToString());
length = modeBytes.Length;
lengthAsBytes = BitConverter.GetBytes(length);
Array.Reverse(lengthAsBytes);
commandPayloadBytesList.Add(lengthAsBytes);
commandPayloadBytesList.Add(modeBytes);
// add func
var funcBytes = stream.ToArray();
var funcBytesLengthAsBytes = BitConverter.GetBytes(funcBytes.Length);
Array.Reverse(funcBytesLengthAsBytes);
commandPayloadBytesList.Add(funcBytesLengthAsBytes);
commandPayloadBytesList.Add(funcBytes);
return commandPayloadBytesList.SelectMany(byteArray => byteArray).ToArray();
}
}
}
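To tie the members above together, here is a hedged driver-side sketch (not part of this commit); it assumes `using System.Linq;` and a running JVM backend reachable through the configured proxies, with placeholder paths.
```
// sketch only: a typical driver lifecycle using the members defined above
var conf = new SparkConf().SetAppName("SparkCLR sample");  // spark.master falls back to "local" in debug mode
var sc = new SparkContext(conf);

var lines = sc.TextFile(@"hdfs://path/to/data.txt");       // RDD<string>
var numbers = sc.Parallelize(Enumerable.Range(0, 100), 4); // RDD<int> from a local collection

var threshold = sc.Broadcast(42);                          // read-only value shipped once per node
var counter = sc.Accumulator(0);                           // counter the driver can read back

sc.SetLogLevel("WARN");
sc.SetJobGroup("sample-jobs", "jobs started by this sample");

// ... run transformations and actions on the RDDs ...

sc.Stop();                                                 // also shuts down the accumulator server
```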

View file

@ -0,0 +1,147 @@
// Copyright (c) Microsoft. All rights reserved.
// Licensed under the MIT license. See LICENSE file in the project root for full license information.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
namespace Microsoft.Spark.CSharp.Core
{
[Serializable]
public class StatCounter
{
private long n = 0; // Running count of our values
private double mu = 0; // Running mean of our values
private double m2 = 0; // Running variance numerator (sum of (x - mean)^2)
private double maxValue = double.MinValue; // Running max of our values
private double minValue = double.MaxValue; // Running min of our values
public StatCounter()
{ }
public StatCounter(IEnumerable<double> values)
{
this.Merge(values);
}
/// <summary>
/// Add a value into this StatCounter, updating the internal statistics.
/// </summary>
/// <param name="value"></param>
/// <returns></returns>
internal StatCounter Merge(double value)
{
var delta = value - mu;
n += 1;
mu += delta / n;
m2 += delta * (value - mu);
maxValue = Math.Max(maxValue, value);
minValue = Math.Min(minValue, value);
return this;
}
/// <summary>
/// Add multiple values into this StatCounter, updating the internal statistics.
/// </summary>
/// <param name="values"></param>
/// <returns></returns>
internal StatCounter Merge(IEnumerable<double> values)
{
foreach (var value in values)
Merge(value);
return this;
}
/// <summary>
/// Merge another StatCounter into this one, adding up the internal statistics.
/// </summary>
/// <param name="other"></param>
/// <returns></returns>
internal StatCounter Merge(StatCounter other)
{
if (other == this)
{
return Merge(other.copy()); // Avoid overwriting fields in a weird order
}
else
{
if (n == 0)
{
mu = other.mu;
m2 = other.m2;
n = other.n;
maxValue = other.maxValue;
minValue = other.minValue;
}
else if (other.n != 0)
{
var delta = other.mu - mu;
if (other.n * 10 < n)
{
mu = mu + (delta * other.n) / (n + other.n);
}
else if (n * 10 < other.n)
{
mu = other.mu - (delta * n) / (n + other.n);
}
else
{
mu = (mu * n + other.mu * other.n) / (n + other.n);
}
m2 += other.m2 + (delta * delta * n * other.n) / (n + other.n);
n += other.n;
maxValue = Math.Max(maxValue, other.maxValue);
minValue = Math.Min(minValue, other.minValue);
}
return this;
}
}
/// <summary>
/// Clone this StatCounter
/// </summary>
/// <returns></returns>
internal StatCounter copy()
{
var other = new StatCounter();
other.n = n;
other.mu = mu;
other.m2 = m2;
other.maxValue = maxValue;
other.minValue = minValue;
return other;
}
public long Count { get { return n; } }
public double Mean { get { return mu; } }
public double Sum { get { return n * mu; } }
public double Max { get { return maxValue; } }
public double Min { get { return minValue; } }
/// <summary>
/// Return the variance of the values.
/// </summary>
public double Variance { get { return n == 0 ? double.NaN : m2 / n; } }
/// <summary>
/// Return the sample variance, which corrects for bias in estimating the variance by dividing by N-1 instead of N.
/// </summary>
public double SampleVariance { get { return n <= 1 ? double.NaN : m2 / (n - 1); } }
/// <summary>
/// Return the standard deviation of the values.
/// </summary>
public double Stdev { get { return Math.Sqrt(Variance); } }
/// <summary>
/// Return the sample standard deviation of the values, which corrects for bias in estimating the variance by dividing by N-1 instead of N.
/// </summary>
public double SampleStdev { get { return Math.Sqrt(SampleVariance); } }
public override string ToString()
{
return string.Format("(count: {0}, mean: {1}, stdev: {2}, max: {3}, min: {4})", Count, Mean, Stdev, Max, Min);
}
}
}
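A small worked example (not part of this commit) of the running statistics above; the expected values in the comments follow directly from the Merge formulas (for the values 1, 2, 3, 4 the accumulated m2 is 5).
```
// sketch only: feeding a few values through StatCounter and reading the statistics back
var stats = new StatCounter(new double[] { 1.0, 2.0, 3.0, 4.0 });

Console.WriteLine(stats.Count);          // 4
Console.WriteLine(stats.Mean);           // 2.5
Console.WriteLine(stats.Sum);            // 10 (n * mean)
Console.WriteLine(stats.Variance);       // m2 / n       = 5 / 4 = 1.25
Console.WriteLine(stats.SampleVariance); // m2 / (n - 1) = 5 / 3 ≈ 1.6667
Console.WriteLine(stats.Stdev);          // sqrt(1.25)   ≈ 1.118
Console.WriteLine(stats);                // (count: 4, mean: 2.5, stdev: ..., max: 4, min: 1)
```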

View file

@ -0,0 +1,128 @@
// Copyright (c) Microsoft. All rights reserved.
// Licensed under the MIT license. See LICENSE file in the project root for full license information.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using Microsoft.Spark.CSharp.Proxy;
namespace Microsoft.Spark.CSharp.Core
{
public class StatusTracker
{
private IStatusTrackerProxy statusTrackerProxy;
internal StatusTracker(IStatusTrackerProxy statusTrackerProxy)
{
this.statusTrackerProxy = statusTrackerProxy;
}
/// <summary>
/// Return a list of all known jobs in a particular job group. If
/// `jobGroup` is None, then returns all known jobs that are not
/// associated with a job group.
///
/// The returned list may contain running, failed, and completed jobs,
/// and may vary across invocations of this method. This method does
/// not guarantee the order of the elements in its result.
/// </summary>
/// <param name="jobGroup"></param>
/// <returns></returns>
public int[] GetJobIdsForGroup(string jobGroup)
{
return statusTrackerProxy.GetJobIdsForGroup(jobGroup);
}
/// <summary>
/// Returns an array containing the ids of all active stages.
/// </summary>
/// <returns></returns>
public int[] GetActiveStageIds()
{
return statusTrackerProxy.GetActiveStageIds();
}
/// <summary>
/// Returns an array containing the ids of all active jobs.
/// </summary>
/// <returns></returns>
public int[] GetActiveJobsIds()
{
return statusTrackerProxy.GetActiveJobsIds();
}
/// <summary>
/// Returns a :class:`SparkJobInfo` object, or None if the job info
/// could not be found or was garbage collected.
/// </summary>
/// <param name="jobId"></param>
/// <returns></returns>
public SparkJobInfo GetJobInfo(int jobId)
{
return statusTrackerProxy.GetJobInfo(jobId);
}
/// <summary>
/// Returns a :class:`SparkStageInfo` object, or None if the stage
/// info could not be found or was garbage collected.
/// </summary>
/// <param name="stageId"></param>
/// <returns></returns>
public SparkStageInfo GetStageInfo(int stageId)
{
return statusTrackerProxy.GetStageInfo(stageId);
}
}
public class SparkJobInfo
{
readonly int jobId;
readonly int[] stageIds;
readonly string status;
public SparkJobInfo(int jobId, int[] stageIds, string status)
{
this.jobId = jobId;
this.stageIds = stageIds;
this.status = status;
}
public int JobId { get { return jobId; } }
public int[] StageIds { get { return stageIds; } }
public string Status { get { return status; } }
}
public class SparkStageInfo
{
readonly int stageId;
readonly int currentAttemptId;
readonly long submissionTime;
readonly string name;
readonly int numTasks;
readonly int numActiveTasks;
readonly int numCompletedTasks;
readonly int numFailedTasks;
public SparkStageInfo(int stageId, int currentAttemptId, long submissionTime, string name, int numTasks, int numActiveTasks, int numCompletedTasks, int numFailedTasks)
{
this.stageId = stageId;
this.currentAttemptId = currentAttemptId;
this.submissionTime = submissionTime;
this.name = name;
this.numTasks = numTasks;
this.numActiveTasks = numActiveTasks;
this.numCompletedTasks = numCompletedTasks;
this.numFailedTasks = numFailedTasks;
}
public int StageId { get { return stageId; } }
public int CurrentAttemptId { get { return currentAttemptId; } }
public long SubmissionTime { get { return submissionTime; } }
public string Name { get { return name; } }
public int NumTasks { get { return numTasks; } }
public int NumActiveTasks { get { return numActiveTasks; } }
public int NumCompletedTasks { get { return numCompletedTasks; } }
public int NumFailedTasks { get { return numFailedTasks; } }
}
}
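A usage sketch (not part of this commit) for the progress API above; it assumes an existing SparkContext `sc` with jobs in flight and `using System;`.
```
// sketch only: polling job and stage progress from the driver
var tracker = sc.StatusTracker;

foreach (var jobId in tracker.GetActiveJobsIds())
{
    SparkJobInfo job = tracker.GetJobInfo(jobId);
    if (job == null) continue; // job info may already have been garbage collected

    Console.WriteLine("job {0}: status={1}", job.JobId, job.Status);
    foreach (var stageId in job.StageIds)
    {
        SparkStageInfo stage = tracker.GetStageInfo(stageId);
        if (stage == null) continue;
        Console.WriteLine("  stage {0} '{1}': {2}/{3} tasks completed",
            stage.StageId, stage.Name, stage.NumCompletedTasks, stage.NumTasks);
    }
}
```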

View file

@ -0,0 +1,69 @@
// Copyright (c) Microsoft. All rights reserved.
// Licensed under the MIT license. See LICENSE file in the project root for full license information.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
namespace Microsoft.Spark.CSharp.Core
{
public enum StorageLevelType
{
NONE,
DISK_ONLY,
DISK_ONLY_2,
MEMORY_ONLY,
MEMORY_ONLY_2,
MEMORY_ONLY_SER,
MEMORY_ONLY_SER_2,
MEMORY_AND_DISK,
MEMORY_AND_DISK_2,
MEMORY_AND_DISK_SER,
MEMORY_AND_DISK_SER_2,
OFF_HEAP
}
public class StorageLevel
{
internal static Dictionary<StorageLevelType, StorageLevel> storageLevel = new Dictionary<StorageLevelType, StorageLevel>
{
{StorageLevelType.NONE, new StorageLevel(false, false, false, false, 1)},
{StorageLevelType.DISK_ONLY, new StorageLevel(true, false, false, false, 1)},
{StorageLevelType.DISK_ONLY_2, new StorageLevel(true, false, false, false, 2)},
{StorageLevelType.MEMORY_ONLY, new StorageLevel(false, true, false, true, 1)},
{StorageLevelType.MEMORY_ONLY_2, new StorageLevel(false, true, false, true, 2)},
{StorageLevelType.MEMORY_ONLY_SER, new StorageLevel(false, true, false, false, 1)},
{StorageLevelType.MEMORY_ONLY_SER_2, new StorageLevel(false, true, false, false, 2)},
{StorageLevelType.MEMORY_AND_DISK, new StorageLevel(true, true, false, true, 1)},
{StorageLevelType.MEMORY_AND_DISK_2, new StorageLevel(true, true, false, true, 2)},
{StorageLevelType.MEMORY_AND_DISK_SER, new StorageLevel(true, true, false, false, 1)},
{StorageLevelType.MEMORY_AND_DISK_SER_2, new StorageLevel(true, true, false, false, 2)},
{StorageLevelType.OFF_HEAP, new StorageLevel(false, false, true, false, 1)},
};
internal bool useDisk;
internal bool useMemory;
internal bool useOffHeap;
internal bool deserialized;
internal int replication;
internal StorageLevel(bool useDisk, bool useMemory, bool useOffHeap, bool deserialized, int replication)
{
this.useDisk = useDisk;
this.useMemory = useMemory;
this.useOffHeap = useOffHeap;
this.deserialized = deserialized;
this.replication = replication;
}
public override string ToString()
{
return string.Format("{0}{1}{2}{3}{4} Replicated",
useDisk ? "Disk " : null,
useMemory ? "Memory " : null,
useOffHeap ? "Tachyon " : null,
deserialized ? "Deserialized " : "Serialized ",
replication);
}
}
}

View file

@ -0,0 +1,24 @@
// Copyright (c) Microsoft. All rights reserved.
// Licensed under the MIT license. See LICENSE file in the project root for full license information.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
namespace Microsoft.Spark.CSharp.Interop.Ipc
{
/// <summary>
/// Behavior of the bridge used for the IPC interop between JVM & CLR
/// </summary>
internal interface IJvmBridge : IDisposable
{
void Initialize(int portNo);
JvmObjectReference CallConstructor(string className, params object[] parameters);
object CallStaticJavaMethod(string className, string methodName, params object[] parameters);
object CallStaticJavaMethod(string className, string methodName);
object CallNonStaticJavaMethod(JvmObjectReference objectId, string methodName, params object[] parameters);
object CallNonStaticJavaMethod(JvmObjectReference objectId, string methodName);
}
}

View file

@ -0,0 +1,35 @@
// Copyright (c) Microsoft. All rights reserved.
// Licensed under the MIT license. See LICENSE file in the project root for full license information.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net;
using System.Text;
using System.Threading.Tasks;
namespace Microsoft.Spark.CSharp.Interop.Ipc
{
/// <summary>
/// Defines the behavior of the socket implementation in SparkCLR
/// </summary>
public interface ISparkCLRSocket : IDisposable
{
void Initialize(int portNumber);
void Write(byte[] value);
void Write(int value);
void Write(long value);
void Write(string value);
byte[] ReadBytes(int length);
char ReadChar();
int ReadInt();
long ReadLong();
string ReadString();
string ReadString(int length);
double ReadDouble();
bool ReadBoolean();
IDisposable InitializeStream();
void Flush();
}
}

View file

@ -0,0 +1,144 @@
// Copyright (c) Microsoft. All rights reserved.
// Licensed under the MIT license. See LICENSE file in the project root for full license information.
using System;
using System.Collections.Generic;
using Microsoft.Spark.CSharp.Services;
namespace Microsoft.Spark.CSharp.Interop.Ipc
{
/// <summary>
/// Implementation of IPC bridge between JVM & CLR
/// </summary>
internal class JvmBridge : IJvmBridge
{
private ISparkCLRSocket socket;
private ILoggerService logger = LoggerServiceFactory.GetLogger(typeof (JvmBridge));
public void Initialize(int portNumber)
{
socket = new SparkCLRSocket();
socket.Initialize(portNumber);
}
public JvmObjectReference CallConstructor(string className, params object[] parameters)
{
return new JvmObjectReference(CallJavaMethod(true, className, "<init>", parameters).ToString());
}
public object CallStaticJavaMethod(string className, string methodName, params object[] parameters)
{
return CallJavaMethod(true, className, methodName, parameters);
}
public object CallStaticJavaMethod(string className, string methodName)
{
return CallJavaMethod(true, className, methodName, new object[] {});
}
public object CallNonStaticJavaMethod(JvmObjectReference objectId, string methodName, params object[] parameters)
{
return CallJavaMethod(false, objectId, methodName, parameters);
}
public object CallNonStaticJavaMethod(JvmObjectReference objectId, string methodName)
{
return CallJavaMethod(false, objectId, methodName, new object[] {});
}
private object CallJavaMethod(bool isStatic, object classNameOrJvmObjectReference, string methodName, params object[] parameters)
{
object returnValue = null;
try
{
var overallPayload = PayloadHelper.BuildPayload(isStatic, classNameOrJvmObjectReference, methodName, parameters);
using (socket.InitializeStream())
{
socket.Write(overallPayload);
var isMethodCallFailed = socket.ReadInt();
//TODO - add boolean instead of int in the backend
if (isMethodCallFailed != 0)
{
throw new Exception("Method execution failed"); //TODO - add more info to the exception
}
var typeAsChar = socket.ReadChar();
switch (typeAsChar) //TODO - add support for other types
{
case 'n':
break;
case 'j':
returnValue = socket.ReadString();
break;
case 'c':
returnValue = socket.ReadString();
break;
case 'i':
returnValue = socket.ReadInt();
break;
case 'd':
returnValue = socket.ReadDouble();
break;
case 'b':
returnValue = socket.ReadBoolean();
break;
case 'l':
returnValue = ReadJvmObjectReferenceCollection();
break;
default:
throw new NotSupportedException(string.Format("Identifier for type {0} not supported", typeAsChar));
}
}
}
catch (Exception e)
{
logger.LogException(e);
throw;
}
return returnValue;
}
private object ReadJvmObjectReferenceCollection()
{
object returnValue;
var listItemTypeAsChar = socket.ReadChar();
switch (listItemTypeAsChar)
{
case 'j':
var jvmObjectReferenceList = new List<JvmObjectReference>();
var numOfItemsInList = socket.ReadInt();
for (int itemIndex = 0; itemIndex < numOfItemsInList; itemIndex++)
{
var itemIdentifier = socket.ReadString();
jvmObjectReferenceList.Add(new JvmObjectReference(itemIdentifier));
}
returnValue = jvmObjectReferenceList;
break;
default:
throw new NotSupportedException(
string.Format("Identifier for list item type {0} not supported",
listItemTypeAsChar));
}
return returnValue;
}
public void Dispose()
{
socket.Dispose();
}
}
}

View file

@ -0,0 +1,35 @@
// Copyright (c) Microsoft. All rights reserved.
// Licensed under the MIT license. See LICENSE file in the project root for full license information.
using System;
namespace Microsoft.Spark.CSharp.Interop.Ipc
{
/// <summary>
/// Reference to object created in JVM
/// </summary>
[Serializable]
internal class JvmObjectReference
{
public string Id { get; private set; }
private DateTime creationTime;
public JvmObjectReference(string jvmReferenceId)
{
Id = jvmReferenceId;
creationTime = DateTime.UtcNow;
}
public override string ToString()
{
return Id;
}
public string GetDebugInfo()
{
var javaObjectReferenceForClassObject = new JvmObjectReference(SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(this, "getClass").ToString());
var className = SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(javaObjectReferenceForClassObject, "getName").ToString();
return string.Format("Java object reference id={0}, type name={1}, creation time (UTC)={2}", Id, className, creationTime.ToString("o"));
}
}
}

View file

@ -0,0 +1,219 @@
// Copyright (c) Microsoft. All rights reserved.
// Licensed under the MIT license. See LICENSE file in the project root for full license information.
using System;
using System.Collections.Generic;
using System.Linq;
namespace Microsoft.Spark.CSharp.Interop.Ipc
{
/// <summary>
/// Help build the IPC payload for JVM calls from CLR
/// </summary>
internal class PayloadHelper
{
internal static byte[] BuildPayload(bool isStaticMethod, object classNameOrJvmObjectReference, string methodName, object[] parameters)
{
var isStaticMethodAsBytes = SerDe.ToBytes(isStaticMethod);
var objectOrClassIdBytes = ToPayloadBytes(classNameOrJvmObjectReference.ToString()); //class name or objectId sent as string
var methodNameBytes = ToPayloadBytes(methodName);
var parameterCountBytes = SerDe.ToBytes(parameters.Length);
var parametersBytes = ConvertParametersToBytes(parameters);
var payloadBytes = new List<byte[]>
{
isStaticMethodAsBytes,
objectOrClassIdBytes,
methodNameBytes,
parameterCountBytes,
parametersBytes
};
var payloadLength = GetPayloadLength(payloadBytes);
var payload = GetPayload(payloadBytes);
var payloadLengthBytes = SerDe.ToBytes(payloadLength);
var headerAndPayloadBytes = new byte[payloadLengthBytes.Length + payload.Length];
Array.Copy(payloadLengthBytes, 0, headerAndPayloadBytes, 0, payloadLengthBytes.Length);
Array.Copy(payload, 0, headerAndPayloadBytes, payloadLengthBytes.Length, payload.Length);
return headerAndPayloadBytes;
}
internal static byte[] ToPayloadBytes(string value)
{
var inputAsBytes = SerDe.ToBytes(value);
var lengthOfInputBytes = inputAsBytes.Length;
var byteRepresentationofInputLength = SerDe.ToBytes(lengthOfInputBytes);
var sendPayloadBytes = new byte[byteRepresentationofInputLength.Length + lengthOfInputBytes];
Array.Copy(byteRepresentationofInputLength, 0, sendPayloadBytes, 0, byteRepresentationofInputLength.Length);
Array.Copy(inputAsBytes, 0, sendPayloadBytes, byteRepresentationofInputLength.Length, inputAsBytes.Length);
return sendPayloadBytes;
}
internal static int GetPayloadLength(List<byte[]> payloadBytesList)
{
return payloadBytesList.Sum(payloadBytes => payloadBytes.Length);
}
internal static byte[] GetPayload(List<byte[]> payloadBytesList)
{
return payloadBytesList.SelectMany(byteArray => byteArray).ToArray();
}
internal static byte[] ConvertParametersToBytes(object[] parameters)
{
var parametersBytes = new List<byte[]>();
foreach (var parameter in parameters)
{
if (parameter != null)
{
parametersBytes.Add(GetTypeId(parameter.GetType()));
if (parameter is int)
{
parametersBytes.Add(SerDe.ToBytes((int)parameter));
}
else if (parameter is long)
{
parametersBytes.Add(SerDe.ToBytes((long)parameter));
}
else if (parameter is string)
{
parametersBytes.Add(ToPayloadBytes(parameter.ToString()));
}
else if (parameter is bool)
{
parametersBytes.Add(SerDe.ToBytes((bool)parameter));
}
else if (parameter is double)
{
parametersBytes.Add(SerDe.ToBytes((double)parameter));
}
else if (parameter is byte[])
{
parametersBytes.Add(SerDe.ToBytes(((byte[])parameter).Length));
parametersBytes.Add((byte[])parameter);
}
else if (parameter is int[])
{
parametersBytes.Add(GetTypeId(typeof(int)));
parametersBytes.Add(SerDe.ToBytes(((int[])parameter).Length));
parametersBytes.AddRange(((int[])parameter).Select(x => SerDe.ToBytes(x)));
}
else if (parameter is long[])
{
parametersBytes.Add(GetTypeId(typeof(long)));
parametersBytes.Add(SerDe.ToBytes(((long[])parameter).Length));
parametersBytes.AddRange(((long[])parameter).Select(x => SerDe.ToBytes(x)));
}
else if (parameter is double[])
{
parametersBytes.Add(GetTypeId(typeof(double)));
parametersBytes.Add(SerDe.ToBytes(((double[])parameter).Length));
parametersBytes.AddRange(((double[])parameter).Select(x => SerDe.ToBytes(x)));
}
else if (parameter is IEnumerable<byte[]>)
{
parametersBytes.Add(GetTypeId(typeof(byte[])));
parametersBytes.Add(SerDe.ToBytes(((IEnumerable<byte[]>)parameter).Count())); //TODO - Count() will traverse the collection - change interface?
foreach (var byteArray in (IEnumerable<byte[]>)parameter)
{
parametersBytes.Add(SerDe.ToBytes(byteArray.Length));
parametersBytes.Add(byteArray);
}
}
else if (parameter is IEnumerable<string>)
{
parametersBytes.Add(GetTypeId(typeof(string)));
parametersBytes.Add(SerDe.ToBytes(((IEnumerable<string>)parameter).Count())); //TODO - Count() will traverse the collection - change interface?
parametersBytes.AddRange(from stringVal in (IEnumerable<string>)parameter select ToPayloadBytes(stringVal));
}
else if (parameter is IEnumerable<JvmObjectReference>)
{
parametersBytes.Add(GetTypeId(typeof(JvmObjectReference)));
parametersBytes.Add(SerDe.ToBytes(((IEnumerable<JvmObjectReference>)parameter).Count())); //TODO - Count() will traverse the collection - change interface?
parametersBytes.AddRange(from jObj in (IEnumerable<JvmObjectReference>)parameter select ToPayloadBytes(jObj.Id));
}
else if (parameter is JvmObjectReference)
{
parametersBytes.Add(ToPayloadBytes((parameter as JvmObjectReference).Id));
}
else
{
throw new NotSupportedException(string.Format("Type {0} is not supported", parameter.GetType()));
}
}
else
{
parametersBytes.Add(new [] { Convert.ToByte('n') }); //'n' marks a null parameter
}
}
return parametersBytes.SelectMany(byteArray => byteArray).ToArray();
}
internal static byte[] GetTypeId(Type type) //TODO - support other types
{
if (type == typeof(int))
{
return new [] { Convert.ToByte('i') };
}
if (type == typeof(long))
{
return new[] { Convert.ToByte('g') };
}
if (type == typeof(string))
{
return new [] { Convert.ToByte('c') };
}
if (type == typeof(bool))
{
return new [] { Convert.ToByte('b') };
}
if (type == typeof(double))
{
return new[] { Convert.ToByte('d') };
}
if (type == typeof(JvmObjectReference))
{
return new [] { Convert.ToByte('j') };
}
if (type == typeof(byte[]))
{
return new [] { Convert.ToByte('r') };
}
if (type == typeof(int[]) || type == typeof(long[]) || type == typeof(double[]))
{
return new[] { Convert.ToByte('l') };
}
if (typeof(IEnumerable<byte[]>).IsAssignableFrom(type))
{
return new [] { Convert.ToByte('l') };
}
if (typeof(IEnumerable<string>).IsAssignableFrom(type))
{
return new [] { Convert.ToByte('l') };
}
if (typeof(IEnumerable<JvmObjectReference>).IsAssignableFrom(type))
{
return new [] { Convert.ToByte('l') };
}
throw new NotSupportedException(string.Format("Type {0} not supported yet", type));
}
}
}
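The framing above is easiest to see with a concrete call. The sketch below is an illustration only; it compiles from the Adapter assembly itself (or a friend test assembly, since PayloadHelper is internal), and the class and method names passed in are arbitrary examples. It prints the frame header: a 4-byte big-endian total length, followed by a one-byte static flag, the length-prefixed class name or object id, the length-prefixed method name, a 4-byte parameter count and the typed parameters.
```
using System;
using Microsoft.Spark.CSharp.Interop.Ipc;

class PayloadSketch
{
    static void Main()
    {
        // Frame a parameterless static call; the strings here are arbitrary examples.
        byte[] framed = PayloadHelper.BuildPayload(
            true,                          // static method call
            "org.apache.spark.SparkConf",  // class name (an object id would be sent the same way)
            "toString",
            new object[] { });

        // The first four bytes are the big-endian length of everything that follows.
        Console.WriteLine(BitConverter.ToString(framed, 0, 4));
        Console.WriteLine("total frame size: " + framed.Length);
    }
}
```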

Просмотреть файл

@ -0,0 +1,80 @@
// Copyright (c) Microsoft. All rights reserved.
// Licensed under the MIT license. See LICENSE file in the project root for full license information.
using System;
using System.Text;
namespace Microsoft.Spark.CSharp.Interop.Ipc
{
/// <summary>
/// Serialization and Deserialization of data types between JVM & CLR
/// </summary>
public class SerDe //TODO - add ToBytes() for other types
{
public static byte[] ToBytes(bool value)
{
return new[] { System.Convert.ToByte(value) };
}
public static byte[] ToBytes(string value)
{
return Encoding.UTF8.GetBytes(value);
}
public static byte[] ToBytes(int value)
{
var bytes = BitConverter.GetBytes(value);
Array.Reverse(bytes); //JVM side expects big-endian (network byte order)
return bytes;
}
public static byte[] ToBytes(long value)
{
var bytes = BitConverter.GetBytes(value);
Array.Reverse(bytes); //JVM side expects big-endian (network byte order)
return bytes;
}
public static byte[] ToBytes(double value)
{
var bytes = BitConverter.GetBytes(value);
Array.Reverse(bytes); //JVM side expects big-endian (network byte order)
return bytes;
}
public static char ToChar(byte value)
{
return System.Convert.ToChar(value);
}
public static string ToString(byte[] value)
{
return Encoding.UTF8.GetString(value);
}
public static int ToInt(byte[] value)
{
return BitConverter.ToInt32(value, 0);
}
public static int Convert(int value)
{
var buffer = BitConverter.GetBytes(value);
Array.Reverse(buffer); //Netty byte order is BigEndian
return BitConverter.ToInt32(buffer, 0);
}
public static long Convert(long value)
{
var buffer = BitConverter.GetBytes(value);
Array.Reverse(buffer); //Netty byte order is BigEndian
return BitConverter.ToInt64(buffer, 0);
}
public static double Convert(double value)
{
var buffer = BitConverter.GetBytes(value);
Array.Reverse(buffer); //Netty byte order is BigEndian
return BitConverter.ToDouble(buffer, 0);
}
}
}
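A short usage sketch of the helpers above (SerDe is public, so this compiles against the adapter assembly as-is): Convert reverses byte order, so applying it twice restores the original value, and ToBytes/ToString round-trip UTF-8 strings.
```
using System;
using Microsoft.Spark.CSharp.Interop.Ipc;

class SerDeSketch
{
    static void Main()
    {
        int original = 1234;
        int byteReversed = SerDe.Convert(original); // big-endian representation on a little-endian host
        Console.WriteLine(SerDe.Convert(byteReversed) == original); // True - reversing twice restores the value

        byte[] utf8 = SerDe.ToBytes("spark");
        Console.WriteLine(SerDe.ToString(utf8)); // spark
    }
}
```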

Просмотреть файл

@ -0,0 +1,158 @@
// Copyright (c) Microsoft. All rights reserved.
// Licensed under the MIT license. See LICENSE file in the project root for full license information.
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Net;
using System.Net.Sockets;
using Microsoft.Spark.CSharp.Services;
namespace Microsoft.Spark.CSharp.Interop.Ipc
{
/// <summary>
/// Wraps the socket implementation used by SparkCLR
/// </summary>
public class SparkCLRSocket : ISparkCLRSocket
{
private Socket socket;
private SparkCLRSocketStream stream;
public void Initialize(int portNumber)
{
socket = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);
var addresses = Dns.GetHostAddresses("localhost");
socket.Connect(new IPEndPoint(addresses.First(a => a.AddressFamily == AddressFamily.InterNetwork) /*get IPv4 address*/, portNumber));
}
public IDisposable InitializeStream()
{
stream = new SparkCLRSocketStream(socket);
return stream;
}
public void Write(byte[] value)
{
stream.Writer.Write(value);
}
public void Write(int value)
{
stream.Writer.Write(SerDe.Convert(value));
}
public void Write(long value)
{
stream.Writer.Write(SerDe.Convert(value));
}
public void Write(string value)
{
byte[] buffer = SerDe.ToBytes(value);
Write(buffer.Length);
Write(buffer);
}
public byte[] ReadBytes(int length)
{
return stream.Reader.ReadBytes(length);
}
public char ReadChar()
{
return SerDe.ToChar(stream.Reader.ReadByte());
}
public int ReadInt()
{
return SerDe.Convert(stream.Reader.ReadInt32());
}
public long ReadLong()
{
byte[] buffer = stream.Reader.ReadBytes(8);
Array.Reverse(buffer);
return BitConverter.ToInt64(buffer, 0);
}
public string ReadString()
{
var length = SerDe.Convert(stream.Reader.ReadInt32());
var stringAsBytes = stream.Reader.ReadBytes(length);
return SerDe.ToString(stringAsBytes);
}
public string ReadString(int length)
{
var stringAsBytes = stream.Reader.ReadBytes(length);
return SerDe.ToString(stringAsBytes);
}
public double ReadDouble()
{
return SerDe.Convert(stream.Reader.ReadDouble());
}
public bool ReadBoolean()
{
return stream.Reader.ReadBoolean();
}
public void Dispose()
{
socket.Dispose();
}
public void Flush()
{
stream.Stream.Flush();
}
private class SparkCLRSocketStream : IDisposable
{
internal readonly BinaryReader Reader;
internal readonly BinaryWriter Writer;
internal readonly NetworkStream Stream;
private ILoggerService logger = LoggerServiceFactory.GetLogger(typeof(SparkCLRSocketStream));
internal SparkCLRSocketStream(Socket socket)
{
Stream = new NetworkStream(socket);
Reader = new BinaryReader(Stream);
Writer = new BinaryWriter(Stream);
}
public void Dispose()
{
try
{
Reader.Dispose();
}
catch (Exception e)
{
logger.LogException(e);
}
try
{
Writer.Dispose();
}
catch (Exception e)
{
logger.LogException(e);
}
try
{
Stream.Dispose();
}
catch (Exception e)
{
logger.LogException(e);
}
}
}
}
}
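To make the byte-order handling above concrete, here is a loopback sketch that is not part of the commit: a throwaway TcpListener stands in for the CSharpBackend, writes one big-endian int, and the SparkCLRSocket client reads it back with ReadInt(). The port number is an arbitrary assumption (the real port comes from the configuration service), and the byte reversal assumes a little-endian host, as the SerDe helpers do.
```
using System;
using System.Net;
using System.Net.Sockets;
using Microsoft.Spark.CSharp.Interop.Ipc;

class SocketSketch
{
    static void Main()
    {
        var listener = new TcpListener(IPAddress.Loopback, 5567); // arbitrary free port for the sketch
        listener.Start();

        var client = new SparkCLRSocket();
        client.Initialize(5567);

        using (var server = listener.AcceptTcpClient())
        using (var serverStream = server.GetStream())
        using (client.InitializeStream())
        {
            // Pretend to be the backend: send the int 42 in big-endian (network) order.
            var payload = BitConverter.GetBytes(42);
            Array.Reverse(payload);
            serverStream.Write(payload, 0, payload.Length);

            Console.WriteLine(client.ReadInt()); // 42
        }

        client.Dispose();
        listener.Stop();
    }
}
```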

Просмотреть файл

@ -0,0 +1,176 @@
// Copyright (c) Microsoft. All rights reserved.
// Licensed under the MIT license. See LICENSE file in the project root for full license information.
using System;
using System.Runtime.CompilerServices;
using Microsoft.Spark.CSharp.Configuration;
using Microsoft.Spark.CSharp.Interop.Ipc;
using Microsoft.Spark.CSharp.Proxy;
using Microsoft.Spark.CSharp.Proxy.Ipc;
[assembly: InternalsVisibleTo("AdapterTest")]
namespace Microsoft.Spark.CSharp.Interop
{
/// <summary>
/// Contains everything needed to set up an environment for using C# with Spark
/// </summary>
public class SparkCLREnvironment : IDisposable
{
internal IJvmBridge jvmBridge;
internal static IJvmBridge JvmBridge
{
get
{
return Environment.jvmBridge;
}
}
internal ISparkConfProxy sparkConfProxy;
internal static ISparkConfProxy SparkConfProxy
{
get
{
return Environment.sparkConfProxy;
}
}
internal ISparkContextProxy sparkContextProxy;
internal static ISparkContextProxy SparkContextProxy
{
get
{
return Environment.sparkContextProxy;
}
}
//internal IStreamingContextProxy streamingContextProxy;
//internal static IStreamingContextProxy StreamingContextProxy
//{
// get
// {
// return Environment.streamingContextProxy;
// }
//}
internal ISqlContextProxy sqlContextProxy;
internal static ISqlContextProxy SqlContextProxy
{
get
{
return Environment.sqlContextProxy;
}
}
internal IConfigurationService configurationService;
internal static IConfigurationService ConfigurationService
{
get
{
return Environment.configurationService;
}
set
{
Environment.configurationService = value;
}
}
protected static SparkCLREnvironment Environment = new SparkCLREnvironment();
protected SparkCLREnvironment() { }
/// <summary>
/// Initializes and returns the environment for SparkCLR execution
/// </summary>
/// <returns></returns>
public static SparkCLREnvironment Initialize()
{
Environment.InitializeEnvironment();
return Environment;
}
/// <summary>
/// Disposes the socket used in the JVM-CLR bridge
/// </summary>
public void Dispose()
{
jvmBridge.Dispose();
}
protected virtual void InitializeEnvironment()
{
var proxyFactory = new ProxyFactory();
configurationService = new ConfigurationService();
sparkConfProxy = proxyFactory.GetSparkConfProxy();
sparkContextProxy = proxyFactory.GetSparkContextProxy();
//streamingContextProxy = new StreamingContextIpcProxy();
sqlContextProxy = proxyFactory.GetSqlContextProxy();
jvmBridge = new JvmBridge();
InitializeJvmBridge();
}
private void InitializeJvmBridge()
{
int portNo = ConfigurationService.BackendPortNumber;
if (portNo == 0) //fail early
{
throw new Exception("Port number is not set");
}
Console.WriteLine("CSharpBackend port number to be used in JvMBridge is " + portNo);//TODO - send to logger
jvmBridge.Initialize(portNo);
}
private class ProxyFactory
{
private readonly InteropType interopType;
internal ProxyFactory(InteropType interopType = InteropType.IPC)
{
this.interopType = interopType;
}
internal ISparkConfProxy GetSparkConfProxy()
{
switch (interopType)
{
case InteropType.IPC:
return new SparkConfIpcProxy();
default:
throw new NotImplementedException();
}
}
internal ISparkContextProxy GetSparkContextProxy()
{
switch (interopType)
{
case InteropType.IPC:
return new SparkContextIpcProxy();
default:
throw new NotImplementedException();
}
}
internal ISqlContextProxy GetSqlContextProxy()
{
switch (interopType)
{
case InteropType.IPC:
return new SqlContextIpcProxy();
default:
throw new NotImplementedException();
}
}
}
public enum InteropType
{
IPC
}
}
}
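A minimal driver-side usage sketch, assuming a CSharpBackend JVM process is already listening on the port exposed through the configuration service; without it, Initialize() throws when the port is unset or when the bridge cannot connect.
```
using System;
using Microsoft.Spark.CSharp.Interop;

class DriverSketch
{
    static void Main()
    {
        // Initialize() wires up the IPC proxies and connects the JVM bridge;
        // disposing the environment closes the bridge socket.
        using (SparkCLREnvironment.Initialize())
        {
            Console.WriteLine("SparkCLR environment initialized");
            // Driver code that builds SparkConf/SparkContext/SqlContext would go here.
        }
    }
}
```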

Просмотреть файл

@ -0,0 +1,49 @@
// Copyright (c) Microsoft. All rights reserved.
// Licensed under the MIT license. See LICENSE file in the project root for full license information.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using Microsoft.Spark.CSharp.Sql;
namespace Microsoft.Spark.CSharp.Proxy
{
internal interface IDataFrameProxy
{
void RegisterTempTable(string tableName);
long Count();
string GetQueryExecution();
string GetExecutedPlan();
string GetShowString(int numberOfRows, bool truncate);
IStructTypeProxy GetSchema();
IRDDProxy ToJSON();
IRDDProxy ToRDD();
IColumnProxy GetColumn(string columnName);
object ToObjectSeq(List<object> objectList);
IColumnProxy ToColumnSeq(List<IColumnProxy> columnRefList);
IDataFrameProxy Select(IColumnProxy columnSequenceReference);
IDataFrameProxy Filter(string condition);
IGroupedDataProxy GroupBy(string firstColumnName, IColumnProxy otherColumnSequenceReference);
IGroupedDataProxy GroupBy(IColumnProxy columnSequenceReference);
IGroupedDataProxy GroupBy(object columnSequenceReference);
IDataFrameProxy Agg(IGroupedDataProxy scalaGroupedDataReference, Dictionary<string, string> columnNameAggFunctionDictionary);
IDataFrameProxy Join(IDataFrameProxy otherScalaDataFrameReference, string joinColumnName);
IDataFrameProxy Join(IDataFrameProxy otherScalaDataFrameReference, string[] joinColumnNames);
IDataFrameProxy Join(IDataFrameProxy otherScalaDataFrameReference, IColumnProxy scalaColumnReference, string joinType);
}
internal interface IColumnProxy
{
IColumnProxy EqualsOperator(IColumnProxy secondColumn);
IColumnProxy UnaryOp(string name);
IColumnProxy FuncOp(string name);
IColumnProxy BinOp(string name, object other);
}
internal interface IGroupedDataProxy
{
}
}

Просмотреть файл

@ -0,0 +1,49 @@
// Copyright (c) Microsoft. All rights reserved.
// Licensed under the MIT license. See LICENSE file in the project root for full license information.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using Microsoft.Spark.CSharp.Core;
namespace Microsoft.Spark.CSharp.Proxy
{
internal interface IRDDProxy
{
StorageLevel GetStorageLevel();
void Cache();
void Persist(StorageLevelType storageLevelType);
void Unpersist();
void Checkpoint();
bool IsCheckpointed { get; }
string GetCheckpointFile();
int GetNumPartitions();
IRDDProxy Sample(bool withReplacement, double fraction, long seed);
IRDDProxy[] RandomSplit(double[] weights, long seed);
IRDDProxy Union(IRDDProxy other);
IRDDProxy Intersection(IRDDProxy other);
IRDDProxy Cartesian(IRDDProxy other);
IRDDProxy Pipe(string command);
IRDDProxy Repartition(int numPartitions);
IRDDProxy Coalesce(int numPartitions, bool shuffle);
string Name { get; }
void SetName(string name);
IRDDProxy RandomSampleWithRange(double lb, double ub, long seed);
IRDDProxy SampleByKey(bool withReplacement, Dictionary<string, double> fractions, long seed);
IRDDProxy Zip(IRDDProxy other);
IRDDProxy ZipWithIndex();
IRDDProxy ZipWithUniqueId();
string ToDebugString();
void SaveAsNewAPIHadoopDataset(IEnumerable<KeyValuePair<string, string>> conf);
void SaveAsNewAPIHadoopFile(string path, string outputFormatClass, string keyClass, string valueClass, IEnumerable<KeyValuePair<string, string>> conf);
void SaveAsHadoopDataset(IEnumerable<KeyValuePair<string, string>> conf);
void saveAsHadoopFile(string path, string outputFormatClass, string keyClass, string valueClass, IEnumerable<KeyValuePair<string, string>> conf, string compressionCodecClass);
void SaveAsSequenceFile(string path, string compressionCodecClass);
void SaveAsTextFile(string path, string compressionCodecClass);
long Count();
int CollectAndServe();
int PartitionLength();
}
}

Просмотреть файл

@ -0,0 +1,23 @@
// Copyright (c) Microsoft. All rights reserved.
// Licensed under the MIT license. See LICENSE file in the project root for full license information.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
namespace Microsoft.Spark.CSharp.Proxy
{
internal interface ISparkConfProxy
{
void CreateSparkConf(bool loadDefaults = true);
void SetMaster(string master);
void SetAppName(string appName);
void SetSparkHome(string sparkHome);
void Set(string key, string value);
int GetInt(string key, int defaultValue);
string Get(string key, string defaultValue);
}
}
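One reason for this proxy layer is testability: SparkConf can be exercised against an in-memory stand-in instead of a live JVM. The stub below is hypothetical (it is not part of the commit); it could live in the AdapterTest project, which is granted access to these internal types by the InternalsVisibleTo("AdapterTest") attribute in SparkCLREnvironment.cs.
```
using System.Collections.Generic;
using Microsoft.Spark.CSharp.Proxy;

internal class InMemorySparkConfProxy : ISparkConfProxy
{
    // Settings live in a plain dictionary; the key names mirror the standard SparkConf keys.
    private readonly Dictionary<string, string> settings = new Dictionary<string, string>();

    public void CreateSparkConf(bool loadDefaults = true) { }
    public void SetMaster(string master) { settings["spark.master"] = master; }
    public void SetAppName(string appName) { settings["spark.app.name"] = appName; }
    public void SetSparkHome(string sparkHome) { settings["spark.home"] = sparkHome; }
    public void Set(string key, string value) { settings[key] = value; }

    public int GetInt(string key, int defaultValue)
    {
        string value;
        return settings.TryGetValue(key, out value) ? int.Parse(value) : defaultValue;
    }

    public string Get(string key, string defaultValue)
    {
        string value;
        return settings.TryGetValue(key, out value) ? value : defaultValue;
    }
}
```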

Просмотреть файл

@ -0,0 +1,57 @@
// Copyright (c) Microsoft. All rights reserved.
// Licensed under the MIT license. See LICENSE file in the project root for full license information.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using Microsoft.Spark.CSharp.Core;
using Microsoft.Spark.CSharp.Interop;
namespace Microsoft.Spark.CSharp.Proxy
{
internal interface ISparkContextProxy
{
void CreateSparkContext(string master, string appName, string sparkHome, ISparkConfProxy conf);
IColumnProxy CreateColumnFromName(string name);
IColumnProxy CreateFunction(string name, object self);
IColumnProxy CreateBinaryMathFunction(string name, object self, object other);
IColumnProxy CreateWindowFunction(string name);
void Accumulator(string host, int port);
void SetLogLevel(string logLevel);
string Version { get; }
long StartTime { get; }
int DefaultParallelism { get; }
int DefaultMinPartitions { get; }
void Stop();
IRDDProxy EmptyRDD<T>();
IRDDProxy Parallelize(IEnumerable<byte[]> values, int? numSlices);
IRDDProxy TextFile(string filePath, int minPartitions);
IRDDProxy WholeTextFiles(string filePath, int minPartitions);
IRDDProxy BinaryFiles(string filePath, int minPartitions);
IRDDProxy SequenceFile(string filePath, string keyClass, string valueClass, string keyConverterClass, string valueConverterClass, int minSplits, int batchSize);
IRDDProxy NewAPIHadoopFile(string filePath, string inputFormatClass, string keyClass, string valueClass, string keyConverterClass, string valueConverterClass, IEnumerable<KeyValuePair<string, string>> conf, int batchSize);
IRDDProxy NewAPIHadoopRDD(string inputFormatClass, string keyClass, string valueClass, string keyConverterClass, string valueConverterClass, IEnumerable<KeyValuePair<string, string>> conf, int batchSize);
IRDDProxy HadoopFile(string filePath, string inputFormatClass, string keyClass, string valueClass, string keyConverterClass, string valueConverterClass, IEnumerable<KeyValuePair<string, string>> conf, int batchSize);
IRDDProxy HadoopRDD(string inputFormatClass, string keyClass, string valueClass, string keyConverterClass, string valueConverterClass, IEnumerable<KeyValuePair<string, string>> conf, int batchSize);
IRDDProxy CheckpointFile(string filePath);
IRDDProxy Union<T>(IEnumerable<RDD<T>> rdds);
void AddFile(string path);
void SetCheckpointDir(string directory);
void SetJobGroup(string groupId, string description, bool interruptOnCancel);
void SetLocalProperty(string key, string value);
string GetLocalProperty(string key);
string SparkUser { get; }
void CancelJobGroup(string groupId);
void CancelAllJobs();
IStatusTrackerProxy StatusTracker { get; }
int RunJob(IRDDProxy rdd, IEnumerable<int> partitions, bool allowLocal);
string ReadBroadcastFromFile(string path, out long broadcastId);
void UnpersistBroadcast(string broadcastObjId, bool blocking);
IRDDProxy CreateCSharpRdd(IRDDProxy prevJavaRddReference, byte[] command, Dictionary<string, string> environmentVariables, List<string> pythonIncludes, bool preservePartitioning, List<Broadcast> broadcastVariables, List<byte[]> accumulator);
IRDDProxy CreatePairwiseRDD<K, V>(IRDDProxy javaReferenceInByteArrayRdd, int numPartitions);
IRDDProxy CreateUserDefinedCSharpFunction(string name, byte[] command, string returnType);
}
}

Просмотреть файл

@ -0,0 +1,25 @@
// Copyright (c) Microsoft. All rights reserved.
// Licensed under the MIT license. See LICENSE file in the project root for full license information.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using Microsoft.Spark.CSharp.Interop;
using Microsoft.Spark.CSharp.Sql;
namespace Microsoft.Spark.CSharp.Proxy
{
internal interface ISqlContextProxy
{
void CreateSqlContext(ISparkContextProxy sparkContextProxy);
StructField CreateStructField(string name, string dataType, bool isNullable);
StructType CreateStructType(List<StructField> fields);
IDataFrameProxy ReaDataFrame(string path, StructType schema, Dictionary<string, string> options);
IDataFrameProxy JsonFile(string path);
IDataFrameProxy TextFile(string path, StructType schema, string delimiter);
IDataFrameProxy TextFile(string path, string delimiter, bool hasHeader, bool inferSchema);
IDataFrameProxy Sql(string query);
}
}

Просмотреть файл

@ -0,0 +1,22 @@
// Copyright (c) Microsoft. All rights reserved.
// Licensed under the MIT license. See LICENSE file in the project root for full license information.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using Microsoft.Spark.CSharp.Core;
namespace Microsoft.Spark.CSharp.Proxy
{
internal interface IStatusTrackerProxy
{
int[] GetJobIdsForGroup(string jobGroup);
int[] GetActiveStageIds();
int[] GetActiveJobsIds();
SparkJobInfo GetJobInfo(int jobId);
SparkStageInfo GetStageInfo(int stageId);
}
}

Просмотреть файл

@ -0,0 +1,29 @@
// Copyright (c) Microsoft. All rights reserved.
// Licensed under the MIT license. See LICENSE file in the project root for full license information.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
namespace Microsoft.Spark.CSharp.Proxy
{
interface IStructTypeProxy
{
List<IStructFieldProxy> GetStructTypeFields();
}
interface IStructDataTypeProxy
{
string GetDataTypeString();
string GetDataTypeSimpleString();
}
interface IStructFieldProxy
{
string GetStructFieldName();
IStructDataTypeProxy GetStructFieldDataType();
bool GetStructFieldIsNullable();
}
}
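The three interfaces form a small hierarchy: a struct type exposes its fields, and each field exposes a name, a data type and a nullability flag. The hypothetical stubs below (again compilable only from the adapter assembly or its AdapterTest friend assembly) show the shape of a minimal implementation; real instances are produced by the IPC proxies that wrap JVM-side schema objects.
```
using System.Collections.Generic;
using Microsoft.Spark.CSharp.Proxy;

internal class StubDataType : IStructDataTypeProxy
{
    private readonly string typeString;
    private readonly string simpleString;
    internal StubDataType(string typeString, string simpleString)
    {
        this.typeString = typeString;
        this.simpleString = simpleString;
    }
    public string GetDataTypeString() { return typeString; }
    public string GetDataTypeSimpleString() { return simpleString; }
}

internal class StubField : IStructFieldProxy
{
    private readonly string name;
    private readonly IStructDataTypeProxy dataType;
    private readonly bool isNullable;
    internal StubField(string name, IStructDataTypeProxy dataType, bool isNullable)
    {
        this.name = name;
        this.dataType = dataType;
        this.isNullable = isNullable;
    }
    public string GetStructFieldName() { return name; }
    public IStructDataTypeProxy GetStructFieldDataType() { return dataType; }
    public bool GetStructFieldIsNullable() { return isNullable; }
}

internal class StubStructType : IStructTypeProxy
{
    private readonly List<IStructFieldProxy> fields;
    internal StubStructType(List<IStructFieldProxy> fields) { this.fields = fields; }
    public List<IStructFieldProxy> GetStructTypeFields() { return fields; }
}
```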

Просмотреть файл

@ -0,0 +1,273 @@
// Copyright (c) Microsoft. All rights reserved.
// Licensed under the MIT license. See LICENSE file in the project root for full license information.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using Microsoft.Spark.CSharp.Interop;
using Microsoft.Spark.CSharp.Interop.Ipc;
namespace Microsoft.Spark.CSharp.Proxy.Ipc
{
internal class DataFrameIpcProxy : IDataFrameProxy
{
private readonly JvmObjectReference jvmDataFrameReference;
private readonly ISqlContextProxy sqlContextProxy;
internal DataFrameIpcProxy(JvmObjectReference jvmDataFrameReference, ISqlContextProxy sqlProxy)
{
this.jvmDataFrameReference = jvmDataFrameReference;
sqlContextProxy = sqlProxy;
}
public void RegisterTempTable(string tableName)
{
SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(jvmDataFrameReference,
"registerTempTable", new object[] {tableName});
}
public long Count()
{
return
long.Parse(
SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(
jvmDataFrameReference, "count").ToString());
}
public string GetQueryExecution()
{
var queryExecutionReference = GetQueryExecutionReference();
return SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(queryExecutionReference, "toString").ToString();
}
private JvmObjectReference GetQueryExecutionReference()
{
return
new JvmObjectReference(
SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(
jvmDataFrameReference, "queryExecution").ToString());
}
public string GetExecutedPlan()
{
var queryExecutionReference = GetQueryExecutionReference();
var executedPlanReference =
new JvmObjectReference(
SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(queryExecutionReference, "executedPlan")
.ToString());
return SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(executedPlanReference, "toString", new object[] { }).ToString();
}
public string GetShowString(int numberOfRows, bool truncate)
{
return
SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(
jvmDataFrameReference, "showString",
new object[] {numberOfRows /*, truncate*/ }).ToString(); //1.4.1 does not support second param
}
public IStructTypeProxy GetSchema()
{
return
new StructTypeIpcProxy(new JvmObjectReference(
SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(
jvmDataFrameReference, "schema").ToString()));
}
public IRDDProxy ToJSON()
{
return new RDDIpcProxy(
new JvmObjectReference((string)SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(
new JvmObjectReference((string)SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(jvmDataFrameReference, "toJSON")),
"toJavaRDD")));
}
public IRDDProxy ToRDD()
{
return new RDDIpcProxy(
new JvmObjectReference((string)SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(
new JvmObjectReference((string)SparkCLREnvironment.JvmBridge.CallStaticJavaMethod("org.apache.spark.sql.api.csharp.SQLUtils", "dfToRowRDD", new object[] {jvmDataFrameReference})),
"toJavaRDD")));
}
public IColumnProxy GetColumn(string columnName)
{
return
new ColumnIpcProxy(new JvmObjectReference(
SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(
jvmDataFrameReference, "col", new object[] {columnName}).ToString()));
}
public object ToObjectSeq(List<object> objectList)
{
var javaObjectReferenceList = objectList.Cast<JvmObjectReference>().ToList();
return
new JvmObjectReference(
SparkCLREnvironment.JvmBridge.CallStaticJavaMethod("org.apache.spark.sql.api.csharp.SQLUtils",
"toSeq", new object[] {javaObjectReferenceList}).ToString());
}
public IColumnProxy ToColumnSeq(List<IColumnProxy> columnRefList)
{
var javaObjectReferenceList = columnRefList.Select(s => (s as ColumnIpcProxy).ScalaColumnReference).ToList().Cast<JvmObjectReference>();
return
new ColumnIpcProxy(new JvmObjectReference(
SparkCLREnvironment.JvmBridge.CallStaticJavaMethod("org.apache.spark.sql.api.csharp.SQLUtils",
"toSeq", new object[] { javaObjectReferenceList }).ToString()));
}
public IDataFrameProxy Select(IColumnProxy columnSequenceReference)
{
return
new DataFrameIpcProxy(new JvmObjectReference(
SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(
jvmDataFrameReference, "select",
new object[] { (columnSequenceReference as ColumnIpcProxy).ScalaColumnReference }).ToString()), sqlContextProxy);
}
public IDataFrameProxy Filter(string condition)
{
return
new DataFrameIpcProxy(new JvmObjectReference(
SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(
jvmDataFrameReference, "filter", new object[] { condition }).ToString()), sqlContextProxy);
}
public IGroupedDataProxy GroupBy(string firstColumnName, IColumnProxy otherColumnSequenceReference)
{
return
new GroupedDataIpcProxy(new JvmObjectReference(
SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(
jvmDataFrameReference, "groupBy",
new object[] { firstColumnName, (otherColumnSequenceReference as ColumnIpcProxy).ScalaColumnReference }).ToString()));
}
public IGroupedDataProxy GroupBy(IColumnProxy columnSequenceReference)
{
return
new GroupedDataIpcProxy(new JvmObjectReference(
SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(
jvmDataFrameReference, "groupBy",
new object[] { (columnSequenceReference as ColumnIpcProxy).ScalaColumnReference}).ToString()));
}
public IGroupedDataProxy GroupBy(object columnSequenceReference)
{
return
new GroupedDataIpcProxy(new JvmObjectReference(
SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(
jvmDataFrameReference, "groupBy",
new object[] { columnSequenceReference as JvmObjectReference }).ToString()));
}
public IDataFrameProxy Agg(IGroupedDataProxy scalaGroupedDataReference, Dictionary<string, string> columnNameAggFunctionDictionary)
{
var mapReference = new JvmObjectReference(SparkCLREnvironment.JvmBridge.CallConstructor("java.util.HashMap").ToString());
foreach (var key in columnNameAggFunctionDictionary.Keys)
{
SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(mapReference, "put", new object[] { key, columnNameAggFunctionDictionary[key]});
}
return
new DataFrameIpcProxy(new JvmObjectReference(
SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(
(scalaGroupedDataReference as GroupedDataIpcProxy).ScalaGroupedDataReference, "agg", new object[] { mapReference }).ToString()), sqlContextProxy);
}
public IDataFrameProxy Join(IDataFrameProxy otherScalaDataFrameReference, string joinColumnName)
{
return
new DataFrameIpcProxy(new JvmObjectReference(
SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(jvmDataFrameReference, "join", new object[]
{
(otherScalaDataFrameReference as DataFrameIpcProxy).jvmDataFrameReference,
joinColumnName
}).ToString()
), sqlContextProxy);
}
public IDataFrameProxy Join(IDataFrameProxy otherScalaDataFrameReference, string[] joinColumnNames)
{
throw new NotSupportedException("Not supported in 1.4.1");
//TODO - uncomment this in 1.5
//var stringSequenceReference = new JvmObjectReference(
// SparkCLREnvironment.JvmBridge.CallStaticJavaMethod("org.apache.spark.sql.api.csharp.SQLUtils", "toSeq", new object[] { joinColumnNames }).ToString());
//return
// new JvmObjectReference(
// SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(scalaDataFrameReference, "join", new object[]
// {
// otherScalaDataFrameReference,
// stringSequenceReference
// }).ToString()
// );
}
public IDataFrameProxy Join(IDataFrameProxy otherScalaDataFrameReference, IColumnProxy scalaColumnReference, string joinType)
{
return
new DataFrameIpcProxy(new JvmObjectReference(
SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(
jvmDataFrameReference, "join",
new object[]
{
(otherScalaDataFrameReference as DataFrameIpcProxy).jvmDataFrameReference,
(scalaColumnReference as ColumnIpcProxy).ScalaColumnReference,
joinType
}).ToString()), sqlContextProxy);
}
}
internal class ColumnIpcProxy : IColumnProxy
{
private readonly JvmObjectReference scalaColumnReference;
internal JvmObjectReference ScalaColumnReference { get { return scalaColumnReference; } }
internal ColumnIpcProxy(JvmObjectReference colReference)
{
scalaColumnReference = colReference;
}
public IColumnProxy EqualsOperator(IColumnProxy secondColumn)
{
return
new ColumnIpcProxy(new JvmObjectReference(
SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(
scalaColumnReference, "equalTo",
new object[] { (secondColumn as ColumnIpcProxy).scalaColumnReference }).ToString()));
}
public IColumnProxy UnaryOp(string name)
{
return new ColumnIpcProxy(new JvmObjectReference((string)SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(scalaColumnReference, name)));
}
public IColumnProxy FuncOp(string name)
{
return new ColumnIpcProxy(new JvmObjectReference((string)SparkCLREnvironment.JvmBridge.CallStaticJavaMethod("org.apache.spark.sql.functions", name, scalaColumnReference)));
}
public IColumnProxy BinOp(string name, object other)
{
if (other is ColumnIpcProxy)
other = (other as ColumnIpcProxy).scalaColumnReference;
return new ColumnIpcProxy(new JvmObjectReference((string)SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(scalaColumnReference, name, other)));
}
}
internal class GroupedDataIpcProxy : IGroupedDataProxy
{
private readonly JvmObjectReference scalaGroupedDataReference;
internal JvmObjectReference ScalaGroupedDataReference { get { return scalaGroupedDataReference; } }
internal GroupedDataIpcProxy(JvmObjectReference gdRef)
{
scalaGroupedDataReference = gdRef;
}
}
}

Просмотреть файл

@ -0,0 +1,260 @@
// Copyright (c) Microsoft. All rights reserved.
// Licensed under the MIT license. See LICENSE file in the project root for full license information.
using System;
using System.Collections.Generic;
using System.Configuration;
using System.Linq;
using System.Reflection;
using System.Text;
using System.Threading.Tasks;
using Microsoft.Spark.CSharp.Core;
using Microsoft.Spark.CSharp.Interop;
using Microsoft.Spark.CSharp.Interop.Ipc;
namespace Microsoft.Spark.CSharp.Proxy.Ipc
{
internal class RDDIpcProxy : IRDDProxy
{
private readonly JvmObjectReference jvmRddReference;
internal JvmObjectReference JvmRddReference
{
get { return jvmRddReference; }
}
public string Name
{
get
{
var rdd = new JvmObjectReference((string)SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(jvmRddReference, "rdd"));
return (string)SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(rdd, "name");
}
}
public bool IsCheckpointed
{
get
{
var rdd = new JvmObjectReference((string)SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(jvmRddReference, "rdd"));
return (bool)SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(rdd, "isCheckpointed");
}
}
public RDDIpcProxy(JvmObjectReference jvmRddReference)
{
this.jvmRddReference = jvmRddReference;
}
public long Count()
{
var rdd = new JvmObjectReference((string)SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(jvmRddReference, "rdd"));
return long.Parse(SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(rdd, "count").ToString());
}
public int CollectAndServe()
{
var rdd = new JvmObjectReference((string)SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(jvmRddReference, "rdd"));
return int.Parse(SparkCLREnvironment.JvmBridge.CallStaticJavaMethod("org.apache.spark.api.python.PythonRDD", "collectAndServe", new object[] { rdd }).ToString());
}
public IRDDProxy Union(IRDDProxy javaRddReferenceOther)
{
var jref = new JvmObjectReference(SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(jvmRddReference, "union", new object[] { (javaRddReferenceOther as RDDIpcProxy).jvmRddReference }).ToString());
return new RDDIpcProxy(jref);
}
public int PartitionLength()
{
var rdd = new JvmObjectReference((string)SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(jvmRddReference, "rdd"));
var partitions = SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(rdd, "partitions", new object[] { });
return int.Parse(SparkCLREnvironment.JvmBridge.CallStaticJavaMethod("java.lang.reflect.Array", "getLength", new object[] { partitions }).ToString());
}
public IRDDProxy Coalesce(int numPartitions, bool shuffle)
{
return new RDDIpcProxy(new JvmObjectReference((string)SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(jvmRddReference, "coalesce", new object[] { numPartitions, shuffle })));
}
public IRDDProxy Sample(bool withReplacement, double fraction, long seed)
{
var jref = new JvmObjectReference((string)SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(jvmRddReference, "sample", new object[] { withReplacement, fraction, seed }));
return new RDDIpcProxy(jref);
}
public IRDDProxy[] RandomSplit(double[] weights, long seed)
{
return ((List<JvmObjectReference>)SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(jvmRddReference, "randomSplit", new object[] { weights, seed }))
.Select(obj => new RDDIpcProxy(obj)).ToArray();
}
public IRDDProxy RandomSampleWithRange(double lb, double ub, long seed)
{
var rdd = new JvmObjectReference((string)SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(jvmRddReference, "rdd"));
return new RDDIpcProxy(new JvmObjectReference((string)SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(rdd, "randomSampleWithRange", new object[] { lb, ub, seed })));
}
public void Cache()
{
SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(jvmRddReference, "cache");
}
public void Persist(StorageLevelType storageLevelType)
{
var jstorageLevel = GetJavaStorageLevel(storageLevelType);
SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(jvmRddReference, "persist", new object[] { jstorageLevel });
}
public void Unpersist()
{
SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(jvmRddReference, "unpersist");
}
public void Checkpoint()
{
var rdd = new JvmObjectReference((string)SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(jvmRddReference, "rdd"));
SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(rdd, "checkpoint");
}
public string GetCheckpointFile()
{
var rdd = new JvmObjectReference((string)SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(jvmRddReference, "rdd"));
return (string)SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(rdd, "getCheckpointFile");
}
public int GetNumPartitions()
{
var rdd = new JvmObjectReference((string)SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(jvmRddReference, "rdd"));
return ((List<JvmObjectReference>)SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(rdd, "partitions")).Count;
}
public IRDDProxy Intersection(IRDDProxy other)
{
return new RDDIpcProxy(new JvmObjectReference((string)SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(jvmRddReference, "intersection", new object[] { (other as RDDIpcProxy).jvmRddReference })));
}
public IRDDProxy Repartition(int numPartitions)
{
return new RDDIpcProxy(new JvmObjectReference((string)SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(jvmRddReference, "repartition", new object[] { numPartitions })));
}
public IRDDProxy Cartesian(IRDDProxy other)
{
var rdd = new JvmObjectReference((string)SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(jvmRddReference, "rdd"));
var otherRdd = new JvmObjectReference((string)SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod((other as RDDIpcProxy).jvmRddReference, "rdd"));
return new RDDIpcProxy(new JvmObjectReference((string)SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(jvmRddReference, "cartesian", new object[] { otherRdd })));
}
public IRDDProxy Pipe(string command)
{
var rdd = new JvmObjectReference((string)SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(jvmRddReference, "rdd"));
return new RDDIpcProxy(new JvmObjectReference((string)SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(jvmRddReference, "pipe", new object[] { command })));
}
public void SetName(string name)
{
SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(jvmRddReference, "setName", new object[] { name });
}
public IRDDProxy SampleByKey(bool withReplacement, Dictionary<string, double> fractions, long seed)
{
var jfractions = SparkContextIpcProxy.GetJavaMap(fractions) as JvmObjectReference;
return new RDDIpcProxy(new JvmObjectReference((string)SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(jvmRddReference, "sampleByKey", new object[] { withReplacement, jfractions, seed })));
}
public string ToDebugString()
{
var rdd = new JvmObjectReference((string)SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(jvmRddReference, "rdd"));
return (string)SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(jvmRddReference, "toDebugString");
}
public IRDDProxy Zip(IRDDProxy other)
{
var rdd = new JvmObjectReference((string)SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(jvmRddReference, "rdd"));
return new RDDIpcProxy(new JvmObjectReference((string)SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(jvmRddReference, "zip", new object[] { (other as RDDIpcProxy).jvmRddReference })));
}
public IRDDProxy ZipWithIndex()
{
var rdd = new JvmObjectReference((string)SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(jvmRddReference, "rdd"));
return new RDDIpcProxy(new JvmObjectReference((string)SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(jvmRddReference, "zipWithIndex")));
}
public IRDDProxy ZipWithUniqueId()
{
var rdd = new JvmObjectReference((string)SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(jvmRddReference, "rdd"));
return new RDDIpcProxy(new JvmObjectReference((string)SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(jvmRddReference, "zipWithUniqueId")));
}
public void SaveAsNewAPIHadoopDataset(IEnumerable<KeyValuePair<string, string>> conf)
{
var jconf = SparkContextIpcProxy.GetJavaMap<string, string>(conf);
SparkCLREnvironment.JvmBridge.CallStaticJavaMethod("org.apache.spark.api.python.PythonRDD", "saveAsHadoopDataset", new object[] { jvmRddReference, false, jconf, null, null, true });
}
public void SaveAsNewAPIHadoopFile(string path, string outputFormatClass, string keyClass, string valueClass, IEnumerable<KeyValuePair<string, string>> conf)
{
var jconf = SparkContextIpcProxy.GetJavaMap<string, string>(conf);
SparkCLREnvironment.JvmBridge.CallStaticJavaMethod("org.apache.spark.api.python.PythonRDD", "saveAsNewAPIHadoopFile", new object[] { jvmRddReference, false, path, outputFormatClass, keyClass, valueClass, null, null, jconf });
}
public void SaveAsHadoopDataset(IEnumerable<KeyValuePair<string, string>> conf)
{
var jconf = SparkContextIpcProxy.GetJavaMap<string, string>(conf);
SparkCLREnvironment.JvmBridge.CallStaticJavaMethod("org.apache.spark.api.python.PythonRDD", "saveAsHadoopDataset", new object[] { jvmRddReference, false, jconf, null, null, false });
}
public void saveAsHadoopFile(string path, string outputFormatClass, string keyClass, string valueClass, IEnumerable<KeyValuePair<string, string>> conf, string compressionCodecClass)
{
var jconf = SparkContextIpcProxy.GetJavaMap<string, string>(conf);
SparkCLREnvironment.JvmBridge.CallStaticJavaMethod("org.apache.spark.api.python.PythonRDD", "saveAsHadoopFile", new object[] { jvmRddReference, false, path, outputFormatClass, keyClass, valueClass, null, null, jconf, compressionCodecClass });
}
public void SaveAsSequenceFile(string path, string compressionCodecClass)
{
SparkCLREnvironment.JvmBridge.CallStaticJavaMethod("org.apache.spark.api.python.PythonRDD", "saveAsSequenceFile", new object[] { jvmRddReference, false, path, compressionCodecClass });
}
public void SaveAsTextFile(string path, string compressionCodecClass)
{
var rdd = new JvmObjectReference((string)SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(jvmRddReference, "rdd"));
if (!string.IsNullOrEmpty(compressionCodecClass))
{
var codec = new JvmObjectReference((string)SparkCLREnvironment.JvmBridge.CallStaticJavaMethod("java.lang.Class", "forName", new object[] { compressionCodecClass }));
SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(jvmRddReference, "saveAsTextFile", new object[] { path, codec });
}
else
{
SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(jvmRddReference, "saveAsTextFile", new object[] { path });
}
}
public StorageLevel GetStorageLevel()
{
var rdd = new JvmObjectReference((string)SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(jvmRddReference, "rdd"));
var storageLevel = new JvmObjectReference((string)SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(rdd, "getStorageLevel"));
return new StorageLevel
(
(bool)SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(storageLevel, "useDisk"),
(bool)SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(storageLevel, "useMemory"),
(bool)SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(storageLevel, "useOffHeap"),
(bool)SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(storageLevel, "deserialized"),
(int)SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(storageLevel, "replication")
);
}
private JvmObjectReference GetJavaStorageLevel(StorageLevelType storageLevelType)
{
return new JvmObjectReference(SparkCLREnvironment.JvmBridge.CallStaticJavaMethod("org.apache.spark.api.java.StorageLevels", "create",
new object[]
{
StorageLevel.storageLevel[storageLevelType].useDisk,
StorageLevel.storageLevel[storageLevelType].useMemory,
StorageLevel.storageLevel[storageLevelType].useOffHeap,
StorageLevel.storageLevel[storageLevelType].deserialized,
StorageLevel.storageLevel[storageLevelType].replication
}).ToString());
}
}
}

Просмотреть файл

@ -0,0 +1,59 @@
// Copyright (c) Microsoft. All rights reserved.
// Licensed under the MIT license. See LICENSE file in the project root for full license information.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using Microsoft.Spark.CSharp.Interop;
using Microsoft.Spark.CSharp.Interop.Ipc;
namespace Microsoft.Spark.CSharp.Proxy
{
internal class SparkConfIpcProxy : ISparkConfProxy
{
private JvmObjectReference jvmSparkConfReference;
internal JvmObjectReference JvmSparkConfReference
{
get { return jvmSparkConfReference; }
}
public void CreateSparkConf(bool loadDefaults = true)
{
jvmSparkConfReference = SparkCLREnvironment.JvmBridge.CallConstructor("org.apache.spark.SparkConf", new object[] { loadDefaults });
}
public void SetMaster(string master)
{
jvmSparkConfReference = new JvmObjectReference(SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(jvmSparkConfReference, "setMaster", new object[] { master }).ToString());
}
public void SetAppName(string appName)
{
jvmSparkConfReference = new JvmObjectReference(SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(jvmSparkConfReference, "setAppName", new object[] { appName }).ToString());
}
public void SetSparkHome(string sparkHome)
{
jvmSparkConfReference = new JvmObjectReference(SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(jvmSparkConfReference, "setSparkHome", new object[] { sparkHome }).ToString());
}
public void Set(string key, string value)
{
jvmSparkConfReference = new JvmObjectReference(SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(jvmSparkConfReference, "set", new object[] { key, value }).ToString());
}
public int GetInt(string key, int defaultValue)
{
return int.Parse(SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(jvmSparkConfReference, "getInt", new object[] { key, defaultValue }).ToString());
}
public string Get(string key, string defaultValue)
{
return SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(jvmSparkConfReference, "get", new object[] { key, defaultValue }).ToString();
}
}
}

Просмотреть файл

@ -0,0 +1,353 @@
// Copyright (c) Microsoft. All rights reserved.
// Licensed under the MIT license. See LICENSE file in the project root for full license information.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using Microsoft.Spark.CSharp.Core;
using Microsoft.Spark.CSharp.Interop;
using Microsoft.Spark.CSharp.Interop.Ipc;
using Microsoft.Spark.CSharp.Proxy.Ipc;
namespace Microsoft.Spark.CSharp.Proxy
{
internal class SparkContextIpcProxy : ISparkContextProxy
{
private JvmObjectReference jvmSparkContextReference;
private JvmObjectReference jvmJavaContextReference;
private JvmObjectReference jvmAccumulatorReference;
internal JvmObjectReference JvmSparkContextReference
{
get { return jvmSparkContextReference; }
}
public void CreateSparkContext(string master, string appName, string sparkHome, ISparkConfProxy conf)
{
object[] args = (new object[] { master, appName, sparkHome, (conf == null ? null : (conf as SparkConfIpcProxy).JvmSparkConfReference) }).Where(x => x != null).ToArray();
jvmSparkContextReference = SparkCLREnvironment.JvmBridge.CallConstructor("org.apache.spark.SparkContext", args);
jvmJavaContextReference = SparkCLREnvironment.JvmBridge.CallConstructor("org.apache.spark.api.java.JavaSparkContext", new object[] { jvmSparkContextReference });
}
public void SetLogLevel(string logLevel)
{
SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(jvmJavaContextReference, "setLogLevel", new object[] { logLevel });
}
private string version;
public string Version
{
get { if (version == null) { version = (string)SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(jvmJavaContextReference, "version"); } return version; }
}
private long? startTime;
public long StartTime
{
get { if (startTime == null) { startTime = (long)(double)SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(jvmJavaContextReference, "startTime"); } return (long)startTime; }
}
private int? defaultParallelism;
public int DefaultParallelism
{
get { if (defaultParallelism == null) { defaultParallelism = (int)SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(jvmJavaContextReference, "defaultParallelism"); } return (int)defaultParallelism; }
}
private int? defaultMinPartitions;
public int DefaultMinPartitions
{
get { if (defaultMinPartitions == null) { defaultMinPartitions = (int)SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(jvmJavaContextReference, "defaultMinPartitions"); } return (int)defaultMinPartitions; }
}
public void Accumulator(string host, int port)
{
jvmAccumulatorReference = new JvmObjectReference((string)SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(jvmJavaContextReference, "accumulator",
SparkCLREnvironment.JvmBridge.CallConstructor("java.util.ArrayList"),
SparkCLREnvironment.JvmBridge.CallConstructor("org.apache.spark.api.python.PythonAccumulatorParam", host, port)
));
}
public void Stop()
{
SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(jvmJavaContextReference, "stop", new object[] { });
SparkCLREnvironment.JvmBridge.CallStaticJavaMethod("SparkCLRHandler", "stopBackend", new object[] { }); //className and methodName hardcoded in CSharpBackendHandler
}
public IRDDProxy EmptyRDD<T>()
{
var jvmRddReference = new JvmObjectReference((string)SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(jvmJavaContextReference, "emptyRDD"));
return new RDDIpcProxy(jvmRddReference);
}
//TODO - this implementation is slow. Replace with call to createRDDFromArray() in CSharpRDD
public IRDDProxy Parallelize(IEnumerable<byte[]> values, int? numSlices)
{
JvmObjectReference jvalues = GetJavaList(values);
var jvmRddReference = new JvmObjectReference(SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(jvmJavaContextReference, "parallelize", new object[] { jvalues, numSlices }).ToString());
return new RDDIpcProxy(jvmRddReference);
}
public IRDDProxy TextFile(string filePath, int minPartitions)
{
var jvmRddReference = new JvmObjectReference(SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(jvmJavaContextReference, "textFile", new object[] { filePath, minPartitions }).ToString());
return new RDDIpcProxy(jvmRddReference);
}
public IRDDProxy WholeTextFiles(string filePath, int minPartitions)
{
var jvmRddReference = new JvmObjectReference(SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(jvmJavaContextReference, "wholeTextFiles", new object[] { filePath, minPartitions }).ToString());
return new RDDIpcProxy(jvmRddReference);
}
public IRDDProxy BinaryFiles(string filePath, int minPartitions)
{
var jvmRddReference = new JvmObjectReference(SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(jvmJavaContextReference, "binaryFiles", new object[] { filePath, minPartitions }).ToString());
return new RDDIpcProxy(jvmRddReference);
}
public IRDDProxy SequenceFile(string filePath, string keyClass, string valueClass, string keyConverterClass, string valueConverterClass, int minSplits, int batchSize)
{
var jvmRddReference = new JvmObjectReference((string)SparkCLREnvironment.JvmBridge.CallStaticJavaMethod("org.apache.spark.api.python.PythonRDD", "sequenceFile",
new object[] { jvmJavaContextReference, filePath, keyClass, valueClass, keyConverterClass, valueConverterClass, minSplits, batchSize }));
return new RDDIpcProxy(jvmRddReference);
}
public IRDDProxy NewAPIHadoopFile(string filePath, string inputFormatClass, string keyClass, string valueClass, string keyConverterClass, string valueConverterClass, IEnumerable<KeyValuePair<string, string>> conf, int batchSize)
{
var jconf = GetJavaMap<string, string>(conf) as JvmObjectReference;
var jvmRddReference = new JvmObjectReference((string)SparkCLREnvironment.JvmBridge.CallStaticJavaMethod("org.apache.spark.api.python.PythonRDD", "newAPIHadoopFile",
new object[] { jvmJavaContextReference, filePath, inputFormatClass, keyClass, valueClass, keyConverterClass, valueConverterClass, jconf, batchSize }));
return new RDDIpcProxy(jvmRddReference);
}
public IRDDProxy NewAPIHadoopRDD(string inputFormatClass, string keyClass, string valueClass, string keyConverterClass, string valueConverterClass, IEnumerable<KeyValuePair<string, string>> conf, int batchSize)
{
var jconf = GetJavaMap<string, string>(conf) as JvmObjectReference;
var jvmRddReference = new JvmObjectReference((string)SparkCLREnvironment.JvmBridge.CallStaticJavaMethod("org.apache.spark.api.python.PythonRDD", "newAPIHadoopRDD",
new object[] { jvmJavaContextReference, inputFormatClass, keyClass, valueClass, keyConverterClass, valueConverterClass, jconf, batchSize }));
return new RDDIpcProxy(jvmRddReference);
}
public IRDDProxy HadoopFile(string filePath, string inputFormatClass, string keyClass, string valueClass, string keyConverterClass, string valueConverterClass, IEnumerable<KeyValuePair<string, string>> conf, int batchSize)
{
var jconf = GetJavaMap<string, string>(conf) as JvmObjectReference;
var jvmRddReference = new JvmObjectReference((string)SparkCLREnvironment.JvmBridge.CallStaticJavaMethod("org.apache.spark.api.python.PythonRDD", "hadoopFile",
new object[] { jvmJavaContextReference, filePath, inputFormatClass, keyClass, valueClass, keyConverterClass, valueConverterClass, jconf, batchSize }));
return new RDDIpcProxy(jvmRddReference);
}
public IRDDProxy HadoopRDD(string inputFormatClass, string keyClass, string valueClass, string keyConverterClass, string valueConverterClass, IEnumerable<KeyValuePair<string, string>> conf, int batchSize)
{
var jconf = GetJavaMap<string, string>(conf) as JvmObjectReference;
var jvmRddReference = new JvmObjectReference((string)SparkCLREnvironment.JvmBridge.CallStaticJavaMethod("org.apache.spark.api.python.PythonRDD", "hadoopRDD",
new object[] { jvmJavaContextReference, inputFormatClass, keyClass, valueClass, keyConverterClass, valueConverterClass, jconf, batchSize }));
return new RDDIpcProxy(jvmRddReference);
}
public IRDDProxy CheckpointFile(string filePath)
{
var jvmRddReference = new JvmObjectReference((string)SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(jvmJavaContextReference, "checkpointFile", new object[] { filePath }));
return new RDDIpcProxy(jvmRddReference);
}
public IRDDProxy Union<T>(IEnumerable<RDD<T>> rdds)
{
int count = rdds == null ? 0 : rdds.Count();
if (count == 0)
return null;
if (count == 1)
return rdds.First().RddProxy;
var jfirst = (rdds.First().RddProxy as RDDIpcProxy).JvmRddReference;
var jrest = GetJavaList(rdds.Skip(1).Select(r => (r.RddProxy as RDDIpcProxy).JvmRddReference)) as JvmObjectReference; //Skip(1) - all RDDs except the first
var jvmRddReference = new JvmObjectReference((string)SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(jvmJavaContextReference, "union", new object[] { jfirst, jrest }));
return new RDDIpcProxy(jvmRddReference);
}
public void AddFile(string path)
{
SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(jvmSparkContextReference, "addFile", new object[] { path });
}
public void SetCheckpointDir(string directory)
{
SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(jvmSparkContextReference, "setCheckpointDir", new object[] { directory });
}
public void SetJobGroup(string groupId, string description, bool interruptOnCancel)
{
SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(jvmJavaContextReference, "setJobGroup", new object[] { groupId, description, interruptOnCancel });
}
public void SetLocalProperty(string key, string value)
{
SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(jvmJavaContextReference, "setLocalProperty", new object[] { key, value });
}
public string GetLocalProperty(string key)
{
return (string)SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(jvmJavaContextReference, "getLocalProperty", new object[] { key });
}
public string SparkUser
{
get
{
return (string)SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(jvmSparkContextReference, "sparkUser");
}
}
public void CancelJobGroup(string groupId)
{
SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(jvmJavaContextReference, "cancelJobGroup", new object[] { groupId });
}
public void CancelAllJobs()
{
SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(jvmJavaContextReference, "cancelAllJobs");
}
private IStatusTrackerProxy statusTracker;
public IStatusTrackerProxy StatusTracker
{
get
{
if (statusTracker == null)
{
statusTracker = new StatusTrackerIpcProxy(new JvmObjectReference((string)SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(jvmJavaContextReference, "statusTracker")));
}
return statusTracker;
}
}
public IRDDProxy CreatePairwiseRDD<K, V>(IRDDProxy jvmReferenceOfByteArrayRdd, int numPartitions)
{
var rdd = new JvmObjectReference((string)SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod((jvmReferenceOfByteArrayRdd as RDDIpcProxy).JvmRddReference, "rdd"));
var pairwiseRdd = SparkCLREnvironment.JvmBridge.CallConstructor("org.apache.spark.api.python.PairwiseRDD", rdd);
var pairRddJvmReference = new JvmObjectReference(SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(pairwiseRdd, "asJavaPairRDD", new object[] { }).ToString());
var jpartitionerJavaReference = SparkCLREnvironment.JvmBridge.CallConstructor("org.apache.spark.api.python.PythonPartitioner", new object[] { numPartitions, 0 });
var partitionedPairRddJvmReference = new JvmObjectReference(SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(pairRddJvmReference, "partitionBy", new object[] { jpartitionerJavaReference }).ToString());
var jvmRddReference = new JvmObjectReference(SparkCLREnvironment.JvmBridge.CallStaticJavaMethod("org.apache.spark.api.python.PythonRDD", "valueOfPair", new object[] { partitionedPairRddJvmReference }).ToString());
//var jvmRddReference = new JvmObjectReference(SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(partitionedRddJvmReference, "rdd", new object[] { }).ToString());
return new RDDIpcProxy(jvmRddReference);
}
public IRDDProxy CreateCSharpRdd(IRDDProxy prevJvmRddReference, byte[] command, Dictionary<string, string> environmentVariables, List<string> pythonIncludes, bool preservesPartitioning, List<Broadcast> broadcastVariables, List<byte[]> accumulator)
{
var hashTableReference = SparkCLREnvironment.JvmBridge.CallConstructor("java.util.Hashtable", new object[] { });
var arrayListReference = SparkCLREnvironment.JvmBridge.CallConstructor("java.util.ArrayList", new object[] { });
var jbroadcastVariables = GetJavaList<JvmObjectReference>(broadcastVariables.Select(x => new JvmObjectReference(x.broadcastObjId)));
var rdd = new JvmObjectReference((string)SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod((prevJvmRddReference as RDDIpcProxy).JvmRddReference, "rdd"));
var csRdd = SparkCLREnvironment.JvmBridge.CallConstructor("org.apache.spark.api.csharp.CSharpRDD",
new object[]
{
rdd, command, hashTableReference, arrayListReference, preservesPartitioning,
SparkCLREnvironment.ConfigurationService.GetCSharpRDDExternalProcessName(),
"1.0",
jbroadcastVariables, jvmAccumulatorReference
});
return new RDDIpcProxy(new JvmObjectReference((string)SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(csRdd, "asJavaRDD")));
}
public IRDDProxy CreateUserDefinedCSharpFunction(string name, byte[] command, string returnType = "string")
{
var jSqlContext = SparkCLREnvironment.JvmBridge.CallConstructor("org.apache.spark.sql.SQLContext", new object[] { (SparkCLREnvironment.SparkContextProxy as SparkContextIpcProxy).jvmSparkContextReference });
var jDataType = SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(jSqlContext, "parseDataType", new object[] { "\"" + returnType + "\"" });
var hashTableReference = SparkCLREnvironment.JvmBridge.CallConstructor("java.util.Hashtable", new object[] { });
var arrayListReference = SparkCLREnvironment.JvmBridge.CallConstructor("java.util.ArrayList", new object[] { });
return new RDDIpcProxy(SparkCLREnvironment.JvmBridge.CallConstructor("org.apache.spark.sql.UserDefinedPythonFunction",
new object[]
{
name, command, hashTableReference, arrayListReference,
SparkCLREnvironment.ConfigurationService.GetCSharpRDDExternalProcessName(),
"1.0",
arrayListReference, null, jDataType
}));
}
public int RunJob(IRDDProxy rdd, IEnumerable<int> partitions, bool allowLocal)
{
var jpartitions = GetJavaList<int>(partitions);
return int.Parse(SparkCLREnvironment.JvmBridge.CallStaticJavaMethod("org.apache.spark.api.python.PythonRDD", "runJob", new object[] { jvmSparkContextReference, (rdd as RDDIpcProxy).JvmRddReference, jpartitions, allowLocal }).ToString());
}
public string ReadBroadcastFromFile(string path, out long broadcastId)
{
string broadcastObjId = (string)SparkCLREnvironment.JvmBridge.CallStaticJavaMethod("org.apache.spark.api.python.PythonRDD", "readBroadcastFromFile", new object[] { jvmJavaContextReference, path });
broadcastId = (long)(double)SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(new JvmObjectReference(broadcastObjId), "id");
return broadcastObjId;
}
public void UnpersistBroadcast(string broadcastObjId, bool blocking)
{
SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(new JvmObjectReference(broadcastObjId), "unpersist", new object[] { blocking });
}
public IColumnProxy CreateColumnFromName(string name)
{
return new ColumnIpcProxy(new JvmObjectReference((string)SparkCLREnvironment.JvmBridge.CallStaticJavaMethod("org.apache.spark.sql.functions", "col", name)));
}
public IColumnProxy CreateFunction(string name, object self)
{
if (self is ColumnIpcProxy)
self = (self as ColumnIpcProxy).ScalaColumnReference;
else if (self is IColumnProxy[])
self = GetJavaSeq<JvmObjectReference>((self as IColumnProxy[]).Select(x => (x as ColumnIpcProxy).ScalaColumnReference));
return new ColumnIpcProxy(new JvmObjectReference((string)SparkCLREnvironment.JvmBridge.CallStaticJavaMethod("org.apache.spark.sql.functions", name, self)));
}
public IColumnProxy CreateBinaryMathFunction(string name, object self, object other)
{
if (self is ColumnIpcProxy)
self = (self as ColumnIpcProxy).ScalaColumnReference;
if (other is ColumnIpcProxy)
other = (other as ColumnIpcProxy).ScalaColumnReference;
return new ColumnIpcProxy(new JvmObjectReference((string)SparkCLREnvironment.JvmBridge.CallStaticJavaMethod("org.apache.spark.sql.functions", name, self, other)));
}
public IColumnProxy CreateWindowFunction(string name)
{
return new ColumnIpcProxy(new JvmObjectReference((string)SparkCLREnvironment.JvmBridge.CallStaticJavaMethod("org.apache.spark.sql.functions", name)));
}
public static JvmObjectReference GetJavaMap<K, V>(IEnumerable<KeyValuePair<K, V>> enumerable)
{
var jmap = SparkCLREnvironment.JvmBridge.CallConstructor("java.util.Hashtable", new object[] { });
if (enumerable != null)
{
foreach (var item in enumerable)
SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(jmap, "put", new object[] { item.Key, item.Value });
}
return jmap;
}
public static JvmObjectReference GetJavaSet<T>(IEnumerable<T> enumerable)
{
var jset = SparkCLREnvironment.JvmBridge.CallConstructor("java.util.HashSet", new object[] { });
if (enumerable != null)
{
foreach (var item in enumerable)
SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(jset, "add", new object[] { item });
}
return jset;
}
public static JvmObjectReference GetJavaList<T>(IEnumerable<T> enumerable)
{
var jlist = SparkCLREnvironment.JvmBridge.CallConstructor("java.util.ArrayList", new object[] { });
if (enumerable != null)
{
foreach (var item in enumerable)
SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(jlist, "add", new object[] { item });
}
return jlist;
}
public JvmObjectReference GetJavaSeq<T>(IEnumerable<T> enumerable)
{
return new JvmObjectReference((string)SparkCLREnvironment.JvmBridge.CallStaticJavaMethod("org.apache.spark.sql.api.csharp.SQLUtils", "toSeq", GetJavaList<T>(enumerable)));
}
}
}
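The GetJavaMap, GetJavaSet and GetJavaList helpers above all marshal .NET collections the same way: an empty Java collection is constructed over the JVM bridge and each element is then added with a separate bridge call. A minimal sketch of how they might be used from other adapter code; these proxy types are internal to the adapter assembly, an initialized SparkCLREnvironment/JvmBridge is assumed, and the values shown are purely illustrative.

```
using System.Collections.Generic;
using Microsoft.Spark.CSharp.Interop.Ipc;
using Microsoft.Spark.CSharp.Proxy.Ipc;

internal class CollectionMarshalingSketch
{
    internal static void Marshal()
    {
        // GetJavaMap builds a java.util.Hashtable on the JVM and fills it one "put" call at a time
        JvmObjectReference jmap = SparkContextIpcProxy.GetJavaMap(
            new Dictionary<string, string> { { "spark.executor.memory", "2g" } });

        // GetJavaList builds a java.util.ArrayList the same way
        JvmObjectReference jlist = SparkContextIpcProxy.GetJavaList(new[] { "input1.txt", "input2.txt" });

        // both references can now be passed as arguments to later CallConstructor/Call*JavaMethod calls
    }
}
```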


@ -0,0 +1,103 @@
// Copyright (c) Microsoft. All rights reserved.
// Licensed under the MIT license. See LICENSE file in the project root for full license information.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using Microsoft.Spark.CSharp.Interop;
using Microsoft.Spark.CSharp.Interop.Ipc;
using Microsoft.Spark.CSharp.Sql;
namespace Microsoft.Spark.CSharp.Proxy.Ipc
{
internal class SqlContextIpcProxy : ISqlContextProxy
{
private JvmObjectReference jvmSqlContextReference;
private ISparkContextProxy sparkContextProxy;
public void CreateSqlContext(ISparkContextProxy scProxy)
{
sparkContextProxy = scProxy;
jvmSqlContextReference = new JvmObjectReference(SparkCLREnvironment.JvmBridge.CallStaticJavaMethod("org.apache.spark.sql.api.csharp.SQLUtils", "createSQLContext", new object[] { (sparkContextProxy as SparkContextIpcProxy).JvmSparkContextReference }).ToString());
}
public StructField CreateStructField(string name, string dataType, bool isNullable)
{
return new StructField(
new StructFieldIpcProxy(
new JvmObjectReference(
SparkCLREnvironment.JvmBridge.CallStaticJavaMethod(
"org.apache.spark.sql.api.csharp.SQLUtils", "createStructField",
new object[] {name, dataType, isNullable}).ToString()
)
)
);
}
public StructType CreateStructType(List<StructField> fields)
{
var fieldsReference = fields.Select(s => (s.StructFieldProxy as StructFieldIpcProxy).JvmStructFieldReference).ToList().Cast<JvmObjectReference>();
//var javaObjectReferenceList = objectList.Cast<JvmObjectReference>().ToList();
var seq =
new JvmObjectReference(
SparkCLREnvironment.JvmBridge.CallStaticJavaMethod("org.apache.spark.sql.api.csharp.SQLUtils",
"toSeq", new object[] { fieldsReference }).ToString());
return new StructType(
new StructTypeIpcProxy(
new JvmObjectReference(
SparkCLREnvironment.JvmBridge.CallStaticJavaMethod("org.apache.spark.sql.api.csharp.SQLUtils", "createStructType", new object[] { seq }).ToString()
)
)
);
}
public IDataFrameProxy ReaDataFrame(string path, StructType schema, Dictionary<string, string> options)
{
//parameter Dictionary<string, string> options is not used right now - it is meant to be passed on to data sources
return new DataFrameIpcProxy(
new JvmObjectReference(
SparkCLREnvironment.JvmBridge.CallStaticJavaMethod("org.apache.spark.sql.api.csharp.SQLUtils", "loadDF", new object[] { jvmSqlContextReference, path, (schema.StructTypeProxy as StructTypeIpcProxy).JvmStructTypeReference }).ToString()
), this
);
}
public IDataFrameProxy JsonFile(string path)
{
var javaDataFrameReference = SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(jvmSqlContextReference, "jsonFile", new object[] {path});
var javaObjectReferenceForDataFrame = new JvmObjectReference(javaDataFrameReference.ToString());
return new DataFrameIpcProxy(javaObjectReferenceForDataFrame, this);
}
public IDataFrameProxy TextFile(string path, StructType schema, string delimiter)
{
return new DataFrameIpcProxy(
new JvmObjectReference(
SparkCLREnvironment.JvmBridge.CallStaticJavaMethod(
"org.apache.spark.sql.api.csharp.SQLUtils", "loadTextFile",
new object[] {jvmSqlContextReference, path, delimiter, (schema.StructTypeProxy as StructTypeIpcProxy).JvmStructTypeReference}).ToString()
), this
);
}
public IDataFrameProxy TextFile(string path, string delimiter, bool hasHeader, bool inferSchema)
{
return new DataFrameIpcProxy(
new JvmObjectReference(
SparkCLREnvironment.JvmBridge.CallStaticJavaMethod(
"org.apache.spark.sql.api.csharp.SQLUtils", "loadTextFile",
new object[] {jvmSqlContextReference, path, hasHeader, inferSchema}).ToString()
), this
);
}
public IDataFrameProxy Sql(string sqlQuery)
{
var javaDataFrameReference = SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(jvmSqlContextReference, "sql", new object[] { sqlQuery });
var javaObjectReferenceForDataFrame = new JvmObjectReference(javaDataFrameReference.ToString());
return new DataFrameIpcProxy(javaObjectReferenceForDataFrame, this);
}
}
}


@ -0,0 +1,67 @@
// Copyright (c) Microsoft. All rights reserved.
// Licensed under the MIT license. See LICENSE file in the project root for full license information.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using Microsoft.Spark.CSharp.Core;
using Microsoft.Spark.CSharp.Interop;
using Microsoft.Spark.CSharp.Interop.Ipc;
namespace Microsoft.Spark.CSharp.Proxy.Ipc
{
internal class StatusTrackerIpcProxy : IStatusTrackerProxy
{
private JvmObjectReference jvmStatusTrackerReference;
public StatusTrackerIpcProxy(JvmObjectReference jStatusTracker)
{
this.jvmStatusTrackerReference = jStatusTracker;
}
public int[] GetJobIdsForGroup(string jobGroup)
{
return (int[])SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(jvmStatusTrackerReference, "getJobIdsForGroup", new object[] { jobGroup });
}
public int[] GetActiveStageIds()
{
return (int[])SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(jvmStatusTrackerReference, "getActiveStageIds");
}
public int[] GetActiveJobsIds()
{
return (int[])SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(jvmStatusTrackerReference, "getActiveJobIds");
}
public SparkJobInfo GetJobInfo(int jobId)
{
var jobInfoId = SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(jvmStatusTrackerReference, "getJobInfo", new object[] { jobId });
if (jobInfoId == null)
return null;
JvmObjectReference jJobInfo = new JvmObjectReference((string)jobInfoId);
int[] stageIds = (int[])SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(jJobInfo, "stageIds");
string status = SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(jJobInfo, "status").ToString();
return new SparkJobInfo(jobId, stageIds, status);
}
public SparkStageInfo GetStageInfo(int stageId)
{
var stageInfoId = SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(jvmStatusTrackerReference, "getStageInfo", new object[] { stageId });
if (stageInfoId == null)
return null;
JvmObjectReference jStageInfo = new JvmObjectReference((string)stageInfoId);
int currentAttemptId = (int)SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(jStageInfo, "currentAttemptId");
int submissionTime = (int)SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(jStageInfo, "submissionTime");
string name = (string)SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(jStageInfo, "name");
int numTasks = (int)SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(jStageInfo, "numTasks");
int numActiveTasks = (int)SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(jStageInfo, "numActiveTasks");
int numCompletedTasks = (int)SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(jStageInfo, "numCompletedTasks");
int numFailedTasks = (int)SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(jStageInfo, "numFailedTasks");
return new SparkStageInfo(stageId, currentAttemptId, (long)submissionTime, name, numTasks, numActiveTasks, numCompletedTasks, numFailedTasks);
}
}
}
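StatusTrackerIpcProxy is handed out lazily by SparkContextIpcProxy.StatusTracker and simply forwards each call to the JVM-side status tracker. A minimal sketch of polling job progress through the proxy interface; the IStatusTrackerProxy namespace is assumed from the rest of this commit and the helper class below is hypothetical.

```
using System;
using Microsoft.Spark.CSharp.Proxy;

internal class JobMonitorSketch
{
    // hypothetical helper; the tracker would come from SparkContextIpcProxy.StatusTracker
    internal static void Report(IStatusTrackerProxy tracker, string jobGroup)
    {
        int[] jobIds = tracker.GetJobIdsForGroup(jobGroup);
        int[] activeStageIds = tracker.GetActiveStageIds();
        Console.WriteLine("{0} job(s) in group '{1}', {2} active stage(s)", jobIds.Length, jobGroup, activeStageIds.Length);

        foreach (int jobId in jobIds)
        {
            // GetJobInfo returns null when the JVM no longer tracks the job
            var jobInfo = tracker.GetJobInfo(jobId);
            Console.WriteLine("job {0}: {1}", jobId, jobInfo == null ? "no longer tracked" : "tracked");
        }
    }
}
```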


@ -0,0 +1,80 @@
// Copyright (c) Microsoft. All rights reserved.
// Licensed under the MIT license. See LICENSE file in the project root for full license information.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using Microsoft.Spark.CSharp.Interop;
using Microsoft.Spark.CSharp.Interop.Ipc;
namespace Microsoft.Spark.CSharp.Proxy.Ipc
{
internal class StructTypeIpcProxy : IStructTypeProxy
{
private readonly JvmObjectReference jvmStructTypeReference;
internal JvmObjectReference JvmStructTypeReference
{
get { return jvmStructTypeReference; }
}
internal StructTypeIpcProxy(JvmObjectReference jvmStructTypeReference)
{
this.jvmStructTypeReference = jvmStructTypeReference;
}
public List<IStructFieldProxy> GetStructTypeFields()
{
var fieldsReferenceList = SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(jvmStructTypeReference, "fields");
return (fieldsReferenceList as List<JvmObjectReference>).Select(s => new StructFieldIpcProxy(s)).Cast<IStructFieldProxy>().ToList();
}
}
internal class StructDataTypeIpcProxy : IStructDataTypeProxy
{
internal readonly JvmObjectReference jvmStructDataTypeReference;
internal StructDataTypeIpcProxy(JvmObjectReference jvmStructDataTypeReference)
{
this.jvmStructDataTypeReference = jvmStructDataTypeReference;
}
public string GetDataTypeString()
{
return SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(jvmStructDataTypeReference, "toString").ToString();
}
public string GetDataTypeSimpleString()
{
return SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(jvmStructDataTypeReference, "simpleString").ToString();
}
}
internal class StructFieldIpcProxy : IStructFieldProxy
{
private readonly JvmObjectReference jvmStructFieldReference;
internal JvmObjectReference JvmStructFieldReference { get { return jvmStructFieldReference; } }
internal StructFieldIpcProxy(JvmObjectReference jvmStructFieldReference)
{
this.jvmStructFieldReference = jvmStructFieldReference;
}
public string GetStructFieldName()
{
return SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(jvmStructFieldReference, "name").ToString();
}
public IStructDataTypeProxy GetStructFieldDataType()
{
return new StructDataTypeIpcProxy(new JvmObjectReference(SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(jvmStructFieldReference, "dataType").ToString()));
}
public bool GetStructFieldIsNullable()
{
return bool.Parse(SparkCLREnvironment.JvmBridge.CallNonStaticJavaMethod(jvmStructFieldReference, "nullable").ToString());
}
}
}


@ -0,0 +1,62 @@
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
namespace Microsoft.Spark.CSharp.Services
{
/// <summary>
/// This logger service will be used if the C# driver app did not configure a logger.
/// Right now it just prints out the messages to Console
/// </summary>
public class DefaultLoggerService : ILoggerService
{
internal static DefaultLoggerService BootstrappingLoggerService = new DefaultLoggerService(typeof (Type));
public ILoggerService GetLoggerInstance(Type type)
{
return new DefaultLoggerService(type);
}
private Type type;
private DefaultLoggerService(Type t)
{
type = t;
}
public void LogDebug(string message)
{
Log("Debug", message);
}
public void LogInfo(string message)
{
Log("Info", message);
}
public void LogWarn(string message)
{
Log("Warn", message);
}
public void LogFatal(string message)
{
Log("Fatal", message);
}
public void LogError(string message)
{
Log("Error", message);
}
public void LogException(Exception e)
{
Log("Exception", string.Format("{0}{1}{2}", e.Message, Environment.NewLine, e.StackTrace));
}
private void Log(string level, string message)
{
Console.WriteLine("[{0}] [{1}] [{2}] [{3}] {4}", DateTime.UtcNow.ToString("o"), Environment.MachineName, level, type.Name, message);
}
}
}


@ -0,0 +1,19 @@
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
namespace Microsoft.Spark.CSharp.Services
{
public interface ILoggerService
{
ILoggerService GetLoggerInstance(Type type);
void LogDebug(string message);
void LogInfo(string message);
void LogWarn(string message);
void LogFatal(string message);
void LogError(string message);
void LogException(Exception e);
}
}
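Any logger can be plugged into the adapter by implementing this interface and registering it through LoggerServiceFactory.SetLoggerService (see the factory further below). A minimal sketch of a custom implementation; the class name is made up for illustration and is not part of this commit.

```
using System;
using Microsoft.Spark.CSharp.Services;

// hypothetical example implementation of ILoggerService
public class PrefixedConsoleLoggerService : ILoggerService
{
    private readonly Type type;

    public PrefixedConsoleLoggerService() : this(typeof(PrefixedConsoleLoggerService)) { }
    private PrefixedConsoleLoggerService(Type type) { this.type = type; }

    public ILoggerService GetLoggerInstance(Type type) { return new PrefixedConsoleLoggerService(type); }

    public void LogDebug(string message) { Write("Debug", message); }
    public void LogInfo(string message) { Write("Info", message); }
    public void LogWarn(string message) { Write("Warn", message); }
    public void LogFatal(string message) { Write("Fatal", message); }
    public void LogError(string message) { Write("Error", message); }
    public void LogException(Exception e) { Write("Exception", e.ToString()); }

    private void Write(string level, string message)
    {
        Console.WriteLine("[{0}] [{1}] {2}", level, type.Name, message);
    }
}
```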


@ -0,0 +1,99 @@
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
namespace Microsoft.Spark.CSharp.Services
{
//TODO - add log4net NuGet and complete impl
public class Log4NetLoggerService : ILoggerService
{
/*
private ILog logger;
private const string exceptionLogDelimiter = "*******************************************************************************************************************************";
*/
public Log4NetLoggerService(Type type)
{
//logger = log4net.LogManager.GetLogger(type);
throw new NotImplementedException();
}
public void LogDebug(string message)
{
//logger.Debug(message);
throw new NotImplementedException();
}
public void LogInfo(string message)
{
//logger.Info(message);
throw new NotImplementedException();
}
public void LogWarn(string message)
{
//logger.Warn(message);
throw new NotImplementedException();
}
public void LogFatal(string message)
{
//logger.Fatal(message);
throw new NotImplementedException();
}
public void LogError(string message)
{
//logger.Error(message);
throw new NotImplementedException();
}
public void LogException(Exception e)
{
throw new NotImplementedException();
/*
if (e.GetType() != typeof(AggregateException))
{
logger.Error(e.Message);
logger.Error(string.Format("{5}{0}{1}{2}{3}{4}", exceptionLogDelimiter, Environment.NewLine, e.StackTrace, Environment.NewLine, exceptionLogDelimiter, Environment.NewLine));
var innerException = e.InnerException;
if (innerException != null)
{
logger.Error("Inner exception 1 details....");
logger.Error(innerException.Message);
logger.Error(string.Format("{5}{0}{1}{2}{3}{4}", exceptionLogDelimiter, Environment.NewLine, innerException.StackTrace, Environment.NewLine, exceptionLogDelimiter, Environment.NewLine));
var innerException2 = innerException.InnerException;
if (innerException2 != null)
{
logger.Error("Inner exception 2 details....");
logger.Error(innerException2.Message);
logger.Error(string.Format("{5}{0}{1}{2}{3}{4}", exceptionLogDelimiter, Environment.NewLine, innerException2.StackTrace, Environment.NewLine, exceptionLogDelimiter, Environment.NewLine));
}
}
}
else
{
LogError("Aggregate Exception thrown...");
AggregateException aggregateException = e as AggregateException;
int count = 1;
foreach (var innerException in aggregateException.InnerExceptions)
{
logger.Error(string.Format("Aggregate exception #{0} details....", count++));
LogException(innerException);
}
}
*/
}
public ILoggerService GetLoggerInstance(Type type)
{
throw new NotImplementedException();
}
}
}
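As the TODO above notes, this class is a placeholder until the log4net NuGet package is referenced. A minimal sketch, assuming log4net is added to the project, of what the uncommented core might look like; the multi-level inner-exception formatting from the commented block is trimmed to the simple case.

```
// sketch only - assumes a reference to the log4net package; not part of this commit
using System;
using log4net;
using Microsoft.Spark.CSharp.Services;

public class Log4NetLoggerServiceSketch : ILoggerService
{
    private readonly ILog logger;

    public Log4NetLoggerServiceSketch(Type type)
    {
        logger = LogManager.GetLogger(type);
    }

    public ILoggerService GetLoggerInstance(Type type) { return new Log4NetLoggerServiceSketch(type); }
    public void LogDebug(string message) { logger.Debug(message); }
    public void LogInfo(string message) { logger.Info(message); }
    public void LogWarn(string message) { logger.Warn(message); }
    public void LogFatal(string message) { logger.Fatal(message); }
    public void LogError(string message) { logger.Error(message); }
    public void LogException(Exception e) { logger.Error(e.Message, e); }
}
```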


@ -0,0 +1,22 @@
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
namespace Microsoft.Spark.CSharp.Services
{
public class LoggerServiceFactory
{
private static ILoggerService loggerService = DefaultLoggerService.BootstrappingLoggerService;
public static void SetLoggerService(ILoggerService loggerServiceOverride)
{
loggerService = loggerServiceOverride;
}
public static ILoggerService GetLogger(Type type)
{
return loggerService.GetLoggerInstance(type);
}
}
}
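Driver code obtains loggers only through this factory; an override should be registered before the first logger is requested, otherwise the console-based DefaultLoggerService is handed out. A short usage sketch (the driver class name is made up):

```
using Microsoft.Spark.CSharp.Services;

public class MyDriverApp // hypothetical driver class
{
    public static void Main(string[] args)
    {
        // optional: replace the default console logger before any logger is handed out,
        // e.g. LoggerServiceFactory.SetLoggerService(new PrefixedConsoleLoggerService());

        ILoggerService logger = LoggerServiceFactory.GetLogger(typeof(MyDriverApp));
        logger.LogInfo("driver starting");
        logger.LogWarn("messages go to the console unless a logger override was registered");
    }
}
```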


@ -0,0 +1,155 @@
// Copyright (c) Microsoft. All rights reserved.
// Licensed under the MIT license. See LICENSE file in the project root for full license information.
using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.Spark.CSharp.Core;
using Microsoft.Spark.CSharp.Proxy;
using Microsoft.Spark.CSharp.Interop;
namespace Microsoft.Spark.CSharp.Sql
{
public class Column
{
private IColumnProxy columnProxy;
internal IColumnProxy ColumnProxy
{
get
{
return columnProxy;
}
}
internal Column(IColumnProxy columnProxy)
{
this.columnProxy = columnProxy;
}
public static implicit operator Column(string name)
{
return new Column(CSharpSparkEnvironment.SparkContextProxy.CreateColumnFromName(name));
}
public static Column operator !(Column self)
{
return new Column(self.columnProxy.FuncOp("not"));
}
public static Column operator -(Column self)
{
return new Column(self.columnProxy.FuncOp("negate"));
}
public static Column operator +(Column self, object other)
{
return new Column(self.columnProxy.BinOp("plus", other));
}
public static Column operator -(Column self, object other)
{
return new Column(self.columnProxy.BinOp("minus", other));
}
public static Column operator *(Column self, object other)
{
return new Column(self.columnProxy.BinOp("multiply", other));
}
public static Column operator /(Column self, object other)
{
return new Column(self.columnProxy.BinOp("divide", other));
}
public static Column operator %(Column self, object other)
{
return new Column(self.columnProxy.BinOp("mod", other));
}
public static Column operator ==(Column self, object other)
{
return new Column(self.columnProxy.BinOp("equalTo", other));
}
public static Column operator !=(Column self, object other)
{
return new Column(self.columnProxy.BinOp("notEqual", other));
}
public static Column operator <(Column self, object other)
{
return new Column(self.columnProxy.BinOp("lt", other));
}
public static Column operator <=(Column self, object other)
{
return new Column(self.columnProxy.BinOp("leq", other));
}
public static Column operator >=(Column self, object other)
{
return new Column(self.columnProxy.BinOp("geq", other));
}
public static Column operator >(Column self, object other)
{
return new Column(self.columnProxy.BinOp("gt", other));
}
public static Column operator |(Column self, object other)
{
return new Column(self.columnProxy.BinOp("bitwiseOR", other));
}
public static Column operator &(Column self, object other)
{
return new Column(self.columnProxy.BinOp("bitwiseAND", other));
}
public static Column operator ^(Column self, object other)
{
return new Column(self.columnProxy.BinOp("bitwiseXOR", other));
}
/// <summary>
/// SQL like expression.
/// </summary>
/// <param name="literal"></param>
/// <returns></returns>
public Column Like(string literal)
{
return new Column(this.columnProxy.BinOp("like", literal));
}
/// <summary>
/// SQL RLIKE expression (LIKE with Regex).
/// </summary>
/// <param name="literal"></param>
/// <returns></returns>
public Column RLike(string literal)
{
return new Column(this.columnProxy.BinOp("rlike", literal));
}
/// <summary>
/// String starts with the value of the specified column.
/// </summary>
/// <param name="other"></param>
/// <returns></returns>
public Column StartsWith(Column other)
{
return new Column(this.columnProxy.BinOp("startsWith", other));
}
/// <summary>
/// String ends with the value of the specified column.
/// </summary>
/// <param name="other"></param>
/// <returns></returns>
public Column EndsWith(Column other)
{
return new Column(this.columnProxy.BinOp("endsWith", other));
}
}
}
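Each operator or method on Column returns a new Column wrapping the corresponding JVM-side expression, so filter expressions can be composed before they are handed to a DataFrame operation. A short sketch; the DataFrame and its "age"/"name" columns are illustrative assumptions.

```
using Microsoft.Spark.CSharp.Sql;

public class ColumnExpressionSketch
{
    // assumes a DataFrame with an integer "age" column and a string "name" column
    public static Column[] BuildExpressions(DataFrame df)
    {
        Column isAdult = df["age"] >= 18;           // maps to the JVM-side "geq" operator
        Column isMinor = !isAdult;                  // maps to "not"
        Column startsWithA = df["name"].Like("A%"); // maps to "like"
        return new[] { isAdult, isMinor, startsWithA };
    }
}
```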


@ -0,0 +1,369 @@
// Copyright (c) Microsoft. All rights reserved.
// Licensed under the MIT license. See LICENSE file in the project root for full license information.
using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.Spark.CSharp.Core;
using Microsoft.Spark.CSharp.Proxy;
using Microsoft.Spark.CSharp.Interop;
namespace Microsoft.Spark.CSharp.Sql
{
/// <summary>
/// A distributed collection of data organized into named columns.
///
/// See also http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrame
/// </summary>
public class DataFrame
{
private IDataFrameProxy dataFrameProxy;
private readonly SparkContext sparkContext;
private StructType schema;
internal SparkContext SparkContext
{
get
{
return sparkContext;
}
}
internal IDataFrameProxy DataFrameProxy
{
get { return dataFrameProxy; }
}
public StructType Schema
{
get { return schema ?? (schema = new StructType(dataFrameProxy.GetSchema())); }
}
public Column this[string columnName]
{
get
{
return new Column(dataFrameProxy.GetColumn(columnName));
}
}
internal DataFrame(IDataFrameProxy dataFrameProxy, SparkContext sparkContext)
{
this.dataFrameProxy = dataFrameProxy;
this.sparkContext = sparkContext;
}
/// <summary>
/// Registers this DataFrame as a temporary table using the given name. The lifetime of this
/// temporary table is tied to the SqlContext that was used to create this DataFrame.
/// </summary>
/// <param name="tableName">Name of the table</param>
public void RegisterTempTable(string tableName)
{
dataFrameProxy.RegisterTempTable(tableName);
}
/// <summary>
/// Number of rows in the DataFrame
/// </summary>
/// <returns>row count</returns>
public long Count()
{
return dataFrameProxy.Count();
}
/// <summary>
/// Displays rows of the DataFrame in tabular form
/// </summary>
/// <param name="numberOfRows">Number of rows to display - default 20</param>
/// <param name="truncate">Indicates if strings more than 20 characters long will be truncated</param>
public void Show(int numberOfRows = 20, bool truncate = true)
{
Console.WriteLine(dataFrameProxy.GetShowString(numberOfRows, truncate));
}
/// <summary>
/// Prints the schema information of the DataFrame
/// </summary>
public void ShowSchema()
{
List<string> nameTypeList = Schema.Fields.Select(structField => string.Format("{0}:{1}", structField.Name, structField.DataType.SimpleString())).ToList();
Console.WriteLine(string.Join(", ", nameTypeList));
}
public IEnumerable<Row> Collect()
{
throw new NotImplementedException();
}
/// <summary>
/// Converts the DataFrame to RDD of byte[]
/// </summary>
/// <returns>resulting RDD</returns>
public RDD<byte[]> ToRDD() //RDD created using byte representation of GenericRow objects
{
return new RDD<byte[]>(dataFrameProxy.ToRDD(), sparkContext);
}
/// <summary>
/// Returns the content of the DataFrame as RDD of JSON strings
/// </summary>
/// <returns>resulting RDD</returns>
public RDD<string> ToJSON()
{
var stringRddReference = dataFrameProxy.ToJSON();
return new RDD<string>(stringRddReference, sparkContext);
}
/// <summary>
/// Prints the plans (logical and physical) to the console for debugging purposes
/// </summary>
/// <param name="extended">if true prints both query plan and execution plan; otherwise just prints query plan</param>
public void Explain(bool extended = false) //TODO - GetQueryExecution is called in JVM twice if extended = true - fix that
{
Console.WriteLine(dataFrameProxy.GetQueryExecution());
if (extended)
{
Console.WriteLine(dataFrameProxy.GetExecutedPlan());
}
}
/// <summary>
/// Select a list of columns
/// </summary>
/// <param name="columnNames">name of the columns</param>
/// <returns>DataFrame with selected columns</returns>
public DataFrame Select(params string[] columnNames)
{
List<IColumnProxy> columnReferenceList = columnNames.Select(columnName => dataFrameProxy.GetColumn(columnName)).ToList();
IColumnProxy columnReferenceSeq = dataFrameProxy.ToColumnSeq(columnReferenceList);
return new DataFrame(dataFrameProxy.Select(columnReferenceSeq), sparkContext);
}
/// <summary>
/// Select a list of columns
/// </summary>
/// <param name="columnNames"></param>
/// <returns></returns>
public DataFrame Select(params Column[] columns)
{
List<IColumnProxy> columnReferenceList = columns.Select(column => column.ColumnProxy).ToList();
IColumnProxy columnReferenceSeq = dataFrameProxy.ToColumnSeq(columnReferenceList);
return new DataFrame(dataFrameProxy.Select(columnReferenceSeq), sparkContext);
}
/// <summary>
/// Filters rows using the given condition
/// </summary>
/// <param name="condition"></param>
/// <returns></returns>
public DataFrame Where(string condition)
{
return Filter(condition);
}
/// <summary>
/// Filters rows using the given condition
/// </summary>
/// <param name="condition"></param>
/// <returns></returns>
public DataFrame Filter(string condition)
{
return new DataFrame(dataFrameProxy.Filter(condition), sparkContext);
}
/// <summary>
/// Groups the DataFrame using the specified columns, so we can run aggregation on them.
/// </summary>
/// <param name="columnNames"></param>
/// <returns></returns>
public GroupedData GroupBy(params string[] columnNames)
{
if (columnNames.Length == 0)
{
throw new NotSupportedException("Invalid number of columns");
}
string firstColumnName = columnNames[0];
string[] otherColumnNames = columnNames.Skip(1).ToArray();
List<IColumnProxy> otherColumnReferenceList = otherColumnNames.Select(columnName => dataFrameProxy.GetColumn(columnName)).ToList();
IColumnProxy otherColumnReferenceSeq = dataFrameProxy.ToColumnSeq(otherColumnReferenceList);
var scalaGroupedDataReference = dataFrameProxy.GroupBy(firstColumnName, otherColumnReferenceSeq);
return new GroupedData(scalaGroupedDataReference, this);
}
private GroupedData GroupBy()
{
object otherColumnReferenceSeq = dataFrameProxy.ToObjectSeq(new List<object>());
var scalaGroupedDataReference = dataFrameProxy.GroupBy(otherColumnReferenceSeq);
return new GroupedData(scalaGroupedDataReference, this);
}
/// <summary>
/// Aggregates on the DataFrame for the given column-aggregate function mapping
/// </summary>
/// <param name="columnNameAggFunctionDictionary"></param>
/// <returns></returns>
public DataFrame Agg(Dictionary<string, string> columnNameAggFunctionDictionary)
{
return GroupBy().Agg(columnNameAggFunctionDictionary);
}
/// <summary>
/// Join with another DataFrame
/// </summary>
/// <param name="otherDataFrame">DataFrame to join with</param>
/// <returns>Joined DataFrame</returns>
public DataFrame Join(DataFrame otherDataFrame) //cartesian join
{
throw new NotImplementedException();
}
/// <summary>
/// Join with another DataFrame
/// </summary>
/// <param name="otherDataFrame">DataFrame to join with</param>
/// <param name="joinColumnName">name of the column to do an inner equi-join on</param>
/// <returns>Joined DataFrame</returns>
public DataFrame Join(DataFrame otherDataFrame, string joinColumnName) //inner equi join using given column name //need aliasing for self join
{
return new DataFrame(
dataFrameProxy.Join(otherDataFrame.dataFrameProxy, joinColumnName),
sparkContext);
}
/// <summary>
/// Join with another DataFrame
/// </summary>
/// <param name="otherDataFrame">DataFrame to join with</param>
/// <param name="joinColumnNames">names of the columns to do an inner equi-join on</param>
/// <returns>Joined DataFrame</returns>
public DataFrame Join(DataFrame otherDataFrame, string[] joinColumnNames) //inner equi join using given column name //need aliasing for self join
{
return new DataFrame(
dataFrameProxy.Join(otherDataFrame.dataFrameProxy, joinColumnNames),
sparkContext);
}
/// <summary>
/// Join with another DataFrame
/// </summary>
/// <param name="otherDataFrame">DataFrame to join with</param>
/// <param name="joinExpression">join expression built from Column objects</param>
/// <param name="joinType">type of join; inner join is used when null</param>
/// <returns>Joined DataFrame</returns>
public DataFrame Join(DataFrame otherDataFrame, Column joinExpression, JoinType joinType = null)
{
if (joinType == null)
{
joinType = JoinType.Inner;
}
return
new DataFrame(dataFrameProxy.Join(otherDataFrame.dataFrameProxy, joinExpression.ColumnProxy, joinType.Value), sparkContext);
}
}
//TODO - complete impl
public class Row
{
}
public class JoinType
{
public string Value { get; private set; }
private JoinType(string value)
{
Value = value;
}
private static readonly JoinType InnerJoinType = new JoinType("inner");
private static readonly JoinType OuterJoinType = new JoinType("outer");
private static readonly JoinType LeftOuterJoinType = new JoinType("left_outer");
private static readonly JoinType RightOuterJoinType = new JoinType("right_outer");
private static readonly JoinType LeftSemiJoinType = new JoinType("leftsemi");
public static JoinType Inner
{
get
{
return InnerJoinType;
}
}
public static JoinType Outer
{
get
{
return OuterJoinType;
}
}
public static JoinType LeftOuter
{
get
{
return LeftOuterJoinType;
}
}
public static JoinType RightOuter
{
get
{
return RightOuterJoinType;
}
}
public static JoinType LeftSemi
{
get
{
return LeftSemiJoinType;
}
}
}
public class Column
{
private IColumnProxy columnProxy;
internal IColumnProxy ColumnProxy
{
get
{
return columnProxy;
}
}
internal Column(IColumnProxy columnProxy)
{
this.columnProxy = columnProxy;
}
public static Column operator ==(Column firstColumn, Column secondColumn)
{
return new Column(firstColumn.columnProxy.EqualsOperator(secondColumn.columnProxy));
}
public static Column operator !=(Column firstColumn, Column secondColumn)
{
throw new NotImplementedException();
}
}
public class GroupedData
{
private IGroupedDataProxy groupedDataProxy;
private DataFrame dataFrame;
internal GroupedData(IGroupedDataProxy groupedDataProxy, DataFrame dataFrame)
{
this.groupedDataProxy = groupedDataProxy;
this.dataFrame = dataFrame;
}
public DataFrame Agg(Dictionary<string, string> columnNameAggFunctionDictionary)
{
return new DataFrame(dataFrame.DataFrameProxy.Agg(groupedDataProxy, columnNameAggFunctionDictionary), dataFrame.SparkContext);
}
}
}
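Putting the pieces together, grouping, aggregation and equi-joins can be expressed directly on DataFrames. A short sketch, assuming two DataFrames already loaded through a SqlContext; the column names ("customerId", "region", "amount") are illustrative.

```
using System.Collections.Generic;
using Microsoft.Spark.CSharp.Sql;

public class DataFrameUsageSketch
{
    public static DataFrame RegionTotals(DataFrame orders, DataFrame customers)
    {
        // inner equi-join on the shared column
        DataFrame joined = orders.Join(customers, "customerId");

        // per-region sum of the order amount
        DataFrame totals = joined
            .GroupBy("region")
            .Agg(new Dictionary<string, string> { { "amount", "sum" } });

        totals.Show(10); // prints up to 10 rows in tabular form
        return totals;
    }
}
```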


@ -0,0 +1,552 @@
// Copyright (c) Microsoft. All rights reserved.
// Licensed under the MIT license. See LICENSE file in the project root for full license information.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using Microsoft.Spark.CSharp.Core;
using Microsoft.Spark.CSharp.Interop;
namespace Microsoft.Spark.CSharp.Sql
{
/// <summary>
/// not applicable yet - it is for UDF to be used in DataFrame
/// </summary>
public class Functions
{
#region functions
public static Column Lit(object column)
{
return new Column(CSharpSparkEnvironment.SparkContextProxy.CreateFunction("lit", column));
}
public static Column Col(string colName)
{
return new Column(CSharpSparkEnvironment.SparkContextProxy.CreateFunction("col", colName));
}
public static Column Column(string colName)
{
return new Column(CSharpSparkEnvironment.SparkContextProxy.CreateFunction("column", colName));
}
public static Column Asc(string columnName)
{
return new Column(CSharpSparkEnvironment.SparkContextProxy.CreateFunction("asc", columnName));
}
public static Column Desc(string columnName)
{
return new Column(CSharpSparkEnvironment.SparkContextProxy.CreateFunction("desc", columnName));
}
public static Column Upper(Column column)
{
return new Column(CSharpSparkEnvironment.SparkContextProxy.CreateFunction("upper", column.ColumnProxy));
}
public static Column Lower(Column column)
{
return new Column(CSharpSparkEnvironment.SparkContextProxy.CreateFunction("lower", column.ColumnProxy));
}
public static Column Sqrt(Column column)
{
return new Column(CSharpSparkEnvironment.SparkContextProxy.CreateFunction("sqrt", column.ColumnProxy));
}
public static Column Abs(Column column)
{
return new Column(CSharpSparkEnvironment.SparkContextProxy.CreateFunction("abs", column.ColumnProxy));
}
public static Column Max(Column column)
{
return new Column(CSharpSparkEnvironment.SparkContextProxy.CreateFunction("max", column.ColumnProxy));
}
public static Column Min(Column column)
{
return new Column(CSharpSparkEnvironment.SparkContextProxy.CreateFunction("min", column.ColumnProxy));
}
public static Column First(Column column)
{
return new Column(CSharpSparkEnvironment.SparkContextProxy.CreateFunction("first", column.ColumnProxy));
}
public static Column Last(Column column)
{
return new Column(CSharpSparkEnvironment.SparkContextProxy.CreateFunction("last", column.ColumnProxy));
}
public static Column Count(Column column)
{
return new Column(CSharpSparkEnvironment.SparkContextProxy.CreateFunction("count", column.ColumnProxy));
}
public static Column Sum(Column column)
{
return new Column(CSharpSparkEnvironment.SparkContextProxy.CreateFunction("sum", column.ColumnProxy));
}
public static Column Avg(Column column)
{
return new Column(CSharpSparkEnvironment.SparkContextProxy.CreateFunction("avg", column.ColumnProxy));
}
public static Column Mean(Column column)
{
return new Column(CSharpSparkEnvironment.SparkContextProxy.CreateFunction("mean", column.ColumnProxy));
}
public static Column SumDistinct(Column column)
{
return new Column(CSharpSparkEnvironment.SparkContextProxy.CreateFunction("sumDistinct", column.ColumnProxy));
}
public static Column Array(params Column[] columns)
{
return new Column(CSharpSparkEnvironment.SparkContextProxy.CreateFunction("array", columns.Select(x => x.ColumnProxy).ToArray()));
}
public static Column Coalesce(params Column[] columns)
{
return new Column(CSharpSparkEnvironment.SparkContextProxy.CreateFunction("coalesce", columns.Select(x => x.ColumnProxy).ToArray()));
}
public static Column CountDistinct(params Column[] columns)
{
return new Column(CSharpSparkEnvironment.SparkContextProxy.CreateFunction("countDistinct", columns.Select(x => x.ColumnProxy).ToArray()));
}
public static Column Struct(params Column[] columns)
{
return new Column(CSharpSparkEnvironment.SparkContextProxy.CreateFunction("struct", columns.Select(x => x.ColumnProxy).ToArray()));
}
public static Column ApproxCountDistinct(Column column)
{
return new Column(CSharpSparkEnvironment.SparkContextProxy.CreateFunction("approxCountDistinct", column.ColumnProxy));
}
public static Column Explode(Column column)
{
return new Column(CSharpSparkEnvironment.SparkContextProxy.CreateFunction("explode", column.ColumnProxy));
}
public static Column Rand(long seed)
{
return new Column(CSharpSparkEnvironment.SparkContextProxy.CreateFunction("rand", seed));
}
public static Column Randn(long seed)
{
return new Column(CSharpSparkEnvironment.SparkContextProxy.CreateFunction("randn", seed));
}
public static Column Ntile(int n)
{
return new Column(CSharpSparkEnvironment.SparkContextProxy.CreateFunction("ntile", n));
}
#endregion
#region unary math functions
public static Column Acos(Column column)
{
return new Column(CSharpSparkEnvironment.SparkContextProxy.CreateFunction("acos", column.ColumnProxy));
}
public static Column Asin(Column column)
{
return new Column(CSharpSparkEnvironment.SparkContextProxy.CreateFunction("asin", column.ColumnProxy));
}
public static Column Atan(Column column)
{
return new Column(CSharpSparkEnvironment.SparkContextProxy.CreateFunction("atan", column.ColumnProxy));
}
public static Column Cbrt(Column column)
{
return new Column(CSharpSparkEnvironment.SparkContextProxy.CreateFunction("cbrt", column.ColumnProxy));
}
public static Column Ceil(Column column)
{
return new Column(CSharpSparkEnvironment.SparkContextProxy.CreateFunction("ceil", column.ColumnProxy));
}
public static Column Cos(Column column)
{
return new Column(CSharpSparkEnvironment.SparkContextProxy.CreateFunction("cos", column.ColumnProxy));
}
public static Column Cosh(Column column)
{
return new Column(CSharpSparkEnvironment.SparkContextProxy.CreateFunction("cosh", column.ColumnProxy));
}
/// <summary>
/// Computes the exponential of the given value.
/// </summary>
/// <param name="column"></param>
/// <returns></returns>
public static Column Exp(Column column)
{
return new Column(CSharpSparkEnvironment.SparkContextProxy.CreateFunction("exp", column.ColumnProxy));
}
/// <summary>
/// Computes the exponential of the given value minus one.
/// </summary>
/// <param name="column"></param>
/// <returns></returns>
public static Column Expm1(Column column)
{
return new Column(CSharpSparkEnvironment.SparkContextProxy.CreateFunction("expm1", column.ColumnProxy));
}
public static Column Floor(Column column)
{
return new Column(CSharpSparkEnvironment.SparkContextProxy.CreateFunction("floor", column.ColumnProxy));
}
public static Column Log(Column column)
{
return new Column(CSharpSparkEnvironment.SparkContextProxy.CreateFunction("log", column.ColumnProxy));
}
public static Column Log10(Column column)
{
return new Column(CSharpSparkEnvironment.SparkContextProxy.CreateFunction("log10", column.ColumnProxy));
}
public static Column Log1p(Column column)
{
return new Column(CSharpSparkEnvironment.SparkContextProxy.CreateFunction("log1p", column.ColumnProxy));
}
public static Column Rint(Column column)
{
return new Column(CSharpSparkEnvironment.SparkContextProxy.CreateFunction("rint", column.ColumnProxy));
}
public static Column Signum(Column column)
{
return new Column(CSharpSparkEnvironment.SparkContextProxy.CreateFunction("signum", column.ColumnProxy));
}
public static Column Sin(Column column)
{
return new Column(CSharpSparkEnvironment.SparkContextProxy.CreateFunction("sin", column.ColumnProxy));
}
public static Column Sinh(Column column)
{
return new Column(CSharpSparkEnvironment.SparkContextProxy.CreateFunction("sinh", column.ColumnProxy));
}
public static Column Tan(Column column)
{
return new Column(CSharpSparkEnvironment.SparkContextProxy.CreateFunction("tan", column.ColumnProxy));
}
public static Column Tanh(Column column)
{
return new Column(CSharpSparkEnvironment.SparkContextProxy.CreateFunction("tanh", column.ColumnProxy));
}
public static Column ToDegrees(Column column)
{
return new Column(CSharpSparkEnvironment.SparkContextProxy.CreateFunction("toDegrees", column.ColumnProxy));
}
public static Column ToRadians(Column column)
{
return new Column(CSharpSparkEnvironment.SparkContextProxy.CreateFunction("toRadians", column.ColumnProxy));
}
public static Column BitwiseNOT(Column column)
{
return new Column(CSharpSparkEnvironment.SparkContextProxy.CreateFunction("bitwiseNOT", column.ColumnProxy));
}
#endregion
#region binary math functions
public static Column Atan2(Column leftColumn, Column rightColumn)
{
return new Column(CSharpSparkEnvironment.SparkContextProxy.CreateBinaryMathFunction("atan2", leftColumn.ColumnProxy, rightColumn.ColumnProxy));
}
public static Column Hypot(Column leftColumn, Column rightColumn)
{
return new Column(CSharpSparkEnvironment.SparkContextProxy.CreateBinaryMathFunction("hypot", leftColumn.ColumnProxy, rightColumn.ColumnProxy));
}
public static Column Hypot(Column leftColumn, double rightValue)
{
return new Column(CSharpSparkEnvironment.SparkContextProxy.CreateBinaryMathFunction("hypot", leftColumn.ColumnProxy, rightValue));
}
public static Column Hypot(double leftValue, Column rightColumn)
{
return new Column(CSharpSparkEnvironment.SparkContextProxy.CreateBinaryMathFunction("hypot", leftValue, rightColumn.ColumnProxy));
}
public static Column Pow(Column leftColumn, Column rightColumn)
{
return new Column(CSharpSparkEnvironment.SparkContextProxy.CreateBinaryMathFunction("pow", leftColumn.ColumnProxy, rightColumn.ColumnProxy));
}
public static Column Pow(Column leftColumn, double rightValue)
{
return new Column(CSharpSparkEnvironment.SparkContextProxy.CreateBinaryMathFunction("pow", leftColumn.ColumnProxy, rightValue));
}
public static Column Pow(double leftValue, Column rightColumn)
{
return new Column(CSharpSparkEnvironment.SparkContextProxy.CreateBinaryMathFunction("pow", leftValue, rightColumn.ColumnProxy));
}
public static Column ApproxCountDistinct(Column column, double rsd)
{
return new Column(CSharpSparkEnvironment.SparkContextProxy.CreateBinaryMathFunction("approxCountDistinct", column.ColumnProxy, rsd));
}
public static Column When(Column condition, object value)
{
return new Column(CSharpSparkEnvironment.SparkContextProxy.CreateBinaryMathFunction("when", condition.ColumnProxy, value));
}
public static Column Lag(Column column, int offset)
{
return new Column(CSharpSparkEnvironment.SparkContextProxy.CreateBinaryMathFunction("lag", column.ColumnProxy, offset));
}
public static Column Lead(Column column, int offset)
{
return new Column(CSharpSparkEnvironment.SparkContextProxy.CreateBinaryMathFunction("lead", column.ColumnProxy, offset));
}
#endregion
#region window functions
public static Column rowNumber()
{
return new Column(CSharpSparkEnvironment.SparkContextProxy.CreateWindowFunction("rowNumber"));
}
public static Column DenseRank()
{
return new Column(CSharpSparkEnvironment.SparkContextProxy.CreateWindowFunction("denseRank"));
}
public static Column Rank()
{
return new Column(CSharpSparkEnvironment.SparkContextProxy.CreateWindowFunction("rank"));
}
public static Column CumeDist()
{
return new Column(CSharpSparkEnvironment.SparkContextProxy.CreateWindowFunction("cumeDist"));
}
public static Column PercentRank()
{
return new Column(CSharpSparkEnvironment.SparkContextProxy.CreateWindowFunction("percentRank"));
}
public static Column MonotonicallyIncreasingId()
{
return new Column(CSharpSparkEnvironment.SparkContextProxy.CreateWindowFunction("monotonicallyIncreasingId"));
}
public static Column SparkPartitionId()
{
return new Column(CSharpSparkEnvironment.SparkContextProxy.CreateWindowFunction("sparkPartitionId"));
}
public static Column Rand()
{
return new Column(CSharpSparkEnvironment.SparkContextProxy.CreateWindowFunction("rand"));
}
public static Column Randn()
{
return new Column(CSharpSparkEnvironment.SparkContextProxy.CreateWindowFunction("randn"));
}
#endregion
#region udf
public static UserDefinedFunction<RT> Udf<RT>(Func<RT> f)
{
return new UserDefinedFunction<RT>(new UdfHelper<RT>(f).Execute);
}
public static UserDefinedFunction<RT> Udf<RT, A1>(Func<A1, RT> f)
{
return new UserDefinedFunction<RT>(new UdfHelper<RT, A1>(f).Execute);
}
public static UserDefinedFunction<RT> Udf<RT, A1, A2>(Func<A1, A2, RT> f)
{
return new UserDefinedFunction<RT>(new UdfHelper<RT, A1, A2>(f).Execute);
}
public static UserDefinedFunction<RT> Udf<RT, A1, A2, A3>(Func<A1, A2, A3, RT> f)
{
return new UserDefinedFunction<RT>(new UdfHelper<RT, A1, A2, A3>(f).Execute);
}
public static UserDefinedFunction<RT> Udf<RT, A1, A2, A3, A4>(Func<A1, A2, A3, A4, RT> f)
{
return new UserDefinedFunction<RT>(new UdfHelper<RT, A1, A2, A3, A4>(f).Execute);
}
public static UserDefinedFunction<RT> Udf<RT, A1, A2, A3, A4, A5>(Func<A1, A2, A3, A4, A5, RT> f)
{
return new UserDefinedFunction<RT>(new UdfHelper<RT, A1, A2, A3, A4, A5>(f).Execute);
}
public static UserDefinedFunction<RT> Udf<RT, A1, A2, A3, A4, A5, A6>(Func<A1, A2, A3, A4, A5, A6, RT> f)
{
return new UserDefinedFunction<RT>(new UdfHelper<RT, A1, A2, A3, A4, A5, A6>(f).Execute);
}
public static UserDefinedFunction<RT> Udf<RT, A1, A2, A3, A4, A5, A6, A7>(Func<A1, A2, A3, A4, A5, A6, A7, RT> f)
{
return new UserDefinedFunction<RT>(new UdfHelper<RT, A1, A2, A3, A4, A5, A6, A7>(f).Execute);
}
public static UserDefinedFunction<RT> Udf<RT, A1, A2, A3, A4, A5, A6, A7, A8>(Func<A1, A2, A3, A4, A5, A6, A7, A8, RT> f)
{
return new UserDefinedFunction<RT>(new UdfHelper<RT, A1, A2, A3, A4, A5, A6, A7, A8>(f).Execute);
}
public static UserDefinedFunction<RT> Udf<RT, A1, A2, A3, A4, A5, A6, A7, A8, A9>(Func<A1, A2, A3, A4, A5, A6, A7, A8, A9, RT> f)
{
return new UserDefinedFunction<RT>(new UdfHelper<RT, A1, A2, A3, A4, A5, A6, A7, A8, A9>(f).Execute);
}
public static UserDefinedFunction<RT> Udf<RT, A1, A2, A3, A4, A5, A6, A7, A8, A9, A10>(Func<A1, A2, A3, A4, A5, A6, A7, A8, A9, A10, RT> f)
{
return new UserDefinedFunction<RT>(new UdfHelper<RT, A1, A2, A3, A4, A5, A6, A7, A8, A9, A10>(f).Execute);
}
#endregion
}
/// <summary>
/// only used in SqlContext.RegisterFunction for now
/// </summary>
/// <typeparam name="RT"></typeparam>
[Serializable]
internal class UdfHelper<RT>
{
private readonly Func<RT> func;
internal UdfHelper(Func<RT> f)
{
this.func = f;
}
internal IEnumerable<dynamic> Execute(int pid, IEnumerable<dynamic> input)
{
return input.Select(a => func()).Cast<dynamic>();
}
}
[Serializable]
internal class UdfHelper<RT, A1>
{
private readonly Func<A1, RT> func;
internal UdfHelper(Func<A1, RT> f)
{
this.func = f;
}
internal IEnumerable<dynamic> Execute(int pid, IEnumerable<dynamic> input)
{
return input.Select(a => func((A1)(a[0]))).Cast<dynamic>();
}
}
[Serializable]
internal class UdfHelper<RT, A1, A2>
{
private readonly Func<A1, A2, RT> func;
internal UdfHelper(Func<A1, A2, RT> f)
{
this.func = f;
}
internal IEnumerable<dynamic> Execute(int pid, IEnumerable<dynamic> input)
{
return input.Select(a => func((A1)(a[0]), (A2)(a[1]))).Cast<dynamic>();
}
}
[Serializable]
internal class UdfHelper<RT, A1, A2, A3>
{
private readonly Func<A1, A2, A3, RT> func;
internal UdfHelper(Func<A1, A2, A3, RT> f)
{
this.func = f;
}
internal IEnumerable<dynamic> Execute(int pid, IEnumerable<dynamic> input)
{
return input.Select(a => func((A1)(a[0]), (A2)(a[1]), (A3)(a[2]))).Cast<dynamic>();
}
}
[Serializable]
internal class UdfHelper<RT, A1, A2, A3, A4>
{
private readonly Func<A1, A2, A3, A4, RT> func;
internal UdfHelper(Func<A1, A2, A3, A4, RT> f)
{
this.func = f;
}
internal IEnumerable<dynamic> Execute(int pid, IEnumerable<dynamic> input)
{
return input.Select(a => func((A1)(a[0]), (A2)(a[1]), (A3)(a[2]), (A4)(a[3]))).Cast<dynamic>();
}
}
[Serializable]
internal class UdfHelper<RT, A1, A2, A3, A4, A5>
{
private readonly Func<A1, A2, A3, A4, A5, RT> func;
internal UdfHelper(Func<A1, A2, A3, A4, A5, RT> f)
{
this.func = f;
}
internal IEnumerable<dynamic> Execute(int pid, IEnumerable<dynamic> input)
{
return input.Select(a => func((A1)(a[0]), (A2)(a[1]), (A3)(a[2]), (A4)(a[3]), (A5)(a[4]))).Cast<dynamic>();
}
}
[Serializable]
internal class UdfHelper<RT, A1, A2, A3, A4, A5, A6>
{
private readonly Func<A1, A2, A3, A4, A5, A6, RT> func;
internal UdfHelper(Func<A1, A2, A3, A4, A5, A6, RT> f)
{
this.func = f;
}
internal IEnumerable<dynamic> Execute(int pid, IEnumerable<dynamic> input)
{
return input.Select(a => func((A1)(a[0]), (A2)(a[1]), (A3)(a[2]), (A4)(a[3]), (A5)(a[4]), (A6)(a[5]))).Cast<dynamic>();
}
}
[Serializable]
internal class UdfHelper<RT, A1, A2, A3, A4, A5, A6, A7>
{
private readonly Func<A1, A2, A3, A4, A5, A6, A7, RT> func;
internal UdfHelper(Func<A1, A2, A3, A4, A5, A6, A7, RT> f)
{
this.func = f;
}
internal IEnumerable<dynamic> Execute(int pid, IEnumerable<dynamic> input)
{
return input.Select(a => func((A1)(a[0]), (A2)(a[1]), (A3)(a[2]), (A4)(a[3]), (A5)(a[4]), (A6)(a[5]), (A7)(a[6]))).Cast<dynamic>();
}
}
[Serializable]
internal class UdfHelper<RT, A1, A2, A3, A4, A5, A6, A7, A8>
{
private readonly Func<A1, A2, A3, A4, A5, A6, A7, A8, RT> func;
internal UdfHelper(Func<A1, A2, A3, A4, A5, A6, A7, A8, RT> f)
{
this.func = f;
}
internal IEnumerable<dynamic> Execute(int pid, IEnumerable<dynamic> input)
{
return input.Select(a => func((A1)(a[0]), (A2)(a[1]), (A3)(a[2]), (A4)(a[3]), (A5)(a[4]), (A6)(a[5]), (A7)(a[6]), (A8)(a[7]))).Cast<dynamic>();
}
}
[Serializable]
internal class UdfHelper<RT, A1, A2, A3, A4, A5, A6, A7, A8, A9>
{
private readonly Func<A1, A2, A3, A4, A5, A6, A7, A8, A9, RT> func;
internal UdfHelper(Func<A1, A2, A3, A4, A5, A6, A7, A8, A9, RT> f)
{
this.func = f;
}
internal IEnumerable<dynamic> Execute(int pid, IEnumerable<dynamic> input)
{
return input.Select(a => func((A1)(a[0]), (A2)(a[1]), (A3)(a[2]), (A4)(a[3]), (A5)(a[4]), (A6)(a[5]), (A7)(a[6]), (A8)(a[7]), (A9)(a[8]))).Cast<dynamic>();
}
}
[Serializable]
internal class UdfHelper<RT, A1, A2, A3, A4, A5, A6, A7, A8, A9, A10>
{
private readonly Func<A1, A2, A3, A4, A5, A6, A7, A8, A9, A10, RT> func;
internal UdfHelper(Func<A1, A2, A3, A4, A5, A6, A7, A8, A9, A10, RT> f)
{
this.func = f;
}
internal IEnumerable<dynamic> Execute(int pid, IEnumerable<dynamic> input)
{
return input.Select(a => func((A1)(a[0]), (A2)(a[1]), (A3)(a[2]), (A4)(a[3]), (A5)(a[4]), (A6)(a[5]), (A7)(a[6]), (A8)(a[7]), (A9)(a[8]), (A10)(a[9]))).Cast<dynamic>();
}
}
public class UserDefinedFunction<RT>
{
private Func<int, IEnumerable<dynamic>, IEnumerable<dynamic>> func;
private string name;
internal UserDefinedFunction(Func<int, IEnumerable<dynamic>, IEnumerable<dynamic>> f, string name = null)
{
this.func = f;
this.name = name;
}
private void CreateJavaUdf()
{
CSharpSparkEnvironment.SparkContextProxy.CreateUserDefinedCSharpFunction(func.GetType().Name, SparkContext.BuildCommand(func, SerializedMode.Row, SerializedMode.Row), "StringType");
}
}
}
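The non-UDF helpers in Functions resolve to static methods on org.apache.spark.sql.functions and return Column objects that DataFrame.Select can take. A short sketch of the intended usage; the column names are illustrative, and the UDF helpers above are left out since the class is marked as not fully wired up yet.

```
using Microsoft.Spark.CSharp.Sql;

public class FunctionsUsageSketch
{
    // assumes a DataFrame with a string "name" column and a numeric "salary" column
    public static DataFrame Project(DataFrame df)
    {
        Column upperName = Functions.Upper(df["name"]);
        Column absSalary = Functions.Abs(df["salary"]);
        return df.Select(upperName, absSalary);
    }
}
```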


@ -0,0 +1,109 @@
// Copyright (c) Microsoft. All rights reserved.
// Licensed under the MIT license. See LICENSE file in the project root for full license information.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using Microsoft.Spark.CSharp.Core;
using Microsoft.Spark.CSharp.Interop;
using Microsoft.Spark.CSharp.Proxy;
namespace Microsoft.Spark.CSharp.Sql
{
/// <summary>
/// The entry point for working with structured data (rows and columns) in Spark.
/// Allows the creation of [[DataFrame]] objects as well as the execution of SQL queries.
/// </summary>
public class SqlContext
{
private ISqlContextProxy sqlContextProxy;
private SparkContext sparkContext;
public SqlContext(SparkContext sparkContext)
{
this.sparkContext = sparkContext;
SetSqlContextProxy();
sqlContextProxy.CreateSqlContext(sparkContext.SparkContextProxy);
}
private void SetSqlContextProxy()
{
sqlContextProxy = SparkCLREnvironment.SqlContextProxy;
}
/// <summary>
/// Loads a DataFrame from the given source path using the specified schema and options
/// </summary>
/// <param name="path"></param>
/// <param name="schema"></param>
/// <param name="options"></param>
/// <returns></returns>
public DataFrame ReadDataFrame(string path, StructType schema, Dictionary<string, string> options)
{
return new DataFrame(sqlContextProxy.ReaDataFrame(path, schema, options), sparkContext);
}
public DataFrame CreateDataFrame(RDD<byte[]> rdd, StructType schema)
{
throw new NotImplementedException();
}
/// <summary>
/// Executes a SQL query using Spark, returning the result as a DataFrame. The dialect that is used for SQL parsing can be configured with 'spark.sql.dialect'
/// </summary>
/// <param name="sqlQuery"></param>
/// <returns></returns>
public DataFrame Sql(string sqlQuery)
{
return new DataFrame(sqlContextProxy.Sql(sqlQuery), sparkContext);
}
/// <summary>
/// Loads a JSON file (one object per line), returning the result as a DataFrame
/// It goes through the entire dataset once to determine the schema.
/// </summary>
/// <param name="path">path to JSON file</param>
/// <returns></returns>
public DataFrame JsonFile(string path)
{
return new DataFrame(sqlContextProxy.JsonFile(path), sparkContext);
}
/// <summary>
/// Loads a JSON file (one object per line) and applies the given schema
/// </summary>
/// <param name="path">path to JSON file</param>
/// <param name="schema">schema to use</param>
/// <returns></returns>
public DataFrame JsonFile(string path, StructType schema)
{
throw new NotImplementedException();
}
/// <summary>
/// Loads text file with the specific column delimited using the given schema
/// </summary>
/// <param name="path">path to text file</param>
/// <param name="schema">schema to use</param>
/// <param name="delimiter">delimiter to use</param>
/// <returns></returns>
public DataFrame TextFile(string path, StructType schema, string delimiter =",")
{
return new DataFrame(sqlContextProxy.TextFile(path, schema, delimiter), sparkContext);
}
/// <summary>
/// Loads a text file (one object per line), returning the result as a DataFrame
/// </summary>
/// <param name="path">path to text file</param>
/// <param name="delimiter">delimited to use</param>
/// <param name="hasHeader">indicates if the text file has a header row</param>
/// <param name="inferSchema">indicates if every row has to be read to infer the schema; if false, columns will be strings</param>
/// <returns></returns>
public DataFrame TextFile(string path, string delimiter = ",", bool hasHeader = false, bool inferSchema = false)
{
return new DataFrame(sqlContextProxy.TextFile(path, delimiter, hasHeader, inferSchema), sparkContext);
}
}
}
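A short end-to-end sketch of the SqlContext entry point, assuming a SparkContext has already been created; the HDFS path and the file contents are illustrative.

```
using Microsoft.Spark.CSharp.Core;
using Microsoft.Spark.CSharp.Sql;

public class SqlContextUsageSketch
{
    public static void Run(SparkContext sparkContext)
    {
        var sqlContext = new SqlContext(sparkContext);

        // one JSON object per line; the schema is inferred by scanning the dataset once
        DataFrame people = sqlContext.JsonFile(@"hdfs://path/to/people.json");

        people.ShowSchema(); // prints a "column:type, ..." summary of the inferred schema
        people.Show(5);      // prints the first 5 rows in tabular form
    }
}
```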


@ -0,0 +1,129 @@
// Copyright (c) Microsoft. All rights reserved.
// Licensed under the MIT license. See LICENSE file in the project root for full license information.
using System.Collections.Generic;
using System.Linq;
using Microsoft.Spark.CSharp.Interop;
using Microsoft.Spark.CSharp.Proxy;
using Microsoft.Spark.CSharp.Proxy.Ipc;
namespace Microsoft.Spark.CSharp.Sql
{
/// <summary>
/// Schema of DataFrame
/// </summary>
public class StructType
{
private IStructTypeProxy structTypeProxy;
internal IStructTypeProxy StructTypeProxy
{
get
{
return structTypeProxy;
}
}
public List<StructField> Fields //TODO - avoid calling method every time
{
get
{
var structTypeFieldJvmObjectReferenceList =
structTypeProxy.GetStructTypeFields();
var structFieldList = new List<StructField>(structTypeFieldJvmObjectReferenceList.Count);
structFieldList.AddRange(
structTypeFieldJvmObjectReferenceList.Select(
structTypeFieldJvmObjectReference => new StructField(structTypeFieldJvmObjectReference)));
return structFieldList;
}
}
internal StructType(IStructTypeProxy structTypeProxy)
{
this.structTypeProxy = structTypeProxy;
}
public static StructType CreateStructType(List<StructField> structFields)
{
return SparkCLREnvironment.SqlContextProxy.CreateStructType(structFields);
}
}
/// <summary>
/// Schema for DataFrame column
/// </summary>
public class StructField
{
private IStructFieldProxy structFieldProxy;
internal IStructFieldProxy StructFieldProxy
{
get
{
return structFieldProxy;
}
}
public string Name
{
get
{
return structFieldProxy.GetStructFieldName();
}
}
public DataType DataType
{
get
{
return new DataType(structFieldProxy.GetStructFieldDataType());
}
}
public bool IsNullable
{
get
{
return structFieldProxy.GetStructFieldIsNullable();
}
}
internal StructField(IStructFieldProxy structFieldProxy)
{
this.structFieldProxy = structFieldProxy;
}
public static StructField CreateStructField(string name, string dataType, bool isNullable)
{
return SparkCLREnvironment.SqlContextProxy.CreateStructField(name, dataType, isNullable);
}
}
public class DataType
{
private IStructDataTypeProxy structDataTypeProxy;
internal IStructDataTypeProxy StructDataTypeProxy
{
get
{
return structDataTypeProxy;
}
}
internal DataType(IStructDataTypeProxy structDataTypeProxy)
{
this.structDataTypeProxy = structDataTypeProxy;
}
public override string ToString()
{
return structDataTypeProxy.GetDataTypeString();
}
public string SimpleString()
{
return structDataTypeProxy.GetDataTypeSimpleString();
}
}
}


@ -0,0 +1,17 @@
// Copyright (c) Microsoft. All rights reserved.
// Licensed under the MIT license. See LICENSE file in the project root for full license information.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
namespace Microsoft.Spark.CSharp.Streaming
{
//TODO - complete the impl
public class DStream<T>
{
}
}


@ -0,0 +1,16 @@
// Copyright (c) Microsoft. All rights reserved.
// Licensed under the MIT license. See LICENSE file in the project root for full license information.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
namespace Microsoft.Spark.CSharp.Streaming
{
//TODO - complete the impl
public class Kafka
{
}
}


@ -0,0 +1,59 @@
// Copyright (c) Microsoft. All rights reserved.
// Licensed under the MIT license. See LICENSE file in the project root for full license information.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using Microsoft.Spark.CSharp.Core;
namespace Microsoft.Spark.CSharp.Streaming
{
/**
* Main entry point for Spark Streaming functionality. It provides methods used to create
* [[org.apache.spark.streaming.dstream.DStream]]s from various input sources. It can be either
* created by providing a Spark master URL and an appName, or from an org.apache.spark.SparkConf
* configuration (see core Spark documentation), or from an existing org.apache.spark.SparkContext.
* The associated SparkContext can be accessed using `context.sparkContext`. After
* creating and transforming DStreams, the streaming computation can be started and stopped
* using `context.start()` and `context.stop()`, respectively.
* `context.awaitTermination()` allows the current thread to wait for the termination
* of the context by `stop()` or by an exception.
*/
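// Usage sketch (hypothetical; the members below currently throw NotImplementedException,
// so this only illustrates the intended shape of the API):
//   var ssc = new StreamingContext(sparkContext, 1000);          // 1000 ms batch interval
//   ssc.Checkpoint(@"hdfs://path/to/checkpoint");
//   DStream<string> lines = ssc.TextFileStream(@"hdfs://path/to/input");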
public class StreamingContext
{
public StreamingContext(SparkContext sparkContext, long durationMS)
{
throw new NotImplementedException();
}
/**
* Set each DStream in this context to remember the RDDs it generated in the last given duration.
* DStreams remember RDDs only for a limited duration of time and release them for garbage
* collection. This method allows the developer to specify how long to remember the RDDs (
* if the developer wishes to query old data outside the DStream computation).
* @param duration Minimum duration that each DStream should remember its RDDs
*/
public void Remember(long durationMS)
{
throw new NotImplementedException();
}
/**
* Set the context to periodically checkpoint the DStream operations for driver
* fault-tolerance.
* @param directory HDFS-compatible directory where the checkpoint data will be reliably stored.
* Note that this must be a fault-tolerant file system like HDFS.
*/
public void Checkpoint(string directory)
{
throw new NotImplementedException();
}
public DStream<string> TextFileStream(string directory)
{
throw new NotImplementedException();
}
}
}


@ -0,0 +1,19 @@
<?xml version="1.0" encoding="utf-8" ?>
<configuration>
<appSettings>
<!--*************************************************************************-->
<!--** Uncomment the following settings to run Spark driver executable in **local** or **debug** modes ** -->
<!--** In debug mode, the driver is not launched by CSharpRunner but launched from VS or command prompt not configured for SparkCLR ** -->
<!--** CSharpBackend should be launched in debug mode as well and the port number from that should be used below ** -->
<!--** Command to launch CSharpBackend in debug mode is "sparkclr-submit.cmd debug" ** -->
<!-- CSharpWorkerPath setting is required in ** Local or Debug ** modes -->
<!-- CSharpBackendPortNumber settings are required in ** Debug ** mode only -->
<!--
<add key="CSharpWorkerPath" value="C:\SparkCLR\csharp\Samples\Microsoft.Spark.CSharp\bin\Debug\CSharpWorker.exe"/>
<add key="CSharpBackendPortNumber" value="0"/>
-->
<!--*************************************************************************-->
</appSettings>
</configuration>


@ -0,0 +1,52 @@
// Copyright (c) Microsoft. All rights reserved.
// Licensed under the MIT license. See LICENSE file in the project root for full license information.
using System;
using System.Collections.Generic;
using System.Configuration;
using System.IO;
using System.Linq;
using System.Reflection;
using System.Text;
using System.Threading.Tasks;
using Microsoft.Spark.CSharp.Configuration;
using Microsoft.Spark.CSharp.Core;
using Microsoft.Spark.CSharp.Interop;
namespace Microsoft.Spark.CSharp.Samples
{
internal class Configuration
{
public string SparkLocalDirectoryOverride
{
get;
set;
}
public string SampleDataLocation
{
get;
set;
}
public string SamplesToRun
{
get;
set;
}
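// Resolves a sample data file name to a full path: HDFS locations are concatenated as-is,
// local locations are converted to a file:// URI.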
public string GetInputDataPath(string fileName)
{
if (SampleDataLocation.StartsWith("hdfs://"))
{
var clusterPath = SampleDataLocation + "/" + fileName;
Console.WriteLine("Cluster path " + clusterPath);
return clusterPath;
}
else
{
return new Uri(Path.Combine(SampleDataLocation, fileName)).ToString();
}
}
}
}


@ -0,0 +1,240 @@
// Copyright (c) Microsoft. All rights reserved.
// Licensed under the MIT license. See LICENSE file in the project root for full license information.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using Microsoft.Spark.CSharp.Core;
using Microsoft.Spark.CSharp.Sql;
namespace Microsoft.Spark.CSharp.Samples
{
class DataFrameSamples
{
private const string PeopleJson = @"people.json";
private const string OrderJson = @"order.json";
private const string RequestsLog = @"requestslog.txt";
private const string MetricsLog = @"metricslog.txt";
private static SqlContext sqlContext;
private static SqlContext GetSqlContext()
{
return sqlContext ?? (sqlContext = new SqlContext(SparkCLRSamples.SparkContext));
}
/// <summary>
/// Sample to show schema of DataFrame
/// </summary>
[Sample]
internal static void DFShowSchemaSample()
{
var peopleDataFrame = GetSqlContext().JsonFile(SparkCLRSamples.Configuration.GetInputDataPath(PeopleJson));
peopleDataFrame.Explain(true);
peopleDataFrame.ShowSchema();
}
/// <summary>
/// Sample to register a DataFrame as temptable and run queries
/// </summary>
[Sample]
internal static void DFRegisterTableSample()
{
var peopleDataFrame = GetSqlContext().JsonFile(SparkCLRSamples.Configuration.GetInputDataPath(PeopleJson));
peopleDataFrame.RegisterTempTable("people");
var nameFilteredDataFrame = GetSqlContext().Sql("SELECT name, address.city, address.state FROM people where name='Bill'");
var countDataFrame = GetSqlContext().Sql("SELECT count(name) FROM people where name='Bill'");
var maxAgeDataFrame = GetSqlContext().Sql("SELECT max(age) FROM people where name='Bill'");
long maxAgeDataFrameRowsCount = maxAgeDataFrame.Count();
long nameFilteredDataFrameRowsCount = nameFilteredDataFrame.Count();
long countDataFrameRowsCount = countDataFrame.Count();
Console.WriteLine("nameFilteredDataFrameRowsCount={0}, maxAgeDataFrameRowsCount={1}, countDataFrameRowsCount={2}", nameFilteredDataFrameRowsCount, maxAgeDataFrameRowsCount, countDataFrameRowsCount);
}
/// <summary>
/// Sample to load a text file as dataframe
/// </summary>
[Sample]
internal static void DFTextFileLoadDataFrameSample()
{
var requestsSchema = StructType.CreateStructType(
new List<StructField>
{
StructField.CreateStructField("guid", "string", false),
StructField.CreateStructField("datacenter", "string", false),
StructField.CreateStructField("abtestid", "string", false),
StructField.CreateStructField("traffictype", "string", false),
}
);
var requestsDataFrame = GetSqlContext().TextFile(SparkCLRSamples.Configuration.GetInputDataPath(RequestsLog), requestsSchema);
requestsDataFrame.RegisterTempTable("requests");
var guidFilteredDataFrame = GetSqlContext().Sql("SELECT guid, datacenter FROM requests where guid = '4628deca-139d-4121-b540-8341b9c05c2a'");
guidFilteredDataFrame.Show();
requestsDataFrame.ShowSchema();
requestsDataFrame.Show();
//var count = requestsDataFrame.Count();
guidFilteredDataFrame.ShowSchema();
guidFilteredDataFrame.Show();
//var filteredCount = guidFilteredDataFrame.Count();
}
private static DataFrame GetMetricsDataFrame()
{
var metricsSchema = StructType.CreateStructType(
new List<StructField>
{
StructField.CreateStructField("unknown", "string", false),
StructField.CreateStructField("date", "string", false),
StructField.CreateStructField("time", "string", false),
StructField.CreateStructField("guid", "string", false),
StructField.CreateStructField("lang", "string", false),
StructField.CreateStructField("country", "string", false),
StructField.CreateStructField("latency", "integer", false)
}
);
return
GetSqlContext()
.TextFile(SparkCLRSamples.Configuration.GetInputDataPath(MetricsLog), metricsSchema);
}
/// <summary>
/// Sample to load two text files and join them using temptable constructs
/// </summary>
[Sample]
internal static void DFTextFileJoinTempTableSample()
{
var requestsDataFrame = GetSqlContext().TextFile(SparkCLRSamples.Configuration.GetInputDataPath(RequestsLog));
var metricsDataFrame = GetSqlContext().TextFile(SparkCLRSamples.Configuration.GetInputDataPath(MetricsLog));
metricsDataFrame.ShowSchema();
requestsDataFrame.RegisterTempTable("requests");
metricsDataFrame.RegisterTempTable("metrics");
//C0 - guid in requests DF, C3 - guid in metrics DF
var join = GetSqlContext().Sql(
"SELECT joinedtable.datacenter, max(joinedtable.latency) maxlatency, avg(joinedtable.latency) avglatency " +
"FROM (SELECT a.C1 as datacenter, b.C6 as latency from requests a JOIN metrics b ON a.C0 = b.C3) joinedtable " +
"GROUP BY datacenter");
join.ShowSchema();
join.Show();
//var count = join.Count();
}
/// <summary>
/// Sample to load two text files and join them using DataFrame DSL
/// </summary>
[Sample]
internal static void DFTextFileJoinTableDSLSample()
{
//C0 - guid, C1 - datacenter
var requestsDataFrame = GetSqlContext().TextFile(SparkCLRSamples.Configuration.GetInputDataPath(RequestsLog)).Select("C0", "C1");
//C3 - guid, C6 - latency
var metricsDataFrame = GetSqlContext().TextFile(SparkCLRSamples.Configuration.GetInputDataPath(MetricsLog), ",", false, true).Select("C3", "C6"); //override delimiter, hasHeader & inferSchema
var joinDataFrame = requestsDataFrame.Join(metricsDataFrame, requestsDataFrame["C0"] == metricsDataFrame["C3"]).GroupBy("C1");
var maxLatencyByDcDataFrame = joinDataFrame.Agg(new Dictionary<string, string> { { "C6", "max" } });
var avgLatencyByDcDataFrame = joinDataFrame.Agg(new Dictionary<string, string> { { "C6", "avg" } });
maxLatencyByDcDataFrame.ShowSchema();
maxLatencyByDcDataFrame.Show();
avgLatencyByDcDataFrame.ShowSchema();
avgLatencyByDcDataFrame.Show();
}
/// <summary>
/// Sample to iterate on the schema of DataFrame
/// </summary>
[Sample]
internal static void DFSchemaSample()
{
var peopleDataFrame = GetSqlContext().JsonFile(SparkCLRSamples.Configuration.GetInputDataPath(PeopleJson));
var peopleDataFrameSchema = peopleDataFrame.Schema;
var peopleDataFrameSchemaFields = peopleDataFrameSchema.Fields;
foreach (var peopleDataFrameSchemaField in peopleDataFrameSchemaFields)
{
var name = peopleDataFrameSchemaField.Name;
var dataType = peopleDataFrameSchemaField.DataType;
var stringVal = dataType.ToString();
var simpleStringVal = dataType.SimpleString();
var isNullable = peopleDataFrameSchemaField.IsNullable;
Console.WriteLine("Name={0}, DT.string={1}, DT.simplestring={2}, DT.isNullable={3}", name, stringVal, simpleStringVal, isNullable);
}
}
/// <summary>
/// Sample to convert DataFrames to RDD
/// </summary>
[Sample]
internal static void DFConversionSample()
{
var peopleDataFrame = GetSqlContext().JsonFile(SparkCLRSamples.Configuration.GetInputDataPath(PeopleJson));
var stringRddCreatedFromDataFrame = peopleDataFrame.ToJSON();
var stringRddCreatedFromDataFrameRowCount = stringRddCreatedFromDataFrame.Count();
var byteArrayRddCreatedFromDataFrame = peopleDataFrame.ToRDD();
var byteArrayRddCreatedFromDataFrameRowCount = byteArrayRddCreatedFromDataFrame.Count();
Console.WriteLine("stringRddCreatedFromDataFrameRowCount={0}, byteArrayRddCreatedFromDataFrameRowCount={1}", stringRddCreatedFromDataFrameRowCount, byteArrayRddCreatedFromDataFrameRowCount);
}
/// <summary>
/// Sample to perform simple select and filter on DataFrame using DSL
/// </summary>
[Sample]
internal static void DFProjectionFilterDSLSample()
{
var peopleDataFrame = GetSqlContext().JsonFile(SparkCLRSamples.Configuration.GetInputDataPath(PeopleJson));
var projectedFilteredDataFrame = peopleDataFrame.Select("name", "address.state")
.Where("name = 'Bill' or state = 'California'");
projectedFilteredDataFrame.ShowSchema();
projectedFilteredDataFrame.Show();
var projectedFilteredDataFrameCount = projectedFilteredDataFrame.Count();
projectedFilteredDataFrame.RegisterTempTable("selectedcolumns");
var sqlCountDataFrame = GetSqlContext().Sql("SELECT count(name) FROM selectedcolumns where name='Bill'");
var sqlCountDataFrameRowCount = sqlCountDataFrame.Count();
Console.WriteLine("projectedFilteredDataFrameCount={0}, sqlCountDataFrameRowCount={1}", projectedFilteredDataFrameCount, sqlCountDataFrameRowCount);
}
/// <summary>
/// Sample to join DataFrame using DSL
/// </summary>
[Sample]
internal static void DFJoinSample()
{
var peopleDataFrame = GetSqlContext().JsonFile(SparkCLRSamples.Configuration.GetInputDataPath(PeopleJson));
var orderDataFrame = GetSqlContext().JsonFile(SparkCLRSamples.Configuration.GetInputDataPath(OrderJson));
//Following example does not work in Spark 1.4.1 - uncomment in 1.5
//var peopleDataFrame2 = GetSqlContext().JsonFile(Samples.GetInputDataPath(PeopleJson));
//var columnNameJoin = peopleDataFrame.Join(peopleDataFrame2, new string[] {"id"});
//columnNameJoin.Show();
//columnNameJoin.ShowDF();
var expressionJoin = peopleDataFrame.Join(orderDataFrame, peopleDataFrame["id"] == orderDataFrame["personid"]);
expressionJoin.ShowSchema();
expressionJoin.Show();
}
/// <summary>
/// Sample to perform aggregation on DataFrame using DSL
/// </summary>
[Sample]
internal static void DFAggregateDSLSample()
{
var peopleDataFrame = GetSqlContext().JsonFile(SparkCLRSamples.Configuration.GetInputDataPath(PeopleJson));
var countAggDataFrame = peopleDataFrame.Where("name = 'Bill'").Agg(new Dictionary<string, string> {{"name", "count"}});
var countAggDataFrameCount = countAggDataFrame.Count();
var maxAggDataFrame = peopleDataFrame.GroupBy("name").Agg(new Dictionary<string, string> {{"age", "max"}});
var maxAggDataFrameCount = maxAggDataFrame.Count();
Console.WriteLine("countAggDataFrameCount: {0}, maxAggDataFrameCount: {1}.", countAggDataFrameCount, maxAggDataFrameCount);
}
}
}


@ -0,0 +1,59 @@
// Copyright (c) Microsoft. All rights reserved.
// Licensed under the MIT license. See LICENSE file in the project root for full license information.
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using Microsoft.Spark.CSharp.Core;
using Microsoft.Spark.CSharp.Interop;
namespace Microsoft.Spark.CSharp.Samples
{
class DoubleRDDSamples
{
[Sample]
internal static void DoubleRDDSumSample()
{
Console.WriteLine(SparkCLRSamples.SparkContext.Parallelize(new double[] { 1.0, 2.0, 3.0 }, 2).Sum());
}
[Sample]
internal static void DoubleRDDStatsSample()
{
Console.WriteLine(SparkCLRSamples.SparkContext.Parallelize(new double[] { 1, 2, 3 }, 2).Stats().ToString());
}
[Sample]
internal static void DoubleRDDMeanSample()
{
Console.WriteLine(SparkCLRSamples.SparkContext.Parallelize(new double[] { 1, 2, 3 }, 2).Mean().ToString());
}
[Sample]
internal static void DoubleRDVarianceSample()
{
Console.WriteLine(SparkCLRSamples.SparkContext.Parallelize(new double[] { 1, 2, 3 }, 2).Variance().ToString());
}
[Sample]
internal static void DoubleRDDStdevSample()
{
Console.WriteLine(SparkCLRSamples.SparkContext.Parallelize(new double[] { 1, 2, 3 }, 2).Stdev().ToString());
}
[Sample]
internal static void DoubleRDDSampleStdevSample()
{
Console.WriteLine(SparkCLRSamples.SparkContext.Parallelize(new double[] { 1, 2, 3 }, 2).SampleStdev().ToString());
}
[Sample]
internal static void DoubleRDDSampleVarianceSample()
{
Console.WriteLine(SparkCLRSamples.SparkContext.Parallelize(new double[] { 1, 2, 3 }, 2).SampleVariance().ToString());
}
}
}


@ -0,0 +1,67 @@
// Copyright (c) Microsoft. All rights reserved.
// Licensed under the MIT license. See LICENSE file in the project root for full license information.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using Microsoft.Spark.CSharp.Core;
namespace Microsoft.Spark.CSharp.Samples
{
class MiscSamples
{
/// <summary>
/// Sample to compute the value of Pi
/// </summary>
[Sample]
private static void PiSample()
{
var slices = 2;
var n = (int) Math.Min(100000L*slices, int.MaxValue);
var values = new List<int>(n);
for (int i = 0; i < n; i++)
{
values.Add(i);
}
var count = SparkCLRSamples.SparkContext.Parallelize(values, slices)
.Map(i =>
{
var random = new Random(); //if this definition is moved out of the anonymous method,
//the delegate will form a closure and the compiler
//will generate a type for it without Serializable attribute
//and hence serialization will fail
//the alternative is to use PiHelper below which is marked Serializable
var x = random.NextDouble() * 2 - 1;
var y = random.NextDouble() * 2 - 1;
return (x * x + y * y) < 1 ? 1 : 0;
}
)
.Reduce((x, y) => x + y);
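// Monte Carlo estimate: the fraction of random points in [-1,1]x[-1,1] that land inside the
// unit circle approximates Pi/4, so multiplying the ratio by 4 recovers Pi.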
Console.WriteLine("Pi is roughly " + 4.0 * (int)count / n);
/********* alternative to the count method provided above ***********/
var countComputedUsingAnotherApproach = SparkCLRSamples.SparkContext.Parallelize(values, slices).Map(new PiHelper().Execute).Reduce((x, y) => x + y);
Console.WriteLine("Pi is roughly " + 4.0 * (int)countComputedUsingAnotherApproach / n);
/********************************************************************/
}
[Serializable]
private class PiHelper
{
private readonly Random random = new Random();
public int Execute(int input)
{
var x = random.NextDouble() * 2 - 1;
var y = random.NextDouble() * 2 - 1;
return (x * x + y * y) < 1 ? 1 : 0;
}
}
}
}


@ -0,0 +1,364 @@
// Copyright (c) Microsoft. All rights reserved.
// Licensed under the MIT license. See LICENSE file in the project root for full license information.
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using Microsoft.Spark.CSharp.Core;
using Microsoft.Spark.CSharp.Interop;
namespace Microsoft.Spark.CSharp.Samples
{
class PairRDDSamples
{
[Sample]
internal static void PairRDDCollectAsMapSample()
{
var m = SparkCLRSamples.SparkContext.Parallelize(new[] { new KeyValuePair<int, int>(1, 2), new KeyValuePair<int, int>(3, 4) }, 1).CollectAsMap();
foreach (var kv in m)
Console.WriteLine(kv);
}
[Sample]
internal static void PairRDDKeysSample()
{
var m = SparkCLRSamples.SparkContext.Parallelize(new[] { new KeyValuePair<int, int>(1, 2), new KeyValuePair<int, int>(3, 4) }, 1).Keys().Collect();
Console.WriteLine(m[0]);
Console.WriteLine(m[1]);
}
[Sample]
internal static void PairRDDValuesSample()
{
var m = SparkCLRSamples.SparkContext.Parallelize(new[] { new KeyValuePair<int, int>(1, 2), new KeyValuePair<int, int>(3, 4) }, 1).Values().Collect();
Console.WriteLine(m[0]);
Console.WriteLine(m[1]);
}
[Sample]
internal static void PairRDDReduceByKeySample()
{
var m = SparkCLRSamples.SparkContext.Parallelize(
new[]
{
new KeyValuePair<string, int>("a", 1),
new KeyValuePair<string, int>("b", 1),
new KeyValuePair<string, int>("a", 1)
}, 2)
.ReduceByKey((x, y) => x + y).Collect();
foreach (var kv in m)
Console.WriteLine(kv);
}
[Sample]
internal static void PairRDDReduceByKeyLocallySample()
{
var m = SparkCLRSamples.SparkContext.Parallelize(
new[]
{
new KeyValuePair<string, int>("a", 1),
new KeyValuePair<string, int>("b", 1),
new KeyValuePair<string, int>("a", 1)
}, 2)
.ReduceByKeyLocally((x, y) => x + y);
foreach (var kv in m)
Console.WriteLine(kv);
}
[Sample]
internal static void PairRDDCountByKeySample()
{
var m = SparkCLRSamples.SparkContext.Parallelize(
new[]
{
new KeyValuePair<string, int>("a", 1),
new KeyValuePair<string, int>("b", 1),
new KeyValuePair<string, int>("a", 1)
}, 2)
.CountByKey();
foreach (var kv in m)
Console.WriteLine(kv);
}
[Sample]
internal static void PairRDDJoinSample()
{
var l = SparkCLRSamples.SparkContext.Parallelize(
new[]
{
new KeyValuePair<string, int>("a", 1),
new KeyValuePair<string, int>("b", 4),
}, 1);
var r = SparkCLRSamples.SparkContext.Parallelize(
new[]
{
new KeyValuePair<string, int>("a", 2),
new KeyValuePair<string, int>("a", 3),
}, 1);
var m = l.Join(r, 2).Collect();
foreach (var kv in m)
Console.WriteLine(kv);
}
[Sample]
internal static void PairRDDLeftOuterJoinSample()
{
var l = SparkCLRSamples.SparkContext.Parallelize(
new[]
{
new KeyValuePair<string, int>("a", 1),
new KeyValuePair<string, int>("b", 4),
}, 2);
var r = SparkCLRSamples.SparkContext.Parallelize(
new[]
{
new KeyValuePair<string, int>("a", 2),
}, 1);
var m = l.LeftOuterJoin(r).Collect();
foreach (var kv in m)
Console.WriteLine(kv);
}
[Sample]
internal static void PairRDDRightOuterJoinSample()
{
var l = SparkCLRSamples.SparkContext.Parallelize(
new[]
{
new KeyValuePair<string, int>("a", 2),
}, 1);
var r = SparkCLRSamples.SparkContext.Parallelize(
new[]
{
new KeyValuePair<string, int>("a", 1),
new KeyValuePair<string, int>("b", 4),
}, 2);
var m = l.RightOuterJoin(r).Collect();
foreach (var kv in m)
Console.WriteLine(kv);
}
[Sample]
internal static void PairRDDFullOuterJoinSample()
{
var l = SparkCLRSamples.SparkContext.Parallelize(
new[]
{
new KeyValuePair<string, int>("a", 1),
new KeyValuePair<string, int>("b", 4),
}, 2);
var r = SparkCLRSamples.SparkContext.Parallelize(
new[]
{
new KeyValuePair<string, int>("a", 2),
new KeyValuePair<string, int>("c", 8),
}, 2);
var m = l.FullOuterJoin(r).Collect();
foreach (var kv in m)
Console.WriteLine(kv);
}
[Sample]
internal static void PairRDDPartitionBySample()
{
var m = SparkCLRSamples.SparkContext.Parallelize(new[] { 1, 2, 3, 4, 2, 4, 1 }, 1)
.Map(x => new KeyValuePair<int, int>(x, x))
.PartitionBy(3)
.Glom()
.Collect();
foreach (var a in m)
{
foreach (var kv in a)
{
Console.Write(kv + " ");
}
Console.WriteLine();
}
}
[Sample]
internal static void PairRDDCombineByKeySample()
{
var m = SparkCLRSamples.SparkContext.Parallelize(
new[]
{
new KeyValuePair<string, int>("a", 1),
new KeyValuePair<string, int>("b", 1),
new KeyValuePair<string, int>("a", 1)
}, 2)
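// CombineByKey takes three delegates: create the initial combiner for a key,
// merge a value into a combiner within a partition, and merge combiners across partitions.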
.CombineByKey(() => string.Empty, (x, y) => x + y.ToString(), (x, y) => x + y).Collect();
foreach (var kv in m)
Console.WriteLine(kv);
}
[Sample]
internal static void PairRDDAggregateByKeySample()
{
var m = SparkCLRSamples.SparkContext.Parallelize(
new[]
{
new KeyValuePair<string, int>("a", 1),
new KeyValuePair<string, int>("b", 1),
new KeyValuePair<string, int>("a", 1)
}, 2)
.AggregateByKey(() => 0, (x, y) => x + y, (x, y) => x + y).Collect();
foreach (var kv in m)
Console.WriteLine(kv);
}
[Sample]
internal static void PairRDDFoldByKeySample()
{
var m = SparkCLRSamples.SparkContext.Parallelize(
new[]
{
new KeyValuePair<string, int>("a", 1),
new KeyValuePair<string, int>("b", 1),
new KeyValuePair<string, int>("a", 1)
}, 2)
.FoldByKey(() => 0, (x, y) => x + y).Collect();
foreach (var kv in m)
Console.WriteLine(kv);
}
[Sample]
internal static void PairRDDGroupByKeySample()
{
var m = SparkCLRSamples.SparkContext.Parallelize(
new[]
{
new KeyValuePair<string, int>("a", 1),
new KeyValuePair<string, int>("b", 1),
new KeyValuePair<string, int>("a", 1)
}, 2)
.GroupByKey().MapValues(l => string.Join(" ", l)).Collect();
foreach (var kv in m)
Console.WriteLine(kv);
}
[Sample]
internal static void PairRDDMapValuesSample()
{
var m = SparkCLRSamples.SparkContext.Parallelize(
new[]
{
new KeyValuePair<string, string[]>("a", new[]{"apple", "banana", "lemon"}),
new KeyValuePair<string, string[]>("b", new[]{"grapes"})
}, 2)
.MapValues(x => x.Length).Collect();
foreach (var kv in m)
Console.WriteLine(kv);
}
[Sample]
internal static void PairRDDFlatMapValuesSample()
{
var m = SparkCLRSamples.SparkContext.Parallelize(
new[]
{
new KeyValuePair<string, string[]>("a", new[]{"x", "y", "z"}),
new KeyValuePair<string, string[]>("b", new[]{"p", "r"})
}, 2)
.FlatMapValues(x => x).Collect();
foreach (var kv in m)
Console.WriteLine(kv);
}
[Sample]
internal static void PairRDDGroupWithSample()
{
var x = SparkCLRSamples.SparkContext.Parallelize(new[] { new KeyValuePair<string, int>("a", 1), new KeyValuePair<string, int>("b", 4)}, 2);
var y = SparkCLRSamples.SparkContext.Parallelize(new[] { new KeyValuePair<string, int>("a", 2)}, 1);
var m = x.GroupWith(y).MapValues(l => string.Join(" ", l.Item1) + " : " + string.Join(" ", l.Item2)).Collect();
foreach (var kv in m)
Console.WriteLine(kv);
}
[Sample]
internal static void PairRDDGroupWithSample2()
{
var x = SparkCLRSamples.SparkContext.Parallelize(new[] { new KeyValuePair<string, int>("a", 5), new KeyValuePair<string, int>("b", 6) }, 2);
var y = SparkCLRSamples.SparkContext.Parallelize(new[] { new KeyValuePair<string, int>("a", 1), new KeyValuePair<string, int>("b", 4) }, 2);
var z = SparkCLRSamples.SparkContext.Parallelize(new[] { new KeyValuePair<string, int>("a", 2) }, 1);
var m = x.GroupWith(y, z).MapValues(l => string.Join(" ", l.Item1) + " : " + string.Join(" ", l.Item2) + " : " + string.Join(" ", l.Item3)).Collect();
foreach (var kv in m)
Console.WriteLine(kv);
}
[Sample]
internal static void PairRDDGroupWithSample3()
{
var x = SparkCLRSamples.SparkContext.Parallelize(new[] { new KeyValuePair<string, int>("a", 5), new KeyValuePair<string, int>("b", 6) }, 2);
var y = SparkCLRSamples.SparkContext.Parallelize(new[] { new KeyValuePair<string, int>("a", 1), new KeyValuePair<string, int>("b", 4) }, 2);
var z = SparkCLRSamples.SparkContext.Parallelize(new[] { new KeyValuePair<string, int>("a", 2) }, 1);
var w = SparkCLRSamples.SparkContext.Parallelize(new[] { new KeyValuePair<string, int>("b", 42) }, 1);
var m = x.GroupWith(y, z, w).MapValues(l => string.Join(" ", l.Item1) + " : " + string.Join(" ", l.Item2) + " : " + string.Join(" ", l.Item3) + " : " + string.Join(" ", l.Item4)).Collect();
foreach (var kv in m)
Console.WriteLine(kv);
}
//[Sample]
internal static void PairRDDSampleByKeySample()
{
var fractions = new Dictionary<string, double> { { "a", 0.2 }, { "b", 0.1 } };
var rdd = SparkCLRSamples.SparkContext.Parallelize(fractions.Keys.ToArray(), 2).Cartesian(SparkCLRSamples.SparkContext.Parallelize(Enumerable.Range(0, 1000), 2));
var sample = rdd.Map(t => new KeyValuePair<string, int>(t.Item1, t.Item2)).SampleByKey(false, fractions, 2).GroupByKey().Collect();
Console.WriteLine(sample);
}
[Sample]
internal static void PairRDDSubtractByKeySample()
{
var x = SparkCLRSamples.SparkContext.Parallelize(new[] { new KeyValuePair<string, int?>("a", 1), new KeyValuePair<string, int?>("b", 4), new KeyValuePair<string, int?>("b", 5), new KeyValuePair<string, int?>("a", 2) }, 2);
var y = SparkCLRSamples.SparkContext.Parallelize(new[] { new KeyValuePair<string, int?>("a", 3), new KeyValuePair<string, int?>("c", null) }, 2);
var m = x.SubtractByKey(y).Collect();
foreach (var kv in m)
Console.WriteLine(kv);
}
[Sample]
internal static void PairRDDLookupSample()
{
var rdd = SparkCLRSamples.SparkContext.Parallelize(Enumerable.Range(0, 1000).Zip(Enumerable.Range(0, 1000), (x, y) => new KeyValuePair<int, int>(x, y)), 10);
Console.WriteLine(string.Join(",", rdd.Lookup(42)));
Console.WriteLine(string.Join(",", rdd.Lookup(1024)));
}
}
}


@ -0,0 +1,104 @@
// Copyright (c) Microsoft. All rights reserved.
// Licensed under the MIT license. See LICENSE file in the project root for full license information.
using System;
using System.Configuration;
using System.IO;
using System.Linq;
using System.Reflection;
using System.Text;
using Microsoft.Spark.CSharp.Core;
using Microsoft.Spark.CSharp.Interop;
namespace Microsoft.Spark.CSharp.Samples
{
/// <summary>
/// Samples for SparkCLR
/// </summary>
public class SparkCLRSamples
{
internal static Configuration Configuration = new Configuration();
internal static SparkContext SparkContext;
static void Main(string[] args)
{
ProcessArguments(args);
using (SparkCLREnvironment.Initialize())
{
SparkContext = CreateSparkContext();
RunSamples();
SparkContext.Stop();
}
}
// Creates and returns a context
private static SparkContext CreateSparkContext()
{
var conf = new SparkConf();
if (Configuration.SparkLocalDirectoryOverride != null)
{
conf.Set("spark.local.dir", Configuration.SparkLocalDirectoryOverride);
}
return new SparkContext(conf);
}
//finds all methods that are marked with [Sample] attribute and
//runs all of them if sparkclr.samples.torun commandline arg is not used
//or just runs the ones that are provided as comma separated list
private static void RunSamples()
{
var samples = Assembly.GetEntryAssembly().GetTypes()
.SelectMany(type => type.GetMethods(BindingFlags.NonPublic | BindingFlags.Static))
.Where(method => method.GetCustomAttributes(typeof(SampleAttribute), false).Length > 0)
.OrderByDescending(method => method.Name);
foreach (var sample in samples)
{
bool runSample = true;
if (Configuration.SamplesToRun != null)
{
if (!Configuration.SamplesToRun.Contains(sample.Name)) //assumes method/sample names are unique
{
runSample = false;
}
}
if (runSample)
{
Console.WriteLine("----- Running sample {0} -----", sample.Name);
sample.Invoke(null, new object[] { });
}
}
}
//simple commandline arg processor
private static void ProcessArguments(string[] args)
{
Console.WriteLine("Arguments to SparkCLRSamples are {0}", string.Join(",", args));
for (int i=0; i<args.Length;i++)
{
if (args[i].Equals("spark.local.dir", StringComparison.InvariantCultureIgnoreCase))
{
Configuration.SparkLocalDirectoryOverride = args[i + 1];
}
else if (args[i].Equals("sparkclr.sampledata.loc", StringComparison.InvariantCultureIgnoreCase))
{
Configuration.SampleDataLocation = args[i + 1];
}
else if (args[i].Equals("sparkclr.samples.torun", StringComparison.InvariantCultureIgnoreCase))
{
Configuration.SamplesToRun = args[i + 1];
}
}
}
}
/// <summary>
/// Attribute that marks a method as a sample
/// </summary>
[AttributeUsage(AttributeTargets.Method)]
internal class SampleAttribute : Attribute
{}
}


@ -0,0 +1,453 @@
// Copyright (c) Microsoft. All rights reserved.
// Licensed under the MIT license. See LICENSE file in the project root for full license information.
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using Microsoft.Spark.CSharp.Core;
using Microsoft.Spark.CSharp.Interop;
namespace Microsoft.Spark.CSharp.Samples
{
class RDDSamples
{
//[Sample]
internal static void RDDCheckpointSample()
{
var rdd = SparkCLRSamples.SparkContext.Parallelize(Enumerable.Range(0, 100), 4);
rdd.Cache();
rdd.Unpersist();
rdd.Checkpoint();
Console.WriteLine(rdd.IsCheckpointed);
}
[Sample]
internal static void RDDSampleSample()
{
var rdd = SparkCLRSamples.SparkContext.Parallelize(Enumerable.Range(0, 100), 4);
Console.WriteLine(rdd.Sample(false, 0.1, 81).Count());
}
[Sample]
internal static void RDDRandomSplitSample()
{
var rdd = SparkCLRSamples.SparkContext.Parallelize(Enumerable.Range(0, 500), 1);
var rdds = rdd.RandomSplit(new double[] { 2, 3 }, 17);
Console.WriteLine(rdds[0].Count());
Console.WriteLine(rdds[1].Count());
}
[Sample]
internal static void RDDTakeSampleSample()
{
var rdd = SparkCLRSamples.SparkContext.Parallelize(Enumerable.Range(0, 10), 2);
Console.WriteLine(rdd.TakeSample(true, 20, 1).Length);
Console.WriteLine(rdd.TakeSample(false, 5, 2).Length);
Console.WriteLine(rdd.TakeSample(false, 15, 3).Length);
}
[Sample]
internal static void RDDUnionSample()
{
var rdd = SparkCLRSamples.SparkContext.Parallelize(new int[] { 1, 1, 2, 3 }, 1);
Console.WriteLine(string.Join(",", rdd.Union(rdd).Collect()));
}
[Sample]
internal static void RDDIntersectionSample()
{
var rdd1 = SparkCLRSamples.SparkContext.Parallelize(new int[] { 1, 10, 2, 3, 4, 5 }, 1);
var rdd2 = SparkCLRSamples.SparkContext.Parallelize(new int[] { 1, 6, 2, 3, 7, 8 }, 1);
Console.WriteLine(string.Join(",", rdd1.Intersection(rdd2).Collect()));
}
[Sample]
internal static void RDDGlomSample()
{
var rdd = SparkCLRSamples.SparkContext.Parallelize(new int[] { 1, 2, 3, 4 }, 2);
foreach (var l in rdd.Glom().Collect())
Console.WriteLine(string.Join(",", l));
}
[Sample]
internal static void RDDGroupBySample()
{
var rdd = SparkCLRSamples.SparkContext.Parallelize(new int[] { 1, 1, 2, 3, 5, 8 }, 1);
foreach (var kv in rdd.GroupBy(x => x % 2).Collect())
Console.WriteLine(kv.Key + ", " + string.Join(",", kv.Value));
}
[Sample]
internal static void RDDForeachSample()
{
SparkCLRSamples.SparkContext.Parallelize(new int[] { 1, 2, 3, 4, 5 }, 1).Foreach(x => Console.Write(x + " "));
Console.WriteLine();
}
[Sample]
internal static void RDDForeachPartitionSample()
{
SparkCLRSamples.SparkContext.Parallelize(new int[] { 1, 2, 3, 4, 5 }, 1).ForeachPartition(iter => { foreach (var x in iter) Console.Write(x + " "); });
Console.WriteLine();
}
[Sample]
internal static void RDDReduceSample()
{
Console.WriteLine(SparkCLRSamples.SparkContext.Parallelize(new int[] { 1, 2, 3, 4, 5 }, 1).Reduce((x, y) => x + y));
}
[Sample]
internal static void RDDTreeReduceSample()
{
Console.WriteLine(SparkCLRSamples.SparkContext.Parallelize(new int[] { -5, -4, -3, -2, -1, 1, 2, 3, 4 }, 10).TreeReduce((x, y) => x + y));
}
[Sample]
internal static void RDDFoldSample()
{
Console.WriteLine(SparkCLRSamples.SparkContext.Parallelize(new int[] { 1, 2, 3, 4, 5 }, 1).Fold(0, (x, y) => x + y));
}
[Sample]
internal static void RDDAggregateSample()
{
Console.WriteLine(SparkCLRSamples.SparkContext.Parallelize(new int[] { 1, 2, 3, 4 }, 1).Aggregate(0, (x, y) => x + y, (x, y) => x + y));
}
[Sample]
internal static void RDDTreeAggregateSample()
{
Console.WriteLine(SparkCLRSamples.SparkContext.Parallelize(new int[] { 1, 2, 3, 4 }, 1).TreeAggregate(0, (x, y) => x + y, (x, y) => x + y));
}
[Sample]
internal static void RDDCountByValueSample()
{
foreach (var item in SparkCLRSamples.SparkContext.Parallelize(new int[] { 1, 2, 1, 2, 2 }, 2).CountByValue())
Console.WriteLine(item);
}
[Sample]
internal static void RDDTakeSample()
{
Console.WriteLine(string.Join(",", SparkCLRSamples.SparkContext.Parallelize(new int[] { 2, 3, 4, 5, 6 }, 2).Cache().Take(2)));
}
[Sample]
internal static void RDDFirstSample()
{
Console.WriteLine(SparkCLRSamples.SparkContext.Parallelize(new int[] { 2, 3, 4 }, 2).First());
}
[Sample]
internal static void RDDIsEmptySample()
{
Console.WriteLine(SparkCLRSamples.SparkContext.Parallelize(new int[0], 1).IsEmpty());
}
[Sample]
internal static void RDDSubtractSample()
{
var x = SparkCLRSamples.SparkContext.Parallelize(new int[] { 1, 2, 3, 4 }, 1);
var y = SparkCLRSamples.SparkContext.Parallelize(new int[] { 3 }, 1);
Console.WriteLine(string.Join(",", x.Subtract(y).Collect()));
}
[Sample]
internal static void RDDKeyBySample()
{
foreach (var kv in SparkCLRSamples.SparkContext.Parallelize(new int[] { 1, 2, 3, 4 }, 1).KeyBy(x => x * x).Collect())
Console.Write(kv + " ");
Console.WriteLine();
}
[Sample]
internal static void RDDRepartitionSample()
{
var rdd = SparkCLRSamples.SparkContext.Parallelize(new int[] { 1, 2, 3, 4, 5, 6, 7 }, 4);
Console.WriteLine(rdd.Glom().Collect().Length);
Console.WriteLine(rdd.Repartition(2).Glom().Collect().Length);
}
[Sample]
internal static void RDDCoalesceSample()
{
var rdd = SparkCLRSamples.SparkContext.Parallelize(new int[] { 1, 2, 3, 4, 5 }, 3);
Console.WriteLine(rdd.Glom().Collect().Length);
Console.WriteLine(rdd.Coalesce(1).Glom().Collect().Length);
}
[Sample]
internal static void RDDZipSample()
{
var x = SparkCLRSamples.SparkContext.Parallelize(Enumerable.Range(0, 5), 1);
var y = SparkCLRSamples.SparkContext.Parallelize(Enumerable.Range(1000, 5), 1);
foreach (var t in x.Zip(y).Collect())
Console.WriteLine(t);
}
[Sample]
internal static void RDDZipWithIndexSample()
{
foreach (var t in SparkCLRSamples.SparkContext.Parallelize(new string[] { "a", "b", "c", "d" }, 3).ZipWithIndex().Collect())
Console.WriteLine(t);
}
[Sample]
internal static void RDDZipWithUniqueIdSample()
{
foreach (var t in SparkCLRSamples.SparkContext.Parallelize(new string[] { "a", "b", "c", "d", "e" }, 3).ZipWithUniqueId().Collect())
Console.WriteLine(t);
}
[Sample]
internal static void RDDSetNameSample()
{
var rdd = SparkCLRSamples.SparkContext.Parallelize(new string[] { "a", "b", "c", "d", "e" }, 3);
Console.WriteLine(rdd.SetName("SampleRDD").Name);
}
[Sample]
internal static void RDDToDebugStringSample()
{
var rdd = SparkCLRSamples.SparkContext.Parallelize(new string[] { "a", "b", "c", "d", "e" }, 3);
Console.WriteLine(rdd.ToDebugString());
}
[Sample]
internal static void RDDToLocalIteratorSample()
{
Console.WriteLine(string.Join(",", SparkCLRSamples.SparkContext.Parallelize(Enumerable.Range(0, 10), 1).ToLocalIterator().ToArray()));
}
[Sample]
internal static void RDDSaveAsTextFileSample()
{
var rdd = SparkCLRSamples.SparkContext.Parallelize(new string[] { "a", "b", "c", "d", "e" }, 2);
var path = Path.GetTempFileName();
File.Delete(path);
rdd.SaveAsTextFile(path);
}
//[Sample]
internal static void RDDCartesianSample()
{
var rdd = SparkCLRSamples.SparkContext.Parallelize(new int[] { 1, 2 }, 1);
foreach (var t in rdd.Cartesian(rdd).Collect())
Console.WriteLine(t);
}
[Sample]
internal static void RDDDistinctSample()
{
var m = SparkCLRSamples.SparkContext.Parallelize(new int[] { 1, 1, 2, 3 }, 1).Distinct(1).Collect();
foreach (var v in m)
Console.Write(v + " ");
Console.WriteLine();
}
[Sample]
internal static void RDDMaxSample()
{
Console.WriteLine(SparkCLRSamples.SparkContext.Parallelize(new double[] { 1.0, 5.0, 43.0, 10.0 }, 2).Max());
}
[Sample]
internal static void RDDMinSample()
{
Console.WriteLine(SparkCLRSamples.SparkContext.Parallelize(new double[] { 2.0, 5.0, 43.0, 10.0 }, 2).Min());
}
[Sample]
internal static void RDDTakeOrderedSample()
{
Console.WriteLine(string.Join(",", SparkCLRSamples.SparkContext.Parallelize(new int[] { 10, 1, 2, 9, 3, 4, 5, 6, 7 }, 2).TakeOrdered(6)));
}
[Sample]
internal static void RDDTopSample()
{
Console.WriteLine(string.Join(",", SparkCLRSamples.SparkContext.Parallelize(new int[] { 2, 3, 4, 5, 6 }, 2).Top(3)));
}
/// <summary>
/// Counts words in a file
/// </summary>
[Sample]
internal static void RDDWordCountSample()
{
var lines = SparkCLRSamples.SparkContext.TextFile(SparkCLRSamples.Configuration.GetInputDataPath("words.txt"), 1);
var words = lines.FlatMap(s => s.Split(new[] {" "}, StringSplitOptions.None));
var wordCounts = words.Map(w => new KeyValuePair<string, int>(w.Trim(), 1))
.ReduceByKey((x, y) => x + y).Collect();
Console.WriteLine("*** Printing words and their counts ***");
foreach (var kvp in wordCounts)
{
Console.WriteLine("'{0}':{1}", kvp.Key, kvp.Value);
}
var wordCountsCaseInsensitive = words.Map(w => new KeyValuePair<string, int>(w.ToLower().Trim(), 1))
.ReduceByKey((x, y) => x + y).Collect();
Console.WriteLine("*** Printing words and their case insensitive counts ***");
foreach (var kvp in wordCountsCaseInsensitive)
{
Console.WriteLine("'{0}':{1}", kvp.Key, kvp.Value);
}
}
/// <summary>
/// Performs a join of 2 RDDs and run reduction
/// </summary>
[Sample]
internal static void RDDJoinSample()
{
var requests = SparkCLRSamples.SparkContext.TextFile(SparkCLRSamples.Configuration.GetInputDataPath("requestslog.txt"), 1);
var metrics = SparkCLRSamples.SparkContext.TextFile(SparkCLRSamples.Configuration.GetInputDataPath("metricslog.txt"), 1);
var requestsColumns = requests.Map(s =>
{
var columns = s.Split(new[] { "," }, StringSplitOptions.None);
return new KeyValuePair<string, string[]>(columns[0], new[] { columns[1], columns[2], columns[3] });
});
var metricsColumns = metrics.Map(s =>
{
var columns = s.Split(new[] { "," }, StringSplitOptions.None);
return new KeyValuePair<string, string[]>(columns[3], new[] { columns[4], columns[5], columns[6] });
});
var requestsJoinedWithMetrics = requestsColumns.Join(metricsColumns)
.Map(
s =>
new []
{
s.Key, //guid
s.Value.Item1[0], s.Value.Item1[1], s.Value.Item1[2], //dc, abtestid, traffictype
s.Value.Item2[0],s.Value.Item2[1], s.Value.Item2[2] //lang, country, metric
});
var latencyByDatacenter = requestsJoinedWithMetrics.Map(i => new KeyValuePair<string, int> (i[1], int.Parse(i[6]))); //key is "datacenter"
var maxLatencyByDataCenterList = latencyByDatacenter.ReduceByKey(Math.Max).Collect();
Console.WriteLine("***** Max latency metrics by DC *****");
foreach (var keyValuePair in maxLatencyByDataCenterList)
{
Console.WriteLine("Datacenter={0}, Max latency={1}", keyValuePair.Key, keyValuePair.Value);
}
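// To compute the mean, accumulate (latency sum, request count) pairs per datacenter
// and divide once all values are reduced.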
var latencyAndCountByDatacenter = requestsJoinedWithMetrics.Map(i => new KeyValuePair<string, Tuple<int,int>> (i[1], new Tuple<int, int>(int.Parse(i[6]), 1)));
var sumLatencyAndCountByDatacenter = latencyAndCountByDatacenter.ReduceByKey((tuple, tuple1) => new Tuple<int, int>((tuple == null ? 0 : tuple.Item1) + tuple1.Item1, (tuple == null ? 0 : tuple.Item2) + tuple1.Item2));
var sumLatencyAndCountByDatacenterList = sumLatencyAndCountByDatacenter.Collect();
Console.WriteLine("***** Mean latency metrics by DC *****");
foreach (var keyValuePair in sumLatencyAndCountByDatacenterList)
{
Console.WriteLine("Datacenter={0}, Mean latency={1}", keyValuePair.Key, keyValuePair.Value.Item1/keyValuePair.Value.Item2);
}
}
/// <summary>
/// Sample for map and filter in RDD
/// </summary>
[Sample]
internal static void RDDMapFilterSample()
{
var mulogs = SparkCLRSamples.SparkContext.TextFile(SparkCLRSamples.Configuration.GetInputDataPath("csvtestlog.txt"), 2);
var mulogsProjected = mulogs.Map(x =>
{
var columns = x.Split(new[] { "," }, StringSplitOptions.None);
return string.Format("{0},{1},{2},{3}", columns[0], columns[1], columns[2], columns[3]);
});
var muLogsFiltered = mulogsProjected.Filter(s => s.Contains("US,EN"));
var count = muLogsFiltered.Count();
var collectedItems = muLogsFiltered.Collect();
Console.WriteLine("MapFilterExample: EN-US entries count is " + count);
Console.WriteLine("Items are...");
foreach (var collectedItem in collectedItems)
{
Console.WriteLine(collectedItem);
}
}
/// <summary>
/// Sample for distributing objects as RDD
/// </summary>
[Sample]
internal static void RDDSerializableObjectCollectionSample()
{
var personsRdd = SparkCLRSamples.SparkContext.Parallelize(new[] { new Person { Age = 3 }, new Person { Age = 10 }, new Person { Age = 15 } }, 3);
var derivedPersonsRdd = personsRdd.Map(x => new Person { Age = x.Age + 1 });
var countOfPersonsFiltered = derivedPersonsRdd.Filter(person => person.Age >= 11).Count();
Console.WriteLine("SerializableObjectCollectionExample: countOfPersonsFiltered " + countOfPersonsFiltered);
}
/// <summary>
/// Sample for distributing strings as RDD
/// </summary>
[Sample]
internal static void RDDStringCollectionSample()
{
var logEntriesRdd = SparkCLRSamples.SparkContext.Parallelize(new[] { "row1col1,row1col2", "row2col1,row2col2", "row3col3" }, 2);
var logEntriesColumnRdd = logEntriesRdd.Map(x => x.Split(new[] { "," }, StringSplitOptions.RemoveEmptyEntries));
var countOfInvalidLogEntries = logEntriesColumnRdd.Filter(stringarray => stringarray.Length != 2).Count();
Console.WriteLine("StringCollectionExample: countOfInvalidLogEntries " + countOfInvalidLogEntries);
}
/// <summary>
/// Sample for distributing int as RDD
/// </summary>
[Sample]
internal static void RDDIntCollectionSample()
{
var numbersRdd = SparkCLRSamples.SparkContext.Parallelize(new[] { 1, 100, 5, 55, 65 }, 3);
var oddNumbersRdd = numbersRdd.Filter(x => x % 2 != 0);
var countOfOddNumbers = oddNumbersRdd.Count();
Console.WriteLine("IntCollectionExample: countOfOddNumbers " + countOfOddNumbers);
}
/// <summary>
/// Sample for CombineByKey method
/// </summary>
[Sample]
internal static void RDDCombineBySample()
{
var markets = SparkCLRSamples.SparkContext.TextFile(SparkCLRSamples.Configuration.GetInputDataPath("market.tab"), 1);
long totalMarketsCount = markets.Count();
var marketsByKey = markets.Map(x => new KeyValuePair<string, string>(x.Substring(0, x.IndexOf('-')), x));
var categories = marketsByKey.PartitionBy(2)
.CombineByKey(() => "", (c, v) => v.Substring(0, v.IndexOf('-')), (c1, c2) => c1, 2);
var categoriesCollectedCount = categories.Collect().Count();
var joinedRddCollectedItemCount = marketsByKey.Join(categories, 2).Collect().Count();
var filteredRddCollectedItemCount = markets.Filter(line => line.Contains("EN")).Collect().Count();
//var markets = filtered.reduce((left, right) => left + right);
var combinedRddCollectedItemCount = marketsByKey.PartitionBy(2).CombineByKey(() => "", (c, v) => c + v, (c1, c2) => c1 + c2, 2).Collect().Count();
Console.WriteLine("MarketExample: totalMarketsCount {0}, joinedRddCollectedItemCount {1}, filteredRddCollectedItemCount {2}, combinedRddCollectedItemCount {3}", totalMarketsCount, joinedRddCollectedItemCount, filteredRddCollectedItemCount, combinedRddCollectedItemCount);
}
}
[Serializable]
public class Person
{
public int Age { get; set; }
}
}


@ -0,0 +1,102 @@
<?xml version="1.0" encoding="utf-8"?>
<Project ToolsVersion="12.0" DefaultTargets="Build" xmlns="http://schemas.microsoft.com/developer/msbuild/2003">
<Import Project="$(MSBuildExtensionsPath)\$(MSBuildToolsVersion)\Microsoft.Common.props" Condition="Exists('$(MSBuildExtensionsPath)\$(MSBuildToolsVersion)\Microsoft.Common.props')" />
<PropertyGroup>
<Configuration Condition=" '$(Configuration)' == '' ">Debug</Configuration>
<Platform Condition=" '$(Platform)' == '' ">AnyCPU</Platform>
<ProjectGuid>{913E6A56-9839-4379-8B3C-855BA9341663}</ProjectGuid>
<OutputType>Exe</OutputType>
<AppDesignerFolder>Properties</AppDesignerFolder>
<RootNamespace>Microsoft.Spark.CSharp.Samples</RootNamespace>
<AssemblyName>SparkCLRSamples</AssemblyName>
<TargetFrameworkVersion>v4.5</TargetFrameworkVersion>
<FileAlignment>512</FileAlignment>
</PropertyGroup>
<PropertyGroup Condition=" '$(Configuration)|$(Platform)' == 'Debug|AnyCPU' ">
<PlatformTarget>AnyCPU</PlatformTarget>
<DebugSymbols>true</DebugSymbols>
<DebugType>full</DebugType>
<Optimize>false</Optimize>
<OutputPath>bin\Debug\</OutputPath>
<DefineConstants>DEBUG;TRACE</DefineConstants>
<ErrorReport>prompt</ErrorReport>
<WarningLevel>4</WarningLevel>
<Prefer32Bit>false</Prefer32Bit>
</PropertyGroup>
<PropertyGroup Condition=" '$(Configuration)|$(Platform)' == 'Release|AnyCPU' ">
<PlatformTarget>AnyCPU</PlatformTarget>
<DebugType>pdbonly</DebugType>
<Optimize>true</Optimize>
<OutputPath>bin\Release\</OutputPath>
<DefineConstants>TRACE</DefineConstants>
<ErrorReport>prompt</ErrorReport>
<WarningLevel>4</WarningLevel>
</PropertyGroup>
<ItemGroup>
<Reference Include="System" />
<Reference Include="System.Configuration" />
<Reference Include="System.Core" />
<Reference Include="System.Xml.Linq" />
<Reference Include="System.Data.DataSetExtensions" />
<Reference Include="Microsoft.CSharp" />
<Reference Include="System.Data" />
<Reference Include="System.Xml" />
</ItemGroup>
<ItemGroup>
<Compile Include="Configuration.cs" />
<Compile Include="DataFrameSamples.cs" />
<Compile Include="MiscSamples.cs" />
<Compile Include="DoubleRDDSamples.cs" />
<Compile Include="Program.cs" />
<Compile Include="PairRDDSamples.cs" />
<Compile Include="SparkContextSamples.cs" />
<Compile Include="RDDSamples.cs" />
</ItemGroup>
<ItemGroup>
<None Include="App.config" />
<None Include="data\market.tab">
<CopyToOutputDirectory>PreserveNewest</CopyToOutputDirectory>
</None>
<None Include="data\people.json">
<CopyToOutputDirectory>PreserveNewest</CopyToOutputDirectory>
</None>
<None Include="data\order.json">
<CopyToOutputDirectory>PreserveNewest</CopyToOutputDirectory>
</None>
</ItemGroup>
<ItemGroup>
<Folder Include="Properties\" />
</ItemGroup>
<ItemGroup>
<Content Include="data\csvtestlog.txt">
<CopyToOutputDirectory>PreserveNewest</CopyToOutputDirectory>
</Content>
<Content Include="data\metricslog.txt">
<CopyToOutputDirectory>PreserveNewest</CopyToOutputDirectory>
</Content>
<Content Include="data\requestslog.txt">
<CopyToOutputDirectory>PreserveNewest</CopyToOutputDirectory>
</Content>
<Content Include="data\words.txt">
<CopyToOutputDirectory>PreserveNewest</CopyToOutputDirectory>
</Content>
</ItemGroup>
<ItemGroup>
<ProjectReference Include="..\..\Adapter\Microsoft.Spark.CSharp\Adapter.csproj">
<Project>{ce999a96-f42b-4e80-b208-709d7f49a77c}</Project>
<Name>Adapter</Name>
</ProjectReference>
<ProjectReference Include="..\..\Worker\Microsoft.Spark.CSharp\Worker.csproj">
<Project>{82c9d3b2-e4fb-4713-b980-948c1e96a10a}</Project>
<Name>Worker</Name>
</ProjectReference>
</ItemGroup>
<Import Project="$(MSBuildToolsPath)\Microsoft.CSharp.targets" />
<!-- To modify your build process, add your task inside one of the targets below and uncomment it.
Other similar extension points exist, see Microsoft.Common.targets.
<Target Name="BeforeBuild">
</Target>
<Target Name="AfterBuild">
</Target>
-->
</Project>


@ -0,0 +1,108 @@
// Copyright (c) Microsoft. All rights reserved.
// Licensed under the MIT license. See LICENSE file in the project root for full license information.
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using Microsoft.Spark.CSharp.Core;
using Microsoft.Spark.CSharp.Interop;
namespace Microsoft.Spark.CSharp.Samples
{
class SparkContextSamples
{
[Serializable]
internal class BroadcastHelper<T>
{
private readonly T[] value;
internal BroadcastHelper(T[] value)
{
this.value = value;
}
internal IEnumerable<T> Execute(int i)
{
return value;
}
}
[Sample]
internal static void SparkContextBroadcastSample()
{
var b = SparkCLRSamples.SparkContext.Broadcast<int[]>(Enumerable.Range(1, 5).ToArray());
foreach (var value in b.Value)
Console.Write(value + " ");
Console.WriteLine();
b.Unpersist();
var r = SparkCLRSamples.SparkContext.Parallelize(new[] { 0, 0 }, 1).FlatMap(new BroadcastHelper<int>(b.Value).Execute).Collect();
foreach (var value in r)
Console.Write(value + " ");
Console.WriteLine();
}
[Serializable]
internal class AccumulatorHelper
{
private Accumulator<int> accumulator;
internal AccumulatorHelper(Accumulator<int> accumulator)
{
this.accumulator = accumulator;
}
internal int Execute(int input)
{
accumulator += 1;
return input;
}
}
[Sample]
internal static void SparkContextAccumulatorSample()
{
var a = SparkCLRSamples.SparkContext.Accumulator<int>(1);
var r = SparkCLRSamples.SparkContext.Parallelize(new[] { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 }, 3).Map(new AccumulatorHelper(a).Execute).Collect();
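// The accumulator is incremented on the workers while the Map runs; Collect() forces
// evaluation, after which the aggregated value is read on the driver via a.Value.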
Console.WriteLine(a.Value);
}
[Sample]
internal static void SparkContextSample()
{
Console.WriteLine(SparkCLRSamples.SparkContext.Version);
Console.WriteLine(SparkCLRSamples.SparkContext.SparkUser);
Console.WriteLine(SparkCLRSamples.SparkContext.StartTime);
Console.WriteLine(SparkCLRSamples.SparkContext.DefaultParallelism);
Console.WriteLine(SparkCLRSamples.SparkContext.DefaultMinPartitions);
var statusTracker = SparkCLRSamples.SparkContext.StatusTracker;
//var file = Path.GetTempFileName();
//File.WriteAllText(file, "Sample");
//SparkCLRSamples.SparkContext.AddFile(file);
var dir = Path.GetTempPath();
SparkCLRSamples.SparkContext.SetCheckpointDir(dir);
SparkCLRSamples.SparkContext.SetLogLevel("DEBUG");
//SparkCLRSamples.SparkContext.SetJobGroup("SampleGroupId", "Sample Description");
SparkCLRSamples.SparkContext.SetLocalProperty("SampleKey", "SampleValue");
Console.WriteLine(SparkCLRSamples.SparkContext.GetLocalProperty("SampleKey"));
SparkCLRSamples.SparkContext.CancelJobGroup("SampleGroupId");
SparkCLRSamples.SparkContext.CancelAllJobs();
}
[Sample]
internal static void SparkContextUnionSample()
{
var rdd1 = SparkCLRSamples.SparkContext.Parallelize(new int[] { 1, 1, 2, 3 }, 1);
var rdd2 = SparkCLRSamples.SparkContext.Parallelize(new int[] { 1, 1, 2, 3 }, 1);
Console.WriteLine(string.Join(",", SparkCLRSamples.SparkContext.Union(new[] { rdd1, rdd2 }).Collect()));
}
}
}


@ -0,0 +1,10 @@
-,;;;,US,EN,-,f4abeae0812248a899e9d80a777d83ef,workflowname,someothercolumns
-,;;;,GB,EN,-,f4abeae0812248a899e9d80a777d83ef,workflowname,someothercolumns
-,;;;,ES,ES,-,f4abeae0812248a899e9d80a777d83ef,workflowname,someothercolumns
-,;;;,GB,EN,-,f4abeae0812248a899e9d80a777d83ef,workflowname,someothercolumns
-,;;;,GB,EN,-,f4abeae0812248a899e9d80a777d83ef,workflowname,someothercolumns
-,;;;,US,EN,-,f4abeae0812248a899e9d80a777d83ef,workflowname,someothercolumns
-,;;;,US,EN,-,f4abeae0812248a899e9d80a777d83ef,workflowname,someothercolumns
-,;;;,US,EN,-,f4abeae0812248a899e9d80a777d83ef,workflowname,someothercolumns
-,;;;,US,EN,-,f4abeae0812248a899e9d80a777d83ef,workflowname,someothercolumns
-,;;;,US,EN,-,f4abeae0812248a899e9d80a777d83ef,workflowname,someothercolumns


@ -0,0 +1,281 @@
EN-BT
EN-KY
EN-IQ
EN-TO
ES-ES
EN-BA
EN-JM
FR-RE
EN-MZ
SR-LATN-RS
FIL-PH
EN-GH
EN-BW
FR-SN
SQ-AL
EN-MD
RW-RW
NL-SX
PT-BR
SV-FI
ES-NI
FR-LU
ZH-SG
LB-LU
MK-MK
MN-MN
EN-AU
EN-MO
UZ-LATN-UZ
EN-SA
KA-GE
NL-BE
FR-CI
SI-LK
EN-YE
FR-BE
UR-PK
EN-MA
FR-GA
HY-AM
EN-AG
IT-CH
AR-OM
LO-LA
FR-LB
FR-TN
EN-WW
RO-MD
EN-TJ
EN-MV
NL-CW
EN-IE
PT-MZ
EN-BS
FR-FR
ET-EE
EN-DK
ES-HN
ES-UY
PT-AO
AR-SD
DE-CH
ES-GT
BS-CYRL-BA
EN-SG
AR-DJ
KO-KR
EN-BZ
EN-TN
FR-PF
EN-UG
EN-ZM
MR-IN
EN-KW
EN-SZ
AR-SY
FR-CG
SR-LATN-BA
EN-GR
TR-TR
EN-LA
EL-GR
EN-CY
EN-TT
FR-YT
GU-IN
AR-YE
IS-IS
LV-LV
SR-RS
RO-RO
TA-IN
EN-OM
AR-PS
ZH-CN
EN-JO
NB-NO
PRS-AF
ES-PY
HE-IL
OR-IN
UK-UA
MS-MY
EN-GG
EN-GY
PA-IN
EN-PG
FR-HT
EN-ZA
ES-AR
SK-SK
PT-CV
FR-BJ
FR-NC
EN-TZ
FR-MR
EN-PH
EN-AO
EN-BF
EN-QA
EN-PK
FR-ML
EN-AZ
KM-KH
EN-PS
PS-AF
EN-LS
EN-MW
FR-NE
EN-LR
EN-SD
EN-MP
EN-ZW
EN-RS
DE-AT
BG-BG
IT-IT
MT-MT
FR-TG
HR-HR
EN-WS
ES-MX
FR-MG
AR-MA
FR-GP
TH-TH
EN-ML
EN-MU
ES-EC
EN-HK
FR-BF
EN-TC
AR-LY
ES-SV
ES-VE
MS-BN
FR-CH
ZH-TW
DE-DE
AM-ET
EN-EG
CA-ES
DA-DK
DE-LI
FR-BI
HU-HU
AR-LB
DE-LU
EN-LY
FR-MC
KK-KZ
EN-AE
EN-DM
EN-MY
EN-SK
VI-VN
EN-ME
ID-ID
AR-AE
NL-SR
EN-CZ
AR-BH
FI-FI
AR-EG
FR-CD
FR-MU
EN-JE
EN-UZ
FR-CM
EN-NA
EN-IN
ES-CL
AR-DZ
EU-ES
FA-IR
FR-DZ
RU-RU
EN-IL
PT-PT
AZ-LATN-AZ
EN-SY
PL-PL
EN-BD
EL-CY
AR-IQ
TK-TM
NN-NO
EN-ID
AR-QA
EN-SS
FR-CA
FR-GF
AR-TN
EN-DZ
TE-IN
AR-XA
EN-BB
ES-BO
FR-GN
SV-SE
ZH-MO
EN-SB
CS-CZ
TG-CYRL-TJ
EN-AS
ES-DO
NE-NP
SR-LATN-ME
ZH-HK
EN-FI
EN-NG
ES-PE
EN-GU
KY-KG
SR-ME
EN-BJ
EN-SO
NL-NL
FR-MQ
EN-BM
CA-AD
EN-HU
GL-ES
EN-FJ
HR-BA
AR-KW
BS-LATN-BA
EN-GB
JA-JP
EN-GD
SW-KE
AR-JO
ES-CO
HA-LATN-NG
EN-DJ
EN-GI
EN-MM
EN-VI
ML-IN
ES-US
EN-NZ
BN-BD
IT-SM
AF-ZA
FR-MA
AR-SA
EN-CA
SL-SI
EN-VN
EN-LB
ES-PA
EN-LC
ES-XL
HI-IN
EN-US
LT-LT
ES-PR
EN-BH
ES-CR
EN-XA
RU-BY


@ -0,0 +1,6 @@
-,02/03/2015,02:20:01,4628deca-139d-4121-b540-8341b9c05c2a,en,us,835
-,02/03/2015,02:21:11,04b2cb96-20af-4fe3-bd0e-f474d5b077a8,en,us,456
-,02/04/2015,05:55:07,03991dd8-f9a6-4ac4-8fd7-d81c6c61ff0d,en,gb,1045
-,02/04/2015,15:05:37,764a0280-efb0-48a0-b64d-31e3b34918d1,en,sg,1256
-,02/04/2015,15:05:37,99d549fb-c5c5-40da-b1f0-5dc212ece02a,en,us,654
-,02/04/2015,15:05:37,ff8fae98-4318-48da-bd22-ac1bab9978f5,en,us,786


@ -0,0 +1,2 @@
{"personid":"123", "orderid":"b123xyz", "order":{"itemcount":2,"totalamount":234.45}}
{"personid":"456", "orderid":"s456lmn", "order":{"itemcount":10,"totalamount":3045.23}}


@ -0,0 +1,3 @@
{"id":"123", "name":"Bill", "age":34, "address":{"city":"Columbus","state":"Ohio"}}
{"id":"456", "name":"Steve", "age":14, "address":{"city":null, "state":"California"}}
{"id":"789", "name":"Bill", "age":43, "address":{"city":"Seattle","state":"Washington"}}

View file

@ -0,0 +1,10 @@
4628deca-139d-4121-b540-8341b9c05c2a,iowa,ABTest123,production
04b2cb96-20af-4fe3-bd0e-f474d5b077a8,texas,ABTest123,test
145ee055-62a3-4d7b-951d-a5b461f151c6,illinois,ABTest456,production
4dc71a6c-da79-4fc4-a72a-861f2809e6ad,osaka,ABTest123,production
dd7a906f-8aaf-423f-b7fe-69c48d59d29b,singapore,ABTest456,test
03991dd8-f9a6-4ac4-8fd7-d81c6c61ff0d,ireland,ABTest456,test
9f819282-fd7f-47b0-8e9c-f0d134c33d58,iowa,ABTest123,production
99d549fb-c5c5-40da-b1f0-5dc212ece02a,texas,ABTest789,production
764a0280-efb0-48a0-b64d-31e3b34918d1,singapore,ABTest789,production
ff8fae98-4318-48da-bd22-ac1bab9978f5,iowa,ABTest456,test

View file

@ -0,0 +1,11 @@
The quick brown fox jumps over the lazy dog The quick brown fox jumps over the lazy dog The quick brown fox jumps over the lazy dog The quick brown fox jumps over the lazy dog
The quick brown fox jumps over the lazy dog The quick brown fox jumps over the lazy dog
The quick brown fox jumps over the lazy dog The quick brown fox jumps over the lazy dog The quick brown fox jumps over the lazy dog
The quick brown fox jumps over the lazy dog The quick brown fox jumps over the lazy dog The quick brown fox jumps over the lazy dog The quick brown fox jumps over the lazy dog The quick brown fox jumps over the lazy dog
The quick brown fox jumps over the lazy dog The quick brown fox jumps over the lazy dog The quick brown fox jumps over the lazy dog
The quick brown fox jumps over the lazy dog
The quick brown fox jumps over the lazy dog
The quick brown fox jumps over the lazy dog
The quick brown fox jumps over the lazy dog
The quick brown fox jumps over the lazy dog
The dog lazy

38
csharp/SparkCLR.sln Normal file
View file

@ -0,0 +1,38 @@

Microsoft Visual Studio Solution File, Format Version 12.00
# Visual Studio 2013
VisualStudioVersion = 12.0.30501.0
MinimumVisualStudioVersion = 10.0.40219.1
Project("{FAE04EC0-301F-11D3-BF4B-00C04F79EFBC}") = "Adapter", "Adapter\Microsoft.Spark.CSharp\Adapter.csproj", "{CE999A96-F42B-4E80-B208-709D7F49A77C}"
EndProject
Project("{FAE04EC0-301F-11D3-BF4B-00C04F79EFBC}") = "Worker", "Worker\Microsoft.Spark.CSharp\Worker.csproj", "{82C9D3B2-E4FB-4713-B980-948C1E96A10A}"
EndProject
Project("{FAE04EC0-301F-11D3-BF4B-00C04F79EFBC}") = "Samples", "Samples\Microsoft.Spark.CSharp\Samples.csproj", "{913E6A56-9839-4379-8B3C-855BA9341663}"
ProjectSection(ProjectDependencies) = postProject
{CE999A96-F42B-4E80-B208-709D7F49A77C} = {CE999A96-F42B-4E80-B208-709D7F49A77C}
{82C9D3B2-E4FB-4713-B980-948C1E96A10A} = {82C9D3B2-E4FB-4713-B980-948C1E96A10A}
EndProjectSection
EndProject
Global
GlobalSection(SolutionConfigurationPlatforms) = preSolution
Debug|Any CPU = Debug|Any CPU
Release|Any CPU = Release|Any CPU
EndGlobalSection
GlobalSection(ProjectConfigurationPlatforms) = postSolution
{CE999A96-F42B-4E80-B208-709D7F49A77C}.Debug|Any CPU.ActiveCfg = Debug|Any CPU
{CE999A96-F42B-4E80-B208-709D7F49A77C}.Debug|Any CPU.Build.0 = Debug|Any CPU
{CE999A96-F42B-4E80-B208-709D7F49A77C}.Release|Any CPU.ActiveCfg = Release|Any CPU
{CE999A96-F42B-4E80-B208-709D7F49A77C}.Release|Any CPU.Build.0 = Release|Any CPU
{82C9D3B2-E4FB-4713-B980-948C1E96A10A}.Debug|Any CPU.ActiveCfg = Debug|Any CPU
{82C9D3B2-E4FB-4713-B980-948C1E96A10A}.Debug|Any CPU.Build.0 = Debug|Any CPU
{82C9D3B2-E4FB-4713-B980-948C1E96A10A}.Release|Any CPU.ActiveCfg = Release|Any CPU
{82C9D3B2-E4FB-4713-B980-948C1E96A10A}.Release|Any CPU.Build.0 = Release|Any CPU
{913E6A56-9839-4379-8B3C-855BA9341663}.Debug|Any CPU.ActiveCfg = Debug|Any CPU
{913E6A56-9839-4379-8B3C-855BA9341663}.Debug|Any CPU.Build.0 = Debug|Any CPU
{913E6A56-9839-4379-8B3C-855BA9341663}.Release|Any CPU.ActiveCfg = Release|Any CPU
{913E6A56-9839-4379-8B3C-855BA9341663}.Release|Any CPU.Build.0 = Release|Any CPU
EndGlobalSection
GlobalSection(SolutionProperties) = preSolution
HideSolutionNode = FALSE
EndGlobalSection
EndGlobal

View file

@ -0,0 +1,327 @@
// Copyright (c) Microsoft. All rights reserved.
// Licensed under the MIT license. See LICENSE file in the project root for full license information.
using System;
using System.Collections;
using System.Collections.Generic;
using System.IO;
using System.Reflection;
using System.Runtime.Serialization;
using System.Runtime.Serialization.Formatters.Binary;
using Microsoft.Spark.CSharp.Core;
using Microsoft.Spark.CSharp.Interop.Ipc;
using Razorvine.Pickle;
namespace Microsoft.Spark.CSharp
{
/// <summary>
/// Worker implementation for SparkCLR. The implementation is identical to the
/// worker used in PySpark. The RDD implementation that forks an external process
/// and pipes data in and out between the JVM and the other runtime is already implemented in PySpark.
/// SparkCLR reuses the design and implementation of PythonRDD (CSharpRDD extends PythonRDD),
/// so the worker behavior is also identical between PySpark and SparkCLR.
/// </summary>
public class Worker
{
private const int END_OF_DATA_SECTION = -1;
private const int DOTNET_EXCEPTION_THROWN = -2;
private const int TIMING_DATA = -3;
private const int END_OF_STREAM = -4;
private const int NULL = -5;
static void Main(string[] args)
{
PrintFiles();
int javaPort = int.Parse(Console.ReadLine());
Log("java_port: " + javaPort);
var socket = new SparkCLRSocket();
socket.Initialize(javaPort);
using (socket)
using (socket.InitializeStream())
{
try
{
DateTime bootTime = DateTime.UtcNow;
int splitIndex = socket.ReadInt();
Log("split_index: " + splitIndex);
if (splitIndex == -1)
Environment.Exit(-1);
int versionLength = socket.ReadInt();
Log("ver_len: " + versionLength);
if (versionLength > 0)
{
string ver = socket.ReadString(versionLength);
Log("ver: " + ver);
}
//// initialize global state
//shuffle.MemoryBytesSpilled = 0
//shuffle.DiskBytesSpilled = 0
//_accumulatorRegistry.clear()
// fetch name of workdir
int sparkFilesDirectoryLength = socket.ReadInt();
Log("sparkFilesDirectoryLength: " + sparkFilesDirectoryLength);
if (sparkFilesDirectoryLength > 0)
{
string sparkFilesDir = socket.ReadString(sparkFilesDirectoryLength);
Log("spark_files_dir: " + sparkFilesDir);
//SparkFiles._root_directory = spark_files_dir
//SparkFiles._is_running_on_worker = True
}
// fetch names of includes - not used //TODO - complete the impl
int numberOfIncludesItems = socket.ReadInt();
Log("num_includes: " + numberOfIncludesItems);
if (numberOfIncludesItems > 0)
{
for (int i = 0; i < numberOfIncludesItems; i++)
{
string filename = socket.ReadString();
}
}
// fetch names and values of broadcast variables
int numBroadcastVariables = socket.ReadInt();
Log("num_broadcast_variables: " + numBroadcastVariables);
if (numBroadcastVariables > 0)
{
for (int i = 0; i < numBroadcastVariables; i++)
{
long bid = socket.ReadLong();
if (bid >= 0)
{
string path = socket.ReadString();
Broadcast.broadcastRegistry[bid] = new Broadcast(path);
}
else
{
bid = -bid - 1;
Broadcast.broadcastRegistry.Remove(bid);
}
}
}
Accumulator.accumulatorRegistry.Clear();
int lengthOfCommandByteArray = socket.ReadInt();
Log("command_len: " + lengthOfCommandByteArray);
IFormatter formatter = new BinaryFormatter();
if (lengthOfCommandByteArray > 0)
{
int length = socket.ReadInt();
Log("Deserializer mode length: " + length);
string deserializerMode = socket.ReadString(length);
Log("Deserializer mode: " + deserializerMode);
length = socket.ReadInt();
Log("Serializer mode length: " + length);
string serializerMode = socket.ReadString(length);
Log("Serializer mode: " + serializerMode);
int lengthOfFunc = socket.ReadInt();
Log("Length of func: " + lengthOfFunc);
byte[] command = socket.ReadBytes(lengthOfFunc);
Log("command bytes read: " + command.Length);
var stream = new MemoryStream(command);
var func = (Func<int, IEnumerable<dynamic>, IEnumerable<dynamic>>)formatter.Deserialize(stream);
DateTime initTime = DateTime.UtcNow;
int count = 0;
foreach (var message in func(splitIndex, GetIterator(socket, deserializerMode)))
{
byte[] buffer;
if (serializerMode == "None")
{
buffer = message as byte[];
}
else if (serializerMode == "String")
{
buffer = SerDe.ToBytes(message as string);
}
else if (serializerMode == "Row")
{
Pickler pickler = new Pickler();
buffer = pickler.dumps(new ArrayList { message });
}
else
{
try
{
var ms = new MemoryStream();
formatter.Serialize(ms, message);
buffer = ms.ToArray();
}
catch (Exception)
{
Log(string.Format("{0} : {1}", message.GetType().Name, message.GetType().FullName));
throw;
}
}
count++;
socket.Write(buffer.Length);
socket.Write(buffer);
}
//TODO - complete the impl
Log("Count: " + count);
//if profiler:
// profiler.profile(process)
//else:
// process()
DateTime finish_time = DateTime.UtcNow;
socket.Write(TIMING_DATA);
socket.Write(ToUnixTime(bootTime));
socket.Write(ToUnixTime(initTime));
socket.Write(ToUnixTime(finish_time));
socket.Write(0L); //shuffle.MemoryBytesSpilled
socket.Write(0L); //shuffle.DiskBytesSpilled
}
else
{
Log("Nothing to execute :-(");
}
//// Mark the beginning of the accumulators section of the output
socket.Write(END_OF_DATA_SECTION);
socket.Write(Accumulator.accumulatorRegistry.Count);
foreach (var item in Accumulator.accumulatorRegistry)
{
var ms = new MemoryStream();
var value = item.Value.GetType().GetField("value", BindingFlags.NonPublic | BindingFlags.Instance).GetValue(item.Value);
Log(string.Format("({0}, {1})", item.Key, value));
formatter.Serialize(ms, new KeyValuePair<int, dynamic>(item.Key, value));
byte[] buffer = ms.ToArray();
socket.Write(buffer.Length);
socket.Write(buffer);
}
int end = socket.ReadInt();
// check end of stream
if (end == END_OF_DATA_SECTION || end == END_OF_STREAM)
{
socket.Write(END_OF_STREAM);
Log("END_OF_STREAM: " + END_OF_STREAM);
}
else
{
// write a different value to tell JVM to not reuse this worker
socket.Write(END_OF_DATA_SECTION);
Environment.Exit(-1);
}
socket.Flush();
System.Threading.Thread.Sleep(1000); //TODO - not sure if this is really needed
}
catch (Exception e)
{
Log(e.ToString());
try
{
socket.Write(e.ToString());
}
catch (IOException)
{
// JVM closed the socket
}
catch (Exception ex)
{
LogError("CSharpWorker failed with exception:");
LogError(ex.ToString());
}
Environment.Exit(-1);
}
}
}
private static void PrintFiles()
{
Console.WriteLine("Files available in executor");
var driverFolder = Path.GetDirectoryName(Assembly.GetEntryAssembly().Location);
var files = Directory.EnumerateFiles(driverFolder);
foreach (var file in files)
{
Console.WriteLine(file);
}
}
private static long ToUnixTime(DateTime dt)
{
return (long)(dt - new DateTime(1970, 1, 1, 0, 0, 0, DateTimeKind.Utc)).TotalMilliseconds;
}
private static IEnumerable<dynamic> GetIterator(ISparkCLRSocket socket, string serializedMode)
{
Log("Serialized mode in GetIterator: " + serializedMode);
IFormatter formatter = new BinaryFormatter();
int messageLength;
do
{
messageLength = socket.ReadInt();
if (messageLength > 0)
{
byte[] buffer = socket.ReadBytes(messageLength);
switch (serializedMode)
{
case "String":
yield return SerDe.ToString(buffer);
break;
case "Row":
Unpickler unpickler = new Unpickler();
foreach (var item in (unpickler.loads(buffer) as object[]))
{
yield return item;
}
break;
case "Pair":
messageLength = socket.ReadInt();
byte[] value = socket.ReadBytes(messageLength);
yield return new KeyValuePair<byte[], byte[]>(buffer, value);
break;
case "Byte":
default:
var ms = new MemoryStream(buffer);
dynamic message = formatter.Deserialize(ms);
yield return message;
break;
}
}
} while (messageLength >= 0);
}
private static void Log(string message)
{
Console.WriteLine(message);
}
private static void LogError(string message)
{
Console.Error.WriteLine(message);
}
}
}
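The worker above deserializes the command bytes into a `Func<int, IEnumerable<dynamic>, IEnumerable<dynamic>>` with BinaryFormatter and invokes it once per partition. Below is a minimal standalone sketch of that round trip; it is not code from this commit, the `ToUpper` helper and class name are made up for illustration, and a static method is used so the delegate serializes.
```
using System;
using System.Collections.Generic;
using System.IO;
using System.Runtime.Serialization.Formatters.Binary;

class CommandRoundTripSketch
{
    // a static method keeps the delegate serializable by BinaryFormatter
    static IEnumerable<dynamic> ToUpper(int splitIndex, IEnumerable<dynamic> input)
    {
        foreach (var item in input)
            yield return ((string)item).ToUpper();
    }

    static void Main()
    {
        Func<int, IEnumerable<dynamic>, IEnumerable<dynamic>> func = ToUpper;
        var formatter = new BinaryFormatter();

        // driver side: serialize the function into the command byte array
        var ms = new MemoryStream();
        formatter.Serialize(ms, func);
        byte[] command = ms.ToArray();

        // worker side: deserialize and invoke for one partition
        var deserialized = (Func<int, IEnumerable<dynamic>, IEnumerable<dynamic>>)
            formatter.Deserialize(new MemoryStream(command));
        Console.WriteLine(string.Join(",", deserialized(0, new dynamic[] { "a", "b" }))); // prints A,B
    }
}
```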

View file

@ -0,0 +1,73 @@
<?xml version="1.0" encoding="utf-8"?>
<Project ToolsVersion="12.0" DefaultTargets="Build" xmlns="http://schemas.microsoft.com/developer/msbuild/2003">
<Import Project="$(MSBuildExtensionsPath)\$(MSBuildToolsVersion)\Microsoft.Common.props" Condition="Exists('$(MSBuildExtensionsPath)\$(MSBuildToolsVersion)\Microsoft.Common.props')" />
<PropertyGroup>
<Configuration Condition=" '$(Configuration)' == '' ">Debug</Configuration>
<Platform Condition=" '$(Platform)' == '' ">AnyCPU</Platform>
<ProjectGuid>{82C9D3B2-E4FB-4713-B980-948C1E96A10A}</ProjectGuid>
<OutputType>Exe</OutputType>
<AppDesignerFolder>Properties</AppDesignerFolder>
<RootNamespace>Microsoft.Spark.CSharp</RootNamespace>
<AssemblyName>CSharpWorker</AssemblyName>
<TargetFrameworkVersion>v4.5</TargetFrameworkVersion>
<FileAlignment>512</FileAlignment>
</PropertyGroup>
<PropertyGroup Condition=" '$(Configuration)|$(Platform)' == 'Debug|AnyCPU' ">
<PlatformTarget>AnyCPU</PlatformTarget>
<DebugSymbols>true</DebugSymbols>
<DebugType>full</DebugType>
<Optimize>false</Optimize>
<OutputPath>bin\Debug\</OutputPath>
<DefineConstants>DEBUG;TRACE</DefineConstants>
<ErrorReport>prompt</ErrorReport>
<WarningLevel>4</WarningLevel>
</PropertyGroup>
<PropertyGroup Condition=" '$(Configuration)|$(Platform)' == 'Release|AnyCPU' ">
<PlatformTarget>AnyCPU</PlatformTarget>
<DebugType>pdbonly</DebugType>
<Optimize>true</Optimize>
<OutputPath>bin\Release\</OutputPath>
<DefineConstants>TRACE</DefineConstants>
<ErrorReport>prompt</ErrorReport>
<WarningLevel>4</WarningLevel>
</PropertyGroup>
<ItemGroup>
<Reference Include="Razorvine.Pyrolite">
<HintPath>..\..\packages\Razorvine.Pyrolite.4.10.0.0\lib\net40\Razorvine.Pyrolite.dll</HintPath>
</Reference>
<Reference Include="Razorvine.Serpent">
<HintPath>..\..\packages\Razorvine.Serpent.1.12.0.0\lib\net40\Razorvine.Serpent.dll</HintPath>
</Reference>
<Reference Include="System" />
<Reference Include="System.Core" />
<Reference Include="System.Runtime.Serialization" />
<Reference Include="System.Xml.Linq" />
<Reference Include="System.Data.DataSetExtensions" />
<Reference Include="Microsoft.CSharp" />
<Reference Include="System.Data" />
<Reference Include="System.Xml" />
</ItemGroup>
<ItemGroup>
<Compile Include="Worker.cs" />
</ItemGroup>
<ItemGroup>
<Folder Include="Properties\" />
</ItemGroup>
<ItemGroup>
<ProjectReference Include="..\..\Adapter\Microsoft.Spark.CSharp\Adapter.csproj">
<Project>{ce999a96-f42b-4e80-b208-709d7f49a77c}</Project>
<Name>Adapter</Name>
</ProjectReference>
</ItemGroup>
<ItemGroup>
<None Include="packages.config" />
</ItemGroup>
<Import Project="$(MSBuildToolsPath)\Microsoft.CSharp.targets" />
<!-- To modify your build process, add your task inside one of the targets below and uncomment it.
Other similar extension points exist, see Microsoft.Common.targets.
<Target Name="BeforeBuild">
</Target>
<Target Name="AfterBuild">
</Target>
-->
</Project>

View file

@ -0,0 +1,5 @@
<?xml version="1.0" encoding="utf-8"?>
<packages>
<package id="Razorvine.Pyrolite" version="4.10.0.0" targetFramework="net45" />
<package id="Razorvine.Serpent" version="1.12.0.0" targetFramework="net45" />
</packages>

96
scala/pom.xml Normal file
View file

@ -0,0 +1,96 @@
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.microsoft.spark</groupId>
<artifactId>spark-clr</artifactId>
<version>${spark.version}-SNAPSHOT</version>
<name>${project.artifactId}</name>
<description>C# language binding and extensions to Apache Spark</description>
<inceptionYear>2015</inceptionYear>
<licenses>
<license>
<name>MIT License</name>
<url>https://github.com/Microsoft/SparkCLR/blob/master/LICENSE</url>
<distribution>repo</distribution>
</license>
</licenses>
<properties>
<maven.compiler.source>1.5</maven.compiler.source>
<maven.compiler.target>1.5</maven.compiler.target>
<encoding>UTF-8</encoding>
<scala.version>2.10.4</scala.version>
<spark.version>1.4.1</spark.version>
<scala.binary.version>2.10</scala.binary.version>
</properties>
<dependencies>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-compiler</artifactId>
<version>${scala.version}</version>
</dependency>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-reflect</artifactId>
<version>${scala.version}</version>
</dependency>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>${scala.version}</version>
</dependency>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-actors</artifactId>
<version>${scala.version}</version>
</dependency>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scalap</artifactId>
<version>${scala.version}</version>
</dependency>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.8.1</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.10</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.10</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>com.databricks</groupId>
<artifactId>spark-csv_2.10</artifactId>
<version>1.2.0</version>
</dependency>
</dependencies>
<build>
<sourceDirectory>src/main</sourceDirectory>
<testSourceDirectory>src/test</testSourceDirectory>
<plugins>
<plugin>
<groupId>org.scala-tools</groupId>
<artifactId>maven-scala-plugin</artifactId>
<executions>
<execution>
<goals>
<goal>compile</goal>
<goal>testCompile</goal>
</goals>
</execution>
</executions>
<configuration>
<scalaVersion>${scala.version}</scalaVersion>
</configuration>
</plugin>
</plugins>
</build>
</project>

View file

@ -0,0 +1,81 @@
// Copyright (c) Microsoft. All rights reserved.
// Licensed under the MIT license. See LICENSE file in the project root for full license information.
package org.apache.spark.api.csharp
import java.net.InetSocketAddress
import java.util.concurrent.TimeUnit
import io.netty.bootstrap.ServerBootstrap
import io.netty.channel.nio.NioEventLoopGroup
import io.netty.channel.socket.SocketChannel
import io.netty.channel.socket.nio.NioServerSocketChannel
import io.netty.channel.{ChannelInitializer, EventLoopGroup, ChannelFuture}
import io.netty.handler.codec.LengthFieldBasedFrameDecoder
import io.netty.handler.codec.bytes.{ByteArrayDecoder, ByteArrayEncoder}
/**
* Netty server that invokes JVM calls based on messages received
* from C# in SparkCLR.
* This implementation is identical to RBackend, which could be reused
* in SparkCLR if the handler were made pluggable
*/
// Since SparkCLR is an add-on package to Spark and not part of spark-core, it mirrors the implementation of
// selected parts of RBackend with SparkCLR customizations
class CSharpBackend {
private[this] var channelFuture: ChannelFuture = null
private[this] var bootstrap: ServerBootstrap = null
private[this] var bossGroup: EventLoopGroup = null
def init(): Int = {
bossGroup = new NioEventLoopGroup(2)
val workerGroup = bossGroup
val handler = new CSharpBackendHandler(this) //TODO - work with SparkR devs to make this configurable and reuse RBackend
bootstrap = new ServerBootstrap()
.group(bossGroup, workerGroup)
.channel(classOf[NioServerSocketChannel])
bootstrap.childHandler(new ChannelInitializer[SocketChannel]() {
def initChannel(ch: SocketChannel): Unit = {
ch.pipeline()
.addLast("encoder", new ByteArrayEncoder())
.addLast("frameDecoder",
// maxFrameLength = 2G
// lengthFieldOffset = 0
// lengthFieldLength = 4
// lengthAdjustment = 0
// initialBytesToStrip = 4, i.e. strip out the length field itself
//new LengthFieldBasedFrameDecoder(Integer.MAX_VALUE, 0, 4, 0, 4))
new LengthFieldBasedFrameDecoder(Integer.MAX_VALUE, 0, 4, 0, 4))
.addLast("decoder", new ByteArrayDecoder())
.addLast("handler", handler)
}
})
channelFuture = bootstrap.bind(new InetSocketAddress("localhost", 0))
channelFuture.syncUninterruptibly()
channelFuture.channel().localAddress().asInstanceOf[InetSocketAddress].getPort()
}
def run(): Unit = {
channelFuture.channel.closeFuture().syncUninterruptibly()
}
def close(): Unit = {
if (channelFuture != null) {
// close is a local operation and should finish within milliseconds; timeout just to be safe
channelFuture.channel().close().awaitUninterruptibly(10, TimeUnit.SECONDS)
channelFuture = null
}
if (bootstrap != null && bootstrap.group() != null) {
bootstrap.group().shutdownGracefully()
}
if (bootstrap != null && bootstrap.childGroup() != null) {
bootstrap.childGroup().shutdownGracefully()
}
bootstrap = null
}
}
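The pipeline above frames every message with a 4-byte length prefix that LengthFieldBasedFrameDecoder strips before the payload reaches the handler. As a rough sketch (not code from this commit; the class and method names are hypothetical), a CLR-side client could frame one request against the port returned by init() like this:
```
using System;
using System.Net;
using System.Net.Sockets;

static class FramedRequestSketch
{
    // sends one frame in the format expected by
    // LengthFieldBasedFrameDecoder(maxFrameLength, 0, 4, 0, 4) on the JVM side
    public static void Send(int backendPort, byte[] payload)
    {
        using (var client = new TcpClient("localhost", backendPort))
        using (var stream = client.GetStream())
        {
            // 4-byte big-endian (network order) length prefix, then the payload bytes
            byte[] prefix = BitConverter.GetBytes(IPAddress.HostToNetworkOrder(payload.Length));
            stream.Write(prefix, 0, prefix.Length);
            stream.Write(payload, 0, payload.Length);
        }
    }
}
```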

View file

@ -0,0 +1,228 @@
// Copyright (c) Microsoft. All rights reserved.
// Licensed under the MIT license. See LICENSE file in the project root for full license information.
package org.apache.spark.api.csharp
import org.apache.spark.util.Utils
import java.io.{DataOutputStream, ByteArrayOutputStream, DataInputStream, ByteArrayInputStream}
import io.netty.channel.{ChannelHandlerContext, SimpleChannelInboundHandler}
import org.apache.spark.api.csharp.SerDe._ //TODO - work with SparkR devs to make this configurable and reuse RBackendHandler
import scala.collection.mutable.HashMap
/**
* Handler for CSharpBackend.
* This implementation is identical to RBackendHandler, which could be reused
* in SparkCLR if SerDe were made pluggable
*/
// Since SparkCLR is an add-on package to Spark and not part of spark-core, it mirrors the implementation of
// selected parts of RBackend with SparkCLR customizations
class CSharpBackendHandler(server: CSharpBackend) extends SimpleChannelInboundHandler[Array[Byte]] {
override def channelRead0(ctx: ChannelHandlerContext, msg: Array[Byte]): Unit = {
val bis = new ByteArrayInputStream(msg)
val dis = new DataInputStream(bis)
val bos = new ByteArrayOutputStream()
val dos = new DataOutputStream(bos)
// First byte is the isStatic flag
val isStatic = readBoolean(dis)
val objId = readString(dis)
val methodName = readString(dis)
val numArgs = readInt(dis)
if (objId == "SparkCLRHandler") {
methodName match {
case "stopBackend" =>
writeInt(dos, 0)
writeType(dos, "void")
server.close()
case "rm" =>
try {
val t = readObjectType(dis)
assert(t == 'c')
val objToRemove = readString(dis)
JVMObjectTracker.remove(objToRemove)
writeInt(dos, 0)
writeObject(dos, null)
} catch {
case e: Exception =>
logError(s"Removing $objId failed", e)
writeInt(dos, -1)
}
case _ => dos.writeInt(-1)
}
} else {
handleMethodCall(isStatic, objId, methodName, numArgs, dis, dos)
}
val reply = bos.toByteArray
ctx.write(reply)
}
override def channelReadComplete(ctx: ChannelHandlerContext): Unit = {
ctx.flush()
}
override def exceptionCaught(ctx: ChannelHandlerContext, cause: Throwable): Unit = {
// Close the connection when an exception is raised.
println("Exception caught: " + cause.getMessage)
cause.printStackTrace()
ctx.close()
}
def handleMethodCall(
isStatic: Boolean,
objId: String,
methodName: String,
numArgs: Int,
dis: DataInputStream,
dos: DataOutputStream): Unit = {
var obj: Object = null
try {
val cls = if (isStatic) {
Utils.classForName(objId)
} else {
JVMObjectTracker.get(objId) match {
case None => throw new IllegalArgumentException("Object not found " + objId)
case Some(o) =>
obj = o
o.getClass
}
}
val args = readArgs(numArgs, dis)
val methods = cls.getMethods
val selectedMethods = methods.filter(m => m.getName == methodName)
if (selectedMethods.length > 0) {
val methods = selectedMethods.filter { x =>
matchMethod(numArgs, args, x.getParameterTypes)
}
if (methods.isEmpty) {
logWarning(s"cannot find matching method ${cls}.$methodName. "
+ s"Candidates are:")
selectedMethods.foreach { method =>
logWarning(s"$methodName(${method.getParameterTypes.mkString(",")})")
}
throw new Exception(s"No matched method found for $cls.$methodName")
}
val ret = methods.head.invoke(obj, args : _*)
// Write status bit
writeInt(dos, 0)
writeObject(dos, ret.asInstanceOf[AnyRef])
} else if (methodName == "<init>") {
// methodName should be "<init>" for constructor
val ctor = cls.getConstructors.filter { x =>
matchMethod(numArgs, args, x.getParameterTypes)
}.head
val obj = ctor.newInstance(args : _*)
writeInt(dos, 0)
writeObject(dos, obj.asInstanceOf[AnyRef])
} else {
throw new IllegalArgumentException("invalid method " + methodName + " for object " + objId)
}
} catch {
case e: Exception =>
logError(s"$methodName on $objId failed", e)
println(e.getMessage)
e.printStackTrace()
writeInt(dos, -1)
}
}
// Read a number of arguments from the data input stream
def readArgs(numArgs: Int, dis: DataInputStream): Array[java.lang.Object] = {
(0 until numArgs).map { arg =>
readObject(dis)
}.toArray
}
// Checks if the arguments passed in args matches the parameter types.
// NOTE: Currently we do exact match. We may add type conversions later.
def matchMethod(
numArgs: Int,
args: Array[java.lang.Object],
parameterTypes: Array[Class[_]]): Boolean = {
if (parameterTypes.length != numArgs) {
return false
}
for (i <- 0 to numArgs - 1) {
val parameterType = parameterTypes(i)
var parameterWrapperType = parameterType
// Convert native parameters to Object types as args is Array[Object] here
if (parameterType.isPrimitive) {
parameterWrapperType = parameterType match {
case java.lang.Integer.TYPE => classOf[java.lang.Integer]
case java.lang.Double.TYPE => classOf[java.lang.Double]
case java.lang.Boolean.TYPE => classOf[java.lang.Boolean]
case _ => parameterType
}
}
if (!parameterWrapperType.isInstance(args(i))) {
//if (!parameterWrapperType.isAssignableFrom(args(i).getClass)) {
if (!parameterType.isPrimitive && args(i) != null) {
return false
}
//}
}
}
true
}
def logError(id: String) {
println(id)
}
def logWarning(id: String) {
println(id)
}
def logError(id: String, e: Exception): Unit = {
}
}
/**
* Tracks JVM objects returned to C# which is useful for invoking calls from C# to JVM objects
*/
private object JVMObjectTracker {
// TODO: This map should be thread-safe if we want to support multiple
// connections at the same time
private[this] val objMap = new HashMap[String, Object]
// TODO: We support only one connection now, so an integer is fine.
// Investigate using an atomic integer in the future.
private[this] var objCounter: Int = 1
def getObject(id: String): Object = {
objMap(id)
}
def get(id: String): Option[Object] = {
objMap.get(id)
}
def put(obj: Object): String = {
val objId = objCounter.toString
objCounter = objCounter + 1
objMap.put(objId, obj)
objId
}
def remove(id: String): Option[Object] = {
objMap.remove(id)
}
}
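channelRead0 above expects, in order, a boolean isStatic flag, the object id, the method name and the argument count, followed by the serialized arguments. Assuming the same big-endian, length-prefixed encoding that SerDe reads, a hypothetical CLR-side writer for that header might look like the sketch below (all names are illustrative, not the Adapter's actual API):
```
using System;
using System.IO;
using System.Net;
using System.Text;

static class MethodCallHeaderSketch
{
    static void WriteInt(Stream s, int value)
    {
        // DataInputStream on the JVM side reads big-endian ints
        byte[] bytes = BitConverter.GetBytes(IPAddress.HostToNetworkOrder(value));
        s.Write(bytes, 0, bytes.Length);
    }

    static void WriteString(Stream s, string value)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(value);
        WriteInt(s, bytes.Length);       // length prefix, matching SerDe.readString
        s.Write(bytes, 0, bytes.Length);
    }

    // builds the header consumed by channelRead0: isStatic, objId, methodName, numArgs;
    // the serialized arguments would follow this header
    public static byte[] Build(bool isStatic, string objId, string methodName, int numArgs)
    {
        var ms = new MemoryStream();
        ms.WriteByte(isStatic ? (byte)1 : (byte)0); // readBoolean reads a single byte
        WriteString(ms, objId);
        WriteString(ms, methodName);
        WriteInt(ms, numArgs);
        return ms.ToArray();
    }
}
```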

View file

@ -0,0 +1,36 @@
// Copyright (c) Microsoft. All rights reserved.
// Licensed under the MIT license. See LICENSE file in the project root for full license information.
package org.apache.spark.api.csharp
import java.util.{ArrayList => JArrayList, List => JList, Map => JMap}
import org.apache.spark.api.python.{PythonBroadcast, PythonRDD}
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD
import org.apache.spark.{Accumulator, SparkContext}
/**
* RDD used for forking an external C# process and piping data in and out
* between the JVM and the CLR. Since PythonRDD already has the required implementation,
* it simply extends PythonRDD without overriding any behavior for now
*/
class CSharpRDD(
@transient parent: RDD[_],
command: Array[Byte],
envVars: JMap[String, String],
cSharpIncludes: JList[String],
preservePartitioning: Boolean,
cSharpExec: String,
cSharpVer: String,
broadcastVars: JList[Broadcast[PythonBroadcast]],
accumulator: Accumulator[JList[Array[Byte]]])
extends PythonRDD (parent, command, envVars, cSharpIncludes, preservePartitioning, cSharpExec, cSharpVer, broadcastVars, accumulator) {
}
object CSharpRDD {
def createRDDFromArray(sc: SparkContext, arr: Array[Array[Byte]], numSlices: Int): RDD[Array[Byte]] = {
sc.parallelize(arr, numSlices)
}
}

View file

@ -0,0 +1,350 @@
// Copyright (c) Microsoft. All rights reserved.
// Licensed under the MIT license. See LICENSE file in the project root for full license information.
package org.apache.spark.api.csharp
import java.io.{DataOutputStream, DataInputStream}
import java.sql.{Time, Timestamp, Date}
import scala.collection.JavaConversions._
/**
* Functions to serialize and deserialize between CLR & JVM.
* The implementation of these methods is mostly identical to the SerDe implementation in SparkR.
*/
//TODO look into the possibility of reusing SerDe from R implementation
object SerDe {
def readObjectType(dis: DataInputStream): Char = {
dis.readByte().toChar
}
def readObject(dis: DataInputStream): Object = {
val dataType = readObjectType(dis)
readTypedObject(dis, dataType)
}
def readTypedObject(
dis: DataInputStream,
dataType: Char): Object = {
dataType match {
case 'n' => null
case 'i' => new java.lang.Integer(readInt(dis))
case 'g' => new java.lang.Long(readLong(dis))
case 'd' => new java.lang.Double(readDouble(dis))
case 'b' => new java.lang.Boolean(readBoolean(dis))
case 'c' => readString(dis)
case 'e' => readMap(dis)
case 'r' => readBytes(dis)
case 'l' => readList(dis)
case 'D' => readDate(dis)
case 't' => readTime(dis)
case 'j' => JVMObjectTracker.getObject(readString(dis))
case _ => throw new IllegalArgumentException(s"Invalid type $dataType")
}
}
def readBytes(in: DataInputStream): Array[Byte] = {
val len = readInt(in)
val out = new Array[Byte](len)
in.readFully(out)
out
}
def readInt(in: DataInputStream): Int = {
in.readInt()
}
def readLong(in: DataInputStream): Long = {
in.readLong()
}
def readDouble(in: DataInputStream): Double = {
in.readDouble()
}
def readStringBytes(in: DataInputStream, len: Int): String = {
val bytes = new Array[Byte](len)
in.readFully(bytes)
//assert(bytes(len - 1) == 0)
val str = new String(bytes/*.dropRight(1)*/, "UTF-8")
str
}
def readString(in: DataInputStream): String = {
val len = in.readInt()
readStringBytes(in, len)
}
def readBoolean(in: DataInputStream): Boolean = {
//val intVal = in.readInt()
//if (intVal == 0) false else true
return in.readBoolean()
}
def readDate(in: DataInputStream): Date = {
Date.valueOf(readString(in))
}
def readTime(in: DataInputStream): Timestamp = {
val seconds = in.readDouble()
val sec = Math.floor(seconds).toLong
val t = new Timestamp(sec * 1000L)
t.setNanos(((seconds - sec) * 1e9).toInt)
t
}
def readBytesArr(in: DataInputStream): Array[Array[Byte]] = {
val len = readInt(in)
(0 until len).map(_ => readBytes(in)).toArray
}
def readIntArr(in: DataInputStream): Array[Int] = {
val len = readInt(in)
(0 until len).map(_ => readInt(in)).toArray
}
def readLongArr(in: DataInputStream): Array[Long] = {
val len = readInt(in)
(0 until len).map(_ => readLong(in)).toArray
}
def readDoubleArr(in: DataInputStream): Array[Double] = {
val len = readInt(in)
(0 until len).map(_ => readDouble(in)).toArray
}
def readBooleanArr(in: DataInputStream): Array[Boolean] = {
val len = readInt(in)
(0 until len).map(_ => readBoolean(in)).toArray
}
def readStringArr(in: DataInputStream): Array[String] = {
val len = readInt(in)
(0 until len).map(_ => readString(in)).toArray
}
def readList(dis: DataInputStream): Array[_] = {
val arrType = readObjectType(dis)
arrType match {
case 'i' => readIntArr(dis)
case 'g' => readLongArr(dis)
case 'c' => readStringArr(dis)
case 'd' => readDoubleArr(dis)
case 'b' => readBooleanArr(dis)
case 'j' => readStringArr(dis).map(x => JVMObjectTracker.getObject(x))
case 'r' => readBytesArr(dis)
case _ => throw new IllegalArgumentException(s"Invalid array type $arrType")
}
}
def readMap(in: DataInputStream): java.util.Map[Object, Object] = {
val len = readInt(in)
if (len > 0) {
val keysType = readObjectType(in)
val keysLen = readInt(in)
val keys = (0 until keysLen).map(_ => readTypedObject(in, keysType))
val valuesLen = readInt(in)
val values = (0 until valuesLen).map(_ => {
val valueType = readObjectType(in)
readTypedObject(in, valueType)
})
mapAsJavaMap(keys.zip(values).toMap)
} else {
new java.util.HashMap[Object, Object]()
}
}
// Using the same mapping as the SparkR implementation for now
// Methods to write out data from Java to C#
//
// Type mapping from Java to C#
//
// void -> NULL
// Int -> integer
// String -> character
// Boolean -> logical
// Float -> double
// Double -> double
// Long -> double
// Array[Byte] -> raw
// Date -> Date
// Time -> POSIXct
//
// Array[T] -> list()
// Object -> jobj
def writeType(dos: DataOutputStream, typeStr: String): Unit = {
typeStr match {
case "void" => dos.writeByte('n')
case "character" => dos.writeByte('c')
case "double" => dos.writeByte('d')
case "integer" => dos.writeByte('i')
case "logical" => dos.writeByte('b')
case "date" => dos.writeByte('D')
case "time" => dos.writeByte('t')
case "raw" => dos.writeByte('r')
case "list" => dos.writeByte('l')
case "jobj" => dos.writeByte('j')
case _ => throw new IllegalArgumentException(s"Invalid type $typeStr")
}
}
def writeObject(dos: DataOutputStream, value: Object): Unit = {
if (value == null) {
writeType(dos, "void")
} else {
value.getClass.getName match {
case "java.lang.String" =>
writeType(dos, "character")
writeString(dos, value.asInstanceOf[String])
case "long" | "java.lang.Long" =>
writeType(dos, "double")
writeDouble(dos, value.asInstanceOf[Long].toDouble)
case "float" | "java.lang.Float" =>
writeType(dos, "double")
writeDouble(dos, value.asInstanceOf[Float].toDouble)
case "double" | "java.lang.Double" =>
writeType(dos, "double")
writeDouble(dos, value.asInstanceOf[Double])
case "int" | "java.lang.Integer" =>
writeType(dos, "integer")
writeInt(dos, value.asInstanceOf[Int])
case "boolean" | "java.lang.Boolean" =>
writeType(dos, "logical")
writeBoolean(dos, value.asInstanceOf[Boolean])
case "java.sql.Date" =>
writeType(dos, "date")
writeDate(dos, value.asInstanceOf[Date])
case "java.sql.Time" =>
writeType(dos, "time")
writeTime(dos, value.asInstanceOf[Time])
case "java.sql.Timestamp" =>
writeType(dos, "time")
writeTime(dos, value.asInstanceOf[Timestamp])
case "[B" =>
writeType(dos, "raw")
writeBytes(dos, value.asInstanceOf[Array[Byte]])
// TODO: Types not handled right now include
// byte, char, short, float
// Handle arrays
case "[Ljava.lang.String;" =>
writeType(dos, "list")
writeStringArr(dos, value.asInstanceOf[Array[String]])
case "[I" =>
writeType(dos, "list")
writeIntArr(dos, value.asInstanceOf[Array[Int]])
case "[J" =>
writeType(dos, "list")
writeDoubleArr(dos, value.asInstanceOf[Array[Long]].map(_.toDouble))
case "[D" =>
writeType(dos, "list")
writeDoubleArr(dos, value.asInstanceOf[Array[Double]])
case "[Z" =>
writeType(dos, "list")
writeBooleanArr(dos, value.asInstanceOf[Array[Boolean]])
case "[[B" =>
writeType(dos, "list")
writeBytesArr(dos, value.asInstanceOf[Array[Array[Byte]]])
case otherName =>
// Handle array of objects
if (otherName.startsWith("[L")) {
val objArr = value.asInstanceOf[Array[Object]]
writeType(dos, "list")
writeType(dos, "jobj")
dos.writeInt(objArr.length)
objArr.foreach(o => writeJObj(dos, o))
} else {
writeType(dos, "jobj")
writeJObj(dos, value)
}
}
}
}
def writeInt(out: DataOutputStream, value: Int): Unit = {
out.writeInt(value)
}
def writeDouble(out: DataOutputStream, value: Double): Unit = {
out.writeDouble(value)
}
def writeBoolean(out: DataOutputStream, value: Boolean): Unit = {
//val intValue = if (value) 1 else 0
//out.writeInt(intValue)
out.writeBoolean(value)
}
def writeDate(out: DataOutputStream, value: Date): Unit = {
writeString(out, value.toString)
}
def writeTime(out: DataOutputStream, value: Time): Unit = {
out.writeDouble(value.getTime.toDouble / 1000.0)
}
def writeTime(out: DataOutputStream, value: Timestamp): Unit = {
out.writeDouble((value.getTime / 1000).toDouble + value.getNanos.toDouble / 1e9)
}
// NOTE: Only works for ASCII right now
def writeString(out: DataOutputStream, value: String): Unit = {
/*val len = value.length
out.writeInt(len + 1) // For the \0
out.writeBytes(value)
out.writeByte(0)*/
val len = value.length
out.writeInt(len)
out.writeBytes(value)
}
def writeBytes(out: DataOutputStream, value: Array[Byte]): Unit = {
out.writeInt(value.length)
out.write(value)
}
def writeJObj(out: DataOutputStream, value: Object): Unit = {
val objId = JVMObjectTracker.put(value)
writeString(out, objId)
}
def writeIntArr(out: DataOutputStream, value: Array[Int]): Unit = {
writeType(out, "integer")
out.writeInt(value.length)
value.foreach(v => out.writeInt(v))
}
def writeDoubleArr(out: DataOutputStream, value: Array[Double]): Unit = {
writeType(out, "double")
out.writeInt(value.length)
value.foreach(v => out.writeDouble(v))
}
def writeBooleanArr(out: DataOutputStream, value: Array[Boolean]): Unit = {
writeType(out, "logical")
out.writeInt(value.length)
value.foreach(v => writeBoolean(out, v))
}
def writeStringArr(out: DataOutputStream, value: Array[String]): Unit = {
writeType(out, "character")
out.writeInt(value.length)
value.foreach(v => writeString(out, v))
}
def writeBytesArr(out: DataOutputStream, value: Array[Array[Byte]]): Unit = {
writeType(out, "raw")
out.writeInt(value.length)
value.foreach(v => writeBytes(out, v))
}
}
private object SerializationFormats {
val BYTE = "byte"
val STRING = "string"
val ROW = "row"
}
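Mirroring writeType/writeObject above, a reader on the CLR side dispatches on the one-byte type code before decoding the value. The sketch below is hypothetical and trimmed to a few codes ('n', 'i', 'c', 'b'); ints are big-endian and strings are length-prefixed UTF-8, matching the DataOutputStream writes above:
```
using System;
using System.IO;
using System.Net;
using System.Text;

static class SerDeReaderSketch
{
    static int ReadInt(Stream s)
    {
        var buffer = new byte[4];
        s.Read(buffer, 0, 4);
        return IPAddress.NetworkToHostOrder(BitConverter.ToInt32(buffer, 0)); // big-endian on the wire
    }

    static string ReadString(Stream s)
    {
        int length = ReadInt(s);
        var buffer = new byte[length];
        s.Read(buffer, 0, length);
        return Encoding.UTF8.GetString(buffer);
    }

    // reads one typed value: 'n' => null, 'i' => int, 'c' => string, 'b' => bool
    public static object ReadObject(Stream s)
    {
        var typeCode = (char)s.ReadByte();
        switch (typeCode)
        {
            case 'n': return null;
            case 'i': return ReadInt(s);
            case 'c': return ReadString(s);
            case 'b': return s.ReadByte() != 0;
            default: throw new NotSupportedException("type code not covered in this sketch: " + typeCode);
        }
    }
}
```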

View file

@ -0,0 +1,108 @@
// Copyright (c) Microsoft. All rights reserved.
// Licensed under the MIT license. See LICENSE file in the project root for full license information.
package org.apache.spark.deploy.csharp
import java.io.File
import java.util.concurrent.{Semaphore, TimeUnit}
import org.apache.spark.api.csharp.CSharpBackend
import org.apache.spark.deploy.{SparkSubmitArguments, PythonRunner}
import org.apache.spark.util.{Utils, RedirectThread}
/**
* Launched by sparkclr-submit.cmd. It launches CSharpBackend, gets its port number and launches
* the C# process, passing the port number to it.
* The runner implementation is mostly identical to RRunner, with SparkCLR-specific customizations
*/
object CSharpRunner {
def main(args: Array[String]): Unit = {
// determines if CSharpBackend needs to be run in debug mode
// in debug mode this runner will not launch the C# process
var runInDebugMode = false
if (args.length == 0) {
throw new IllegalArgumentException("At least one argument is expected for CSharpRunner")
}
if (args.length == 1 && args(0).equalsIgnoreCase("debug")) {
runInDebugMode = true
println("Debug mode is set. CSharp executable will not be launched as a sub-process.")
}
var csharpExecutable = ""
if (!runInDebugMode) {
csharpExecutable = PythonRunner.formatPath(args(0)) //reusing windows-specific formatting in PythonRunner
}
val otherArgs = args.slice(1, args.length)
var processParameters = new java.util.ArrayList[String]()
processParameters.add(csharpExecutable)
otherArgs.foreach( arg => processParameters.add(arg) )
println("Starting CSharpBackend!")
// Time to wait for CSharpBackend to initialize in seconds
val backendTimeout = sys.env.getOrElse("CSHARPBACKEND_TIMEOUT", "120").toInt
// Launch a SparkCLR backend server for the C# process to connect to; this will let it see our
// Java system properties etc.
val csharpBackend = new CSharpBackend()
@volatile var csharpBackendPortNumber = 0
val initialized = new Semaphore(0)
val csharpBackendThread = new Thread("CSharpBackend") {
override def run() {
csharpBackendPortNumber = csharpBackend.init()
println("Port number used by CSharpBackend is " + csharpBackendPortNumber) //TODO - send to logger also
initialized.release()
csharpBackend.run()
}
}
csharpBackendThread.start()
if (initialized.tryAcquire(backendTimeout, TimeUnit.SECONDS)) {
if (!runInDebugMode) {
val returnCode = try {
val builder = new ProcessBuilder(processParameters)
val env = builder.environment()
env.put("CSHARPBACKEND_PORT", csharpBackendPortNumber.toString)
for ((key, value) <- Utils.getSystemProperties if key.startsWith("spark.")) {
env.put(key, value)
println("adding key=" + key + " and value=" + value + " to environment")
}
builder.redirectErrorStream(true) // Ugly but needed for stdout and stderr to synchronize
val process = builder.start()
new RedirectThread(process.getInputStream, System.out, "redirect CSharp output").start()
process.waitFor()
} catch {
case e: Exception => println(e.getMessage + "\n" + e.getStackTrace)
}
finally {
closeBackend(csharpBackend)
}
println("Return CSharpBackend code " + returnCode)
System.exit(returnCode.toString.toInt)
} else {
println("***************************************************")
println("* Backend running debug mode. Press enter to exit *")
println("***************************************************")
Console.readLine()
closeBackend(csharpBackend)
System.exit(0)
}
} else {
// scalastyle:off println
println("CSharpBackend did not initialize in " + backendTimeout + " seconds")
// scalastyle:on println
System.exit(-1)
}
}
def closeBackend(csharpBackend: CSharpBackend): Unit = {
println("closing CSharpBackend")
csharpBackend.close()
}
}
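CSharpRunner hands the backend port to the launched process through the CSHARPBACKEND_PORT environment variable (along with any spark.* system properties). A CLR-side driver would therefore begin by reading that variable, roughly as in this sketch (illustrative only, not the shipped driver code):
```
using System;

class DriverBootstrapSketch
{
    static void Main()
    {
        // CSHARPBACKEND_PORT is set by CSharpRunner before it starts the C# process
        string portValue = Environment.GetEnvironmentVariable("CSHARPBACKEND_PORT");
        if (string.IsNullOrEmpty(portValue))
        {
            Console.Error.WriteLine("CSHARPBACKEND_PORT is not set; launch through sparkclr-submit");
            Environment.Exit(-1);
        }
        int backendPort = int.Parse(portValue);
        Console.WriteLine("Connecting to CSharpBackend on port " + backendPort);
        // a socket to localhost:backendPort would be opened here to issue JVM calls
    }
}
```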

View file

@ -0,0 +1,174 @@
// Copyright (c) Microsoft. All rights reserved.
// Licensed under the MIT license. See LICENSE file in the project root for full license information.
package org.apache.spark.deploy.csharp
import java.io.File
import java.lang.reflect.{InvocationTargetException, UndeclaredThrowableException, Modifier}
import java.net.URL
import java.security.PrivilegedExceptionAction
import org.apache.hadoop.security.UserGroupInformation
import org.apache.spark.deploy.rest.SubmitRestConnectionException
import org.apache.spark.deploy.{SparkSubmit, SparkSubmitAction, SparkSubmitArguments}
import org.apache.spark.util.{ChildFirstURLClassLoader, Utils, MutableURLClassLoader}
import scala.collection.mutable.Map
/**
* Used to submit, kill or request status of SparkCLR applications.
* The implementation is a simpler version of SparkSubmit and
* "handles setting up the classpath with relevant Spark dependencies and provides
* a layer over the different cluster managers and deploy modes that Spark supports".
*/
// Since SparkCLR is an add-on package to Spark and not part of spark-core, it reimplements
// selected parts of SparkSubmit with SparkCLR customizations
object SparkCLRSubmit {
def main(args: Array[String]): Unit = {
val appArgs = new SparkSubmitArguments(args)
appArgs.action match {
case SparkSubmitAction.SUBMIT => submit(appArgs)
//case SparkSubmitAction.KILL => kill(appArgs)
//case SparkSubmitAction.REQUEST_STATUS => requestStatus(appArgs)
}
}
def submit(args: SparkSubmitArguments): Unit = {
val (childArgs, childClasspath, sysProps, childMainClass) = SparkSubmit.prepareSubmitEnvironment(args)
def doRunMain(): Unit = {
if (args.proxyUser != null) {
val proxyUser = UserGroupInformation.createProxyUser(args.proxyUser,
UserGroupInformation.getCurrentUser())
try {
proxyUser.doAs(new PrivilegedExceptionAction[Unit]() {
override def run(): Unit = {
runMain(childArgs, childClasspath, sysProps, childMainClass, args.verbose)
}
})
} catch {
case e: Exception =>
// Hadoop's AuthorizationException suppresses the exception's stack trace, which
// makes the message printed to the output by the JVM not very helpful. Instead,
// detect exceptions with empty stack traces here, and treat them differently.
if (e.getStackTrace().length == 0) {
// scalastyle:off println
//printStream.println(s"ERROR: ${e.getClass().getName()}: ${e.getMessage()}")
// scalastyle:on println
//exitFn(1)
} else {
throw e
}
}
} else {
runMain(childArgs, childClasspath, sysProps, childMainClass, args.verbose)
}
}
// In standalone cluster mode, there are two submission gateways:
// (1) The traditional Akka gateway using o.a.s.deploy.Client as a wrapper
// (2) The new REST-based gateway introduced in Spark 1.3
// The latter is the default behavior as of Spark 1.3, but Spark submit will fail over
// to use the legacy gateway if the master endpoint turns out to be not a REST server.
if (args.isStandaloneCluster && args.useRest) {
try {
// scalastyle:off println
//printStream.println("Running Spark using the REST application submission protocol.")
// scalastyle:on println
doRunMain()
} catch {
// Fail over to use the legacy submission gateway
case e: SubmitRestConnectionException =>
//printWarning(s"Master endpoint ${args.master} was not a REST server. " +
//"Falling back to legacy submission gateway instead.")
args.useRest = false
submit(args)
}
// In all other modes, just run the main class as prepared
} else {
doRunMain()
}
}
private def runMain(
childArgs: Seq[String],
childClasspath: Seq[String],
sysProps: Map[String, String],
childMainClass: String,
verbose: Boolean): Unit = {
val loader =
if (sysProps.getOrElse("spark.driver.userClassPathFirst", "false").toBoolean) {
new ChildFirstURLClassLoader(new Array[URL](0),
Thread.currentThread.getContextClassLoader)
} else {
new MutableURLClassLoader(new Array[URL](0),
Thread.currentThread.getContextClassLoader)
}
Thread.currentThread.setContextClassLoader(loader)
for (jar <- childClasspath) {
addJarToClasspath(jar, loader)
}
for ((key, value) <- sysProps) {
println("key=" + key + ", value=" + value)
System.setProperty(key, value)
}
var mainClass: Class[_] = null
try {
mainClass = Utils.classForName("org.apache.spark.deploy.csharp.CSharpRunner")
} catch {
case e: ClassNotFoundException =>
/* e.printStackTrace(printStream)
if (childMainClass.contains("thriftserver")) {
// scalastyle:off println
printStream.println(s"Failed to load main class $childMainClass.")
printStream.println("You need to build Spark with -Phive and -Phive-thriftserver.")
// scalastyle:on println
}
System.exit(CLASS_NOT_FOUND_EXIT_STATUS)*/
}
val mainMethod = mainClass.getMethod("main", new Array[String](0).getClass)
if (!Modifier.isStatic(mainMethod.getModifiers)) {
throw new IllegalStateException("The main method in the given main class must be static")
}
def findCause(t: Throwable): Throwable = t match {
case e: UndeclaredThrowableException =>
if (e.getCause() != null) findCause(e.getCause()) else e
case e: InvocationTargetException =>
if (e.getCause() != null) findCause(e.getCause()) else e
case e: Throwable =>
e
}
try {
println("Invoking Main method with following args")
childArgs.foreach( s => println(s))
mainMethod.invoke(null, childArgs.toArray)
} catch {
case t: Throwable =>
throw findCause(t)
}
}
private def addJarToClasspath(localJar: String, loader: MutableURLClassLoader) {
val uri = Utils.resolveURI(localJar)
uri.getScheme match {
case "file" | "local" =>
val file = new File(uri.getPath)
if (file.exists()) {
loader.addURL(file.toURI.toURL)
} else {
//printWarning(s"Local jar $file does not exist, skipping.")
}
case _ =>
//printWarning(s"Skip remote jar $uri.")
}
}
}

View file

@ -0,0 +1,220 @@
// Copyright (c) Microsoft. All rights reserved.
// Licensed under the MIT license. See LICENSE file in the project root for full license information.
package org.apache.spark.sql.api.csharp
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream}
import org.apache.spark.SparkContext
import org.apache.spark.api.csharp.SerDe
import org.apache.spark.api.java.{JavaRDD, JavaSparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.types.{DataType, FloatType, StructField, StructType}
import org.apache.spark.sql._
/**
* Utility functions for DataFrame in SparkCLR.
* The implementation is mostly identical to the SQLUtils used by SparkR,
* since CSharpSpark derives most of its design ideas and
* implementation constructs from SparkR
*/
object SQLUtils {
def createSQLContext(sc: SparkContext): SQLContext = {
new SQLContext(sc)
}
def getJavaSparkContext(sqlCtx: SQLContext): JavaSparkContext = {
new JavaSparkContext(sqlCtx.sparkContext)
}
def toSeq[T](arr: Array[T]): Seq[T] = {
arr.toSeq
}
def createStructType(fields : Seq[StructField]): StructType = {
StructType(fields)
}
def getSQLDataType(dataType: String): DataType = {
dataType match {
case "byte" => org.apache.spark.sql.types.ByteType
case "integer" => org.apache.spark.sql.types.IntegerType
case "float" => org.apache.spark.sql.types.FloatType
case "double" => org.apache.spark.sql.types.DoubleType
case "numeric" => org.apache.spark.sql.types.DoubleType
case "character" => org.apache.spark.sql.types.StringType
case "string" => org.apache.spark.sql.types.StringType
case "binary" => org.apache.spark.sql.types.BinaryType
case "raw" => org.apache.spark.sql.types.BinaryType
case "logical" => org.apache.spark.sql.types.BooleanType
case "boolean" => org.apache.spark.sql.types.BooleanType
case "timestamp" => org.apache.spark.sql.types.TimestampType
case "date" => org.apache.spark.sql.types.DateType
case _ => throw new IllegalArgumentException(s"Invalid type $dataType")
}
}
def createStructField(name: String, dataType: String, nullable: Boolean): StructField = {
val dtObj = getSQLDataType(dataType)
StructField(name, dtObj, nullable)
}
def createDF(rdd: RDD[Array[Byte]], schema: StructType, sqlContext: SQLContext): DataFrame = {
val num = schema.fields.size
val rowRDD = rdd.map(bytesToRow(_, schema))
sqlContext.createDataFrame(rowRDD, schema)
}
def dfToRowRDD(df: DataFrame): RDD[Array[Byte]] = {
df.map(r => rowToCSharpBytes(r))
}
private[this] def doConversion(data: Object, dataType: DataType): Object = {
data match {
case d: java.lang.Double if dataType == FloatType =>
new java.lang.Float(d)
case _ => data
}
}
private[this] def bytesToRow(bytes: Array[Byte], schema: StructType): Row = {
val bis = new ByteArrayInputStream(bytes)
val dis = new DataInputStream(bis)
val num = SerDe.readInt(dis)
Row.fromSeq((0 until num).map { i =>
doConversion(SerDe.readObject(dis), schema.fields(i).dataType)
}.toSeq)
}
private[this] def rowToCSharpBytes(row: Row): Array[Byte] = {
val bos = new ByteArrayOutputStream()
val dos = new DataOutputStream(bos)
SerDe.writeInt(dos, row.length)
(0 until row.length).map { idx =>
val obj: Object = row(idx).asInstanceOf[Object]
SerDe.writeObject(dos, obj)
}
bos.toByteArray()
}
def dfToCols(df: DataFrame): Array[Array[Byte]] = {
// localDF is Array[Row]
val localDF = df.collect()
val numCols = df.columns.length
// dfCols is Array[Array[Any]]
val dfCols = convertRowsToColumns(localDF, numCols)
dfCols.map { col =>
colToCSharpBytes(col)
}
}
def convertRowsToColumns(localDF: Array[Row], numCols: Int): Array[Array[Any]] = {
(0 until numCols).map { colIdx =>
localDF.map { row =>
row(colIdx)
}
}.toArray
}
def colToCSharpBytes(col: Array[Any]): Array[Byte] = {
val numRows = col.length
val bos = new ByteArrayOutputStream()
val dos = new DataOutputStream(bos)
SerDe.writeInt(dos, numRows)
col.map { item =>
val obj: Object = item.asInstanceOf[Object]
SerDe.writeObject(dos, obj)
}
bos.toByteArray()
}
def saveMode(mode: String): SaveMode = {
mode match {
case "append" => SaveMode.Append
case "overwrite" => SaveMode.Overwrite
case "error" => SaveMode.ErrorIfExists
case "ignore" => SaveMode.Ignore
}
}
def loadDF(
sqlContext: SQLContext,
source: String,
options: java.util.Map[String, String]): DataFrame = {
sqlContext.read.format(source).options(options).load()
}
def loadDF(
sqlContext: SQLContext,
source: String,
schema: StructType,
options: java.util.Map[String, String]): DataFrame = {
sqlContext.read.format(source).schema(schema).options(options).load()
}
def loadDF(
sqlContext: SQLContext,
source: String,
schema: StructType): DataFrame = {
sqlContext.read.format(source).schema(schema).load()
}
def loadTextFile(sqlContext: SQLContext, path: String, hasHeader: Boolean, inferSchema: Boolean) : DataFrame = {
var dfReader = sqlContext.read.format("com.databricks.spark.csv")
if (hasHeader)
{
dfReader = dfReader.option("header", "true")
}
if (inferSchema)
{
dfReader = dfReader.option("inferSchema", "true")
}
dfReader.load(path)
}
def loadTextFile(sqlContext: SQLContext, path: String, delimiter: String, schema: StructType) : DataFrame = {
val stringRdd = sqlContext.sparkContext.textFile(path)
val rowRdd = stringRdd.map{s =>
val columns = s.split(delimiter)
columns.length match {
case 1 => RowFactory.create(columns(0))
case 2 => RowFactory.create(columns(0),columns(1))
case 3 => RowFactory.create(columns(0),columns(1),columns(2))
case 4 => RowFactory.create(columns(0),columns(1),columns(2),columns(3))
case 5 => RowFactory.create(columns(0),columns(1),columns(2),columns(3),columns(4))
case 6 => RowFactory.create(columns(0),columns(1),columns(2),columns(3),columns(4),columns(5))
case 7 => RowFactory.create(columns(0),columns(1),columns(2),columns(3),columns(4),columns(5),columns(6))
case 8 => RowFactory.create(columns(0),columns(1),columns(2),columns(3),columns(4),columns(5),columns(6),columns(7))
case 9 => RowFactory.create(columns(0),columns(1),columns(2),columns(3),columns(4),columns(5),columns(6),columns(7),columns(8))
case 10 => RowFactory.create(columns(0),columns(1),columns(2),columns(3),columns(4),columns(5),columns(6),columns(7),columns(8),columns(9))
case 11 => RowFactory.create(columns(0),columns(1),columns(2),columns(3),columns(4),columns(5),columns(6),columns(7),columns(8),columns(9),columns(10))
case 12 => RowFactory.create(columns(0),columns(1),columns(2),columns(3),columns(4),columns(5),columns(6),columns(7),columns(8),columns(9),columns(10),columns(11))
case 13 => RowFactory.create(columns(0),columns(1),columns(2),columns(3),columns(4),columns(5),columns(6),columns(7),columns(8),columns(9),columns(10),columns(11),columns(12))
case 14 => RowFactory.create(columns(0),columns(1),columns(2),columns(3),columns(4),columns(5),columns(6),columns(7),columns(8),columns(9),columns(10),columns(11),columns(12),columns(13))
case 15 => RowFactory.create(columns(0),columns(1),columns(2),columns(3),columns(4),columns(5),columns(6),columns(7),columns(8),columns(9),columns(10),columns(11),columns(12),columns(13),columns(14))
case 16 => RowFactory.create(columns(0),columns(1),columns(2),columns(3),columns(4),columns(5),columns(6),columns(7),columns(8),columns(9),columns(10),columns(11),columns(12),columns(13),columns(14),columns(15))
case 17 => RowFactory.create(columns(0),columns(1),columns(2),columns(3),columns(4),columns(5),columns(6),columns(7),columns(8),columns(9),columns(10),columns(11),columns(12),columns(13),columns(14),columns(15),columns(16))
case 18 => RowFactory.create(columns(0),columns(1),columns(2),columns(3),columns(4),columns(5),columns(6),columns(7),columns(8),columns(9),columns(10),columns(11),columns(12),columns(13),columns(14),columns(15),columns(16),columns(17))
case 19 => RowFactory.create(columns(0),columns(1),columns(2),columns(3),columns(4),columns(5),columns(6),columns(7),columns(8),columns(9),columns(10),columns(11),columns(12),columns(13),columns(14),columns(15),columns(16),columns(17),columns(18))
case 20 => RowFactory.create(columns(0),columns(1),columns(2),columns(3),columns(4),columns(5),columns(6),columns(7),columns(8),columns(9),columns(10),columns(11),columns(12),columns(13),columns(14),columns(15),columns(16),columns(17),columns(18),columns(19))
case 21 => RowFactory.create(columns(0),columns(1),columns(2),columns(3),columns(4),columns(5),columns(6),columns(7),columns(8),columns(9),columns(10),columns(11),columns(12),columns(13),columns(14),columns(15),columns(16),columns(17),columns(18),columns(19),columns(20))
case 22 => RowFactory.create(columns(0),columns(1),columns(2),columns(3),columns(4),columns(5),columns(6),columns(7),columns(8),columns(9),columns(10),columns(11),columns(12),columns(13),columns(14),columns(15),columns(16),columns(17),columns(18),columns(19),columns(20),columns(21))
case 23 => RowFactory.create(columns(0),columns(1),columns(2),columns(3),columns(4),columns(5),columns(6),columns(7),columns(8),columns(9),columns(10),columns(11),columns(12),columns(13),columns(14),columns(15),columns(16),columns(17),columns(18),columns(19),columns(20),columns(21),columns(22))
case 24 => RowFactory.create(columns(0),columns(1),columns(2),columns(3),columns(4),columns(5),columns(6),columns(7),columns(8),columns(9),columns(10),columns(11),columns(12),columns(13),columns(14),columns(15),columns(16),columns(17),columns(18),columns(19),columns(20),columns(21),columns(22),columns(23))
case 25 => RowFactory.create(columns(0),columns(1),columns(2),columns(3),columns(4),columns(5),columns(6),columns(7),columns(8),columns(9),columns(10),columns(11),columns(12),columns(13),columns(14),columns(15),columns(16),columns(17),columns(18),columns(19),columns(20),columns(21),columns(22),columns(23),columns(24))
case 26 => RowFactory.create(columns(0),columns(1),columns(2),columns(3),columns(4),columns(5),columns(6),columns(7),columns(8),columns(9),columns(10),columns(11),columns(12),columns(13),columns(14),columns(15),columns(16),columns(17),columns(18),columns(19),columns(20),columns(21),columns(22),columns(23),columns(24),columns(25))
case 27 => RowFactory.create(columns(0),columns(1),columns(2),columns(3),columns(4),columns(5),columns(6),columns(7),columns(8),columns(9),columns(10),columns(11),columns(12),columns(13),columns(14),columns(15),columns(16),columns(17),columns(18),columns(19),columns(20),columns(21),columns(22),columns(23),columns(24),columns(25),columns(26))
case 28 => RowFactory.create(columns(0),columns(1),columns(2),columns(3),columns(4),columns(5),columns(6),columns(7),columns(8),columns(9),columns(10),columns(11),columns(12),columns(13),columns(14),columns(15),columns(16),columns(17),columns(18),columns(19),columns(20),columns(21),columns(22),columns(23),columns(24),columns(25),columns(26),columns(27))
case 29 => RowFactory.create(columns(0),columns(1),columns(2),columns(3),columns(4),columns(5),columns(6),columns(7),columns(8),columns(9),columns(10),columns(11),columns(12),columns(13),columns(14),columns(15),columns(16),columns(17),columns(18),columns(19),columns(20),columns(21),columns(22),columns(23),columns(24),columns(25),columns(26),columns(27),columns(28))
case 30 => RowFactory.create(columns(0),columns(1),columns(2),columns(3),columns(4),columns(5),columns(6),columns(7),columns(8),columns(9),columns(10),columns(11),columns(12),columns(13),columns(14),columns(15),columns(16),columns(17),columns(18),columns(19),columns(20),columns(21),columns(22),columns(23),columns(24),columns(25),columns(26),columns(27),columns(28),columns(29))
case _ => throw new Exception("Text files with more than 30 columns are currently not supported") //TODO - if the requirement comes up, generate code for additional columns
}
}
sqlContext.createDataFrame(rowRdd, schema)
}
}
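
A possible simplification, not part of this commit, would be to rely on the Java varargs signature of RowFactory.create(Object...) and spread the split columns directly, which removes the need for the generated 30-arity match and its column ceiling. The object and method names below (RowFromColumns, toRow) and the delimiter handling are illustrative assumptions only:

```
// Hypothetical sketch only - not the code in this commit.
// RowFactory.create(Object... values) accepts any arity, so the split
// columns can be spread into it directly instead of matching on arity.
import org.apache.spark.sql.{Row, RowFactory}

object RowFromColumns {
  def toRow(line: String, delimiter: String): Row = {
    val columns: Array[String] = line.split(delimiter, -1) // -1 keeps trailing empty columns
    RowFactory.create(columns: _*)
  }

  def main(args: Array[String]): Unit = {
    println(toRow("a,b,c", ",")) // prints [a,b,c]
  }
}
```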
118
scripts/sparkclr-submit.cmd Normal file
View file

@ -0,0 +1,118 @@
@echo off
setlocal enabledelayedexpansion
if "%SPARK_HOME%" == "" goto :sparkhomeerror
if "%JAVA_HOME%" == "" goto :javahomeerror
if "%SPARKCLR_HOME%" == "" goto :sparkclrhomeerror
if "%SPARK_CONF_DIR%" == "" (
SET SPARK_CONF_DIR=%SPARK_HOME%\conf
)
call %SPARK_HOME%\bin\load-spark-env.cmd
rem Test that an argument was given
if "x%1"=="x" (
goto :usage
)
set ASSEMBLY_DIR=%SPARK_HOME%\lib
for %%d in (%ASSEMBLY_DIR%\spark-assembly*hadoop*.jar) do (
set SPARK_ASSEMBLY_JAR=%%d
)
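rem the loop above leaves the last matching spark-assembly jar in SPARK_ASSEMBLY_JAR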
if "%SPARK_ASSEMBLY_JAR%"=="0" (
echo Failed to find Spark assembly JAR.
exit /b 1
)
set SPARKCLR_JAR=csharp-spark-1.4.1-SNAPSHOT.jar
set SPARKCLR_CLASSPATH=%SPARKCLR_HOME%\lib\%SPARKCLR_JAR%
set LAUNCH_CLASSPATH=%SPARK_ASSEMBLY_JAR%;%SPARKCLR_CLASSPATH%
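rem launcher JVM classpath carries both the Spark assembly and the SparkCLR jar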
set SPARKCLR_SUBMIT_CLASS=org.apache.spark.deploy.csharp.SparkCLRSubmit
set SPARK_SUBMIT_CLASS=org.apache.spark.deploy.SparkSubmit
set JARS=%SPARKCLR_CLASSPATH%
if not "%SPARKCSV_JARS%" == "" (
SET JARS=%JARS%,%SPARKCSV_JARS%
)
if not "%CSHARPSPARK_APP_JARS%" == "" (
SET JARS=%JARS%,%CSHARPSPARK_APP_JARS%
)
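rem JARS (SparkCLR jar plus any optional SPARKCSV_JARS and CSHARPSPARK_APP_JARS) is passed to spark-submit through --jars below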
if "%1"=="debug" (
goto :debugmode
)
rem The launcher library prints the command to be executed in a single line suitable for being
rem executed by the batch interpreter. So capture all the output of the launcher into a temporary file and read it back.
set LAUNCHER_OUTPUT=%temp%\spark-class-launcher-output-%RANDOM%.txt
%JAVA_HOME%\bin\java -cp %LAUNCH_CLASSPATH% org.apache.spark.launcher.Main %SPARK_SUBMIT_CLASS% --jars %JARS% --class %SPARKCLR_SUBMIT_CLASS% %* > %LAUNCHER_OUTPUT%
REM *********************************************************************************
REM ** TODO ** - replace the following sections in the script with functionality implemented in Scala
REM ** TODO ** - that will call org.apache.spark.launcher.Main, perform class name substitution, do classpath prefixing and generate the command to run
REM The following sections are simply a hack to leverage existing Spark artifacts - this also helps keep CSharpSpark aligned with Spark's approach for arg parsing etc.
REM Following block replaces SparkSubmit with SparkCLRSubmit
REM *********************************************************************************
set LAUNCHER_OUTPUT_TEMP=sparkclr-submit-temp.txt
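rem uses delayed expansion substring replacement (!var:old=new!) to swap %SPARK_SUBMIT_CLASS% for %SPARKCLR_SUBMIT_CLASS% in the launcher output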
for /f "tokens=* delims= " %%A in ( '"type %LAUNCHER_OUTPUT%"') do (
SET originalstring=%%A
SET modifiedstring=!originalstring:%SPARK_SUBMIT_CLASS%=%SPARKCLR_SUBMIT_CLASS%!
echo !modifiedstring! >> %LAUNCHER_OUTPUT_TEMP%
)
del %LAUNCHER_OUTPUT%
REM *********************************************************************************
REM *********************************************************************************
REM Following block prefixes classpath with SPARKCLR_JAR
REM *********************************************************************************
set LAUNCHER_OUTPUT_TEMP2=sparkclr-submit-temp2.txt
set CLASSPATH_SUBSTRING=-cp "
set UPDATED_CLASSPATH_SUBSTRING=-cp "%SPARKCLR_CLASSPATH%;
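rem prepends %SPARKCLR_CLASSPATH% inside the -cp "..." argument of the generated command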
for /f "tokens=* delims= " %%A in ( '"type %LAUNCHER_OUTPUT_TEMP%"') do (
SET originalstring2=%%A
SET modifiedstring2=!originalstring2:%CLASSPATH_SUBSTRING%=%UPDATED_CLASSPATH_SUBSTRING%!
echo !modifiedstring2! >> %LAUNCHER_OUTPUT_TEMP2%
)
del %LAUNCHER_OUTPUT_TEMP%
REM *********************************************************************************
for /f "tokens=*" %%i in (%LAUNCHER_OUTPUT_TEMP2%) do (
set SPARK_CMD=%%i
)
del %LAUNCHER_OUTPUT_TEMP2%
REM launches the Spark job with SparkCLRSubmit as the Main class
echo Command to run: %SPARK_CMD%
%SPARK_CMD%
goto :eof
:debugmode
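rem debug mode bypasses the spark-submit launcher and invokes CSharpRunner directly with the launcher classpath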
%JAVA_HOME%\bin\java -cp %LAUNCH_CLASSPATH% org.apache.spark.deploy.csharp.CSharpRunner debug
goto :eof
:sparkhomeerror
@echo Error - SPARK_HOME environment variable is not set
@echo Note that the SPARK_HOME environment variable should not have a trailing \
goto :eof
:javahomeerror
@echo Error - JAVA_HOME environment variable is not set
@echo Note that the JAVA_HOME environment variable should not have a trailing \
goto :eof
:sparkclrhomeerror
@echo Error - SPARKCLR_HOME environment variable is not set
@echo SPARKCLR_HOME needs to be set to the SparkCLR folder that contains lib\csharp-spark*.jar
@echo Note that the SPARKCLR_HOME environment variable should not have a trailing \
goto :eof
:usage
@echo Error - invalid usage.
@echo Correct usage is as follows:
@echo sparkclr-submit.cmd [--verbose] [--master local] [--name testapp] d:\SparkCLRHome\lib\spark-clr-1.4.1-SNAPSHOT.jar c:\sparkclrapp\driver\csdriver.exe arg1 arg2 arg3