C# and F# language binding and extensions to Apache Spark

SparkCLR

SparkCLR (pronounced Sparkler) adds C# language binding to Apache Spark, enabling the implementation of Spark driver code and data processing operations in C#.

For example, the word count sample in Apache Spark can be implemented in C# as follows:

var lines = sparkContext.TextFile(@"hdfs://path/to/input.txt");  
var words = lines.FlatMap(s => s.Split(new[] { " " }, StringSplitOptions.None));
var wordCounts = words.Map(w => new KeyValuePair<string, int>(w.Trim(), 1))  
                      .ReduceByKey((x, y) => x + y);  
var wordCountCollection = wordCounts.Collect();  
wordCounts.SaveAsTextFile(@"hdfs://path/to/wordcount.txt");  

A simple DataFrame application using TempTable may look like the following:

var reqDataFrame = sqlContext.TextFile(@"hdfs://path/to/requests.csv");
var metricDataFrame = sqlContext.TextFile(@"hdfs://path/to/metrics.csv");
reqDataFrame.RegisterTempTable("requests");
metricDataFrame.RegisterTempTable("metrics");
// C0 - guid in requests DataFrame, C3 - guid in metrics DataFrame  
var joinDataFrame = sqlContext.Sql(
    "SELECT joinedtable.datacenter" +
         ", MAX(joinedtable.latency) maxlatency" +
         ", AVG(joinedtable.latency) avglatency " + 
    "FROM (" +
       "SELECT a.C1 as datacenter, b.C6 as latency " +  
       "FROM requests a JOIN metrics b ON a.C0  = b.C3) joinedtable " +   
    "GROUP BY datacenter");
joinDataFrame.ShowSchema();
joinDataFrame.Show();

A simple DataFrame application using DataFrame DSL may look like the following:

// C0 - guid, C1 - datacenter
var reqDataFrame = sqlContext.TextFile(@"hdfs://path/to/requests.csv")  
                             .Select("C0", "C1");    
// C3 - guid, C6 - latency   
var metricDataFrame = sqlContext.TextFile(@"hdfs://path/to/metrics.csv", ",", false, true)
                                .Select("C3", "C6"); //override delimiter, hasHeader & inferSchema
var joinDataFrame = reqDataFrame.Join(metricDataFrame, reqDataFrame["C0"] == metricDataFrame["C3"])
                                .GroupBy("C1");
var maxLatencyByDcDataFrame = joinDataFrame.Agg(new Dictionary<string, string> { { "C6", "max" } });
maxLatencyByDcDataFrame.ShowSchema();
maxLatencyByDcDataFrame.Show();

Refer to the SparkCLR\csharp\Samples directory for complete samples.

Documents

Refer to the docs folder.

Building SparkCLR

Prerequisites

JDK should be downloaded manually, and the following environment variable should be set properly in the Developer Command Prompt for Visual Studio:

  • JAVA_HOME

Instructions

  • In the Developer Command Prompt for Visual Studio where JAVA_HOME is set, navigate to the SparkCLR directory and run:

    Build.cmd  
    
  • Optional:

    • Under the SparkCLR\scala directory, run the following command to clean the spark-clr*.jar built above:

      mvn clean
      
    • Under the SparkCLR\csharp directory, run the following command to clean the .NET binaries built above:

      Clean.cmd  
      

Build.cmd downloads the necessary build tools; after the build completes, it prepares the following directories under SparkCLR\run:

  • lib ( spark-clr*.jar )
  • bin ( Microsoft.Spark.CSharp.Adapter.dll, CSharpWorker.exe)
  • samples ( The contents of SparkCLR\csharp\Samples\Microsoft.Spark.CSharp\bin\Release\*, including Microsoft.Spark.CSharp.Adapter.dll, CSharpWorker.exe, SparkCLRSamples.exe, SparkCLRSamples.exe.Config etc. )
  • scripts ( sparkclr-submit.cmd )
  • data ( SparkCLR\csharp\Samples\Microsoft.Spark.CSharp\data\* )

Running Samples

Prerequisites

JDK should be downloaded manually, and the following environment variable should be set properly in the Developer Command Prompt for Visual Studio:

  • JAVA_HOME

Running in Local mode

In the Developer Command Prompt for Visual Studio where JAVA_HOME is set, navigate to the SparkCLR directory and run:

RunSamples.cmd

Build.cmd must be run before RunSamples.cmd.

RunSamples.cmd downloads Apache Spark 1.4.1, sets the SPARK_HOME environment variable, points SPARKCLR_HOME to the SparkCLR\run directory created by Build.cmd, and invokes sparkclr-submit.cmd with spark.local.dir set to SparkCLR\run\Temp.

A few more RunSamples.cmd examples:

  • To display all options supported by RunSamples.cmd:

    RunSamples.cmd --help
    
  • To run PiSample only:

    RunSamples.cmd --torun pi*
    
  • To run PiSample in verbose mode, with all logs displayed at the console:

    RunSamples.cmd --torun pi* --verbose
    
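For reference, the automation that RunSamples.cmd performs can be sketched roughly as the following batch steps. This is an illustrative sketch, not the script's actual contents: all paths are placeholders, and it assumes sparkclr-submit.cmd forwards --conf through to spark-submit.

```bat
rem Rough sketch of what RunSamples.cmd automates; all paths below are placeholders.
rem 1. Point SPARK_HOME at a local Apache Spark 1.4.1 distribution.
set SPARK_HOME=C:\path\to\spark-1.4.1-bin-hadoop2.6
rem 2. Point SPARKCLR_HOME at the run directory produced by Build.cmd.
set SPARKCLR_HOME=C:\path\to\SparkCLR\run
rem 3. Submit the samples, keeping Spark scratch files under SparkCLR\run\Temp
rem    (assumes --conf is passed through to spark-submit).
call %SPARKCLR_HOME%\scripts\sparkclr-submit.cmd --verbose ^
    --conf spark.local.dir=%SPARKCLR_HOME%\Temp ^
    --exe SparkCLRSamples.exe %SPARKCLR_HOME%\samples
```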

Running in Standalone mode

sparkclr-submit.cmd --verbose --master spark://host:port --exe SparkCLRSamples.exe  %SPARKCLR_HOME%\samples sparkclr.sampledata.loc hdfs://path/to/sparkclr/sampledata

Running in YARN mode

sparkclr-submit.cmd --verbose --master yarn-cluster --exe SparkCLRSamples.exe %SPARKCLR_HOME%\samples sparkclr.sampledata.loc hdfs://path/to/sparkclr/sampledata

Running Unit Tests

  • In Visual Studio: "Test" -> "Run" -> "All Tests"

  • In Developer Command Prompt for VS, navigate to SparkCLR\csharp and run the following command:

    Test.cmd
    

Debugging Tips

To debug the SparkCLR Adapter or driver code, launch CSharpBackend and the C# driver separately.

For example, to debug SparkCLR samples:

  • Launch CSharpBackend.exe using sparkclr-submit.cmd debug and note the port number displayed in the console.
  • Navigate to csharp/Samples/Microsoft.Spark.CSharp and edit App.Config, setting CSharpBackendPortNumber to the port number from the previous step and CSharpWorkerPath to the location of CSharpWorker.exe.
  • Run SparkCLRSamples.exe in Visual Studio.
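The App.Config edit in the steps above might look like the following fragment. The key names (CSharpBackendPortNumber, CSharpWorkerPath) come from the sample's configuration; the port number and path shown here are placeholders:

```xml
<configuration>
  <appSettings>
    <!-- Port number printed by CSharpBackend.exe at startup (placeholder value) -->
    <add key="CSharpBackendPortNumber" value="12345" />
    <!-- Full path to CSharpWorker.exe produced by Build.cmd (placeholder path) -->
    <add key="CSharpWorkerPath" value="C:\path\to\SparkCLR\run\samples\CSharpWorker.exe" />
  </appSettings>
</configuration>
```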

License

SparkCLR is licensed under the MIT license. See LICENSE file for full license information.

Contribution

We welcome contributions. To contribute, follow the instructions in CONTRIBUTING.md.