SparkCLR

C# and F# language binding and extensions to Apache Spark

SparkCLR (pronounced Sparkler) adds C# language binding to Apache Spark, enabling the implementation of Spark driver code and data processing operations in C#.

For example, the word count sample in Apache Spark can be implemented in C# as follows:

var lines = sparkContext.TextFile(@"hdfs://path/to/input.txt");
var words = lines.FlatMap(s => s.Split(new[] { " " }, StringSplitOptions.None));
var wordCounts = words.Map(w => new KeyValuePair<string, int>(w.Trim(), 1))
                      .ReduceByKey((x, y) => x + y);
var wordCountCollection = wordCounts.Collect();
wordCounts.SaveAsTextFile(@"hdfs://path/to/wordcount.txt");
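
The sample above assumes a `sparkContext` has already been created in the driver. A minimal sketch of that setup, assuming SparkCLR's `SparkConf` and `SparkContext` types mirror the standard Spark API (the app name here is hypothetical):

```csharp
// Hedged sketch: assumes SparkCLR exposes SparkConf/SparkContext with
// Spark-like constructors; the app name "WordCount" is illustrative only.
var sparkConf = new SparkConf().SetAppName("WordCount");
var sparkContext = new SparkContext(sparkConf);
```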

A simple DataFrame application using TempTable may look like the following:

var reqDataFrame = sqlContext.TextFile(@"hdfs://path/to/requests.csv");
var metricDataFrame = sqlContext.TextFile(@"hdfs://path/to/metrics.csv");
reqDataFrame.RegisterTempTable("requests");
metricDataFrame.RegisterTempTable("metrics");
// C0 - guid in requests DataFrame, C3 - guid in metrics DataFrame
var joinDataFrame = sqlContext.Sql(
    "SELECT joinedtable.datacenter" +
         ", MAX(joinedtable.latency) maxlatency" +
         ", AVG(joinedtable.latency) avglatency " +
    "FROM (" +
       "SELECT a.C1 as datacenter, b.C6 as latency " +
       "FROM requests a JOIN metrics b ON a.C0 = b.C3) joinedtable " +
    "GROUP BY datacenter");
joinDataFrame.ShowSchema();
joinDataFrame.Show();

A simple DataFrame application using DataFrame DSL may look like the following:

// C0 - guid, C1 - datacenter
var reqDataFrame = sqlContext.TextFile(@"hdfs://path/to/requests.csv")
                             .Select("C0", "C1");
// C3 - guid, C6 - latency
var metricDataFrame = sqlContext.TextFile(@"hdfs://path/to/metrics.csv", ",", false, true)
                                .Select("C3", "C6"); // override delimiter, hasHeader & inferSchema
var joinDataFrame = reqDataFrame.Join(metricDataFrame, reqDataFrame["C0"] == metricDataFrame["C3"])
                                .GroupBy("C1");
var maxLatencyByDcDataFrame = joinDataFrame.Agg(new Dictionary<string, string> { { "C6", "max" } });
maxLatencyByDcDataFrame.ShowSchema();
maxLatencyByDcDataFrame.Show();

Refer to the SparkCLR\csharp\Samples directory for complete samples.

Documents

Refer to the docs folder.

Build Status

Builds are verified on Ubuntu 14.04.3 LTS and Windows.

Building, Running and Debugging SparkCLR

Refer to windows-instructions.md and linux-instructions.md for platform-specific instructions. (Note: tested only with Spark 1.5.2.)
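
The repository root contains build and sample-runner scripts for both platforms. A likely invocation sequence (the exact prerequisites, such as a Visual Studio Developer Command Prompt on Windows, are assumptions):

```shell
# Windows (from a Visual Studio Developer Command Prompt)
Build.cmd
RunSamples.cmd

# Linux
./build.sh
./run-samples.sh
```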

License

SparkCLR is licensed under the MIT license. See LICENSE file for full license information.

Contribution


We welcome contributions. To contribute, follow the instructions in CONTRIBUTING.md.