C# and F# language binding and extensions to Apache Spark

SparkCLR

SparkCLR (pronounced Sparkler) adds C# language binding to Apache Spark, enabling the implementation of Spark driver code and data processing operations in C#.

For example, the word count sample in Apache Spark can be implemented in C# as follows:

var lines = sparkContext.TextFile(@"hdfs://path/to/input.txt");  // load the input file as an RDD of lines
var words = lines.FlatMap(s => s.Split(' '));                    // split each line into words
var wordCounts = words.Map(w => new KeyValuePair<string, int>(w.Trim(), 1))
                      .ReduceByKey((x, y) => x + y);             // count occurrences of each word
var wordCountCollection = wordCounts.Collect();                  // bring the counts back to the driver
wordCounts.SaveAsTextFile(@"hdfs://path/to/wordcount.txt");      // or persist the counts to HDFS
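
The snippets in this README assume that a SparkContext named sparkContext is already available in the driver. A minimal driver setup might look like the following sketch; SparkConf and SparkContext mirror their Scala counterparts, and the application name is a placeholder rather than part of the sample above.

var conf = new SparkConf().SetAppName("WordCount"); // placeholder application name
var sparkContext = new SparkContext(conf);          // entry point for the RDD operations above
// ... run RDD operations ...
sparkContext.Stop();                                // release cluster resources when done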

A simple DataFrame application using TempTable may look like the following:

var reqDataFrame = sqlContext.TextFile(@"hdfs://path/to/requests.csv");
var metricDataFrame = sqlContext.TextFile(@"hdfs://path/to/metrics.csv");
reqDataFrame.RegisterTempTable("requests");
metricDataFrame.RegisterTempTable("metrics");
// C0 - guid in requests DataFrame, C3 - guid in metrics DataFrame
var joinDataFrame = sqlContext.Sql(
    "SELECT joinedtable.datacenter" +
        ", MAX(joinedtable.latency) maxlatency" +
        ", AVG(joinedtable.latency) avglatency " +
    "FROM (" +
        "SELECT a.C1 AS datacenter, b.C6 AS latency " +
        "FROM requests a JOIN metrics b ON a.C0 = b.C3) joinedtable " +
    "GROUP BY datacenter");
joinDataFrame.ShowSchema();
joinDataFrame.Show();
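
Because RegisterTempTable makes a DataFrame queryable by name, the result of one query can itself be registered and refined with further SQL. Below is a small sketch using only the APIs shown above; the table name latencybydc and the latency threshold are illustrative:

joinDataFrame.RegisterTempTable("latencybydc");  // illustrative table name
var slowDcDataFrame = sqlContext.Sql(
    "SELECT datacenter, maxlatency FROM latencybydc WHERE maxlatency > 100");  // illustrative threshold
slowDcDataFrame.Show();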

A simple DataFrame application using DataFrame DSL may look like the following:

// C0 - guid, C1 - datacenter
var reqDataFrame = sqlContext.TextFile(@"hdfs://path/to/requests.csv")
                             .Select("C0", "C1");
// C3 - guid, C6 - latency
var metricDataFrame = sqlContext.TextFile(@"hdfs://path/to/metrics.csv", ",", false, true)
                                .Select("C3", "C6"); // override delimiter, hasHeader & inferSchema
var joinDataFrame = reqDataFrame.Join(metricDataFrame, reqDataFrame["C0"] == metricDataFrame["C3"])
                                .GroupBy("C1");
var maxLatencyByDcDataFrame = joinDataFrame.Agg(new Dictionary<string, string> { { "C6", "max" } });
maxLatencyByDcDataFrame.ShowSchema();
maxLatencyByDcDataFrame.Show();
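
As with the RDD word count sample, aggregated DataFrame results can be brought back to the driver. The sketch below assumes DataFrame exposes a Collect method analogous to the RDD Collect used earlier, returning the result rows; treat it as illustrative rather than a confirmed API:

// hedged sketch: assumes DataFrame.Collect() returns the result rows
foreach (var row in maxLatencyByDcDataFrame.Collect())
{
    Console.WriteLine(row); // print each datacenter/max-latency row
}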

Refer to the SparkCLR\csharp\Samples directory and the accompanying sample usage documentation for complete samples.

Documents

Refer to the docs folder.

Build Status

Ubuntu 14.04.3 LTS: Build status (Travis CI)
Windows: Build status (AppVeyor)

Building, Running and Debugging SparkCLR

Build scripts live in the build directory; see the notes directory for Windows and Linux instructions. (Note: tested only with Spark 1.5.2.)

License

SparkCLR is licensed under the MIT license. See the LICENSE file for full license information.

Contribution

We welcome contributions. To contribute, follow the instructions in CONTRIBUTING.md.