fix format in readme, incorporate feedback
This commit is contained in:
Parent: b14977f751
Commit: 431892a33d
Changed file: README.md (109 lines)

SparkCLR (pronounced sparkler) adds a C# language binding to Apache Spark, enabling the implementation of Spark driver code and data processing operations in C#.

For example, the word count sample in Apache Spark can be implemented in C# as follows:

```c#
var lines = sparkContext.TextFile(@"hdfs://path/to/input.txt");
var words = lines.FlatMap(s => s.Split(new[] { " " }, StringSplitOptions.None));
var wordCounts = words.Map(w => new KeyValuePair<string, int>(w.Trim(), 1))
                      .ReduceByKey((x, y) => x + y);
var wordCountCollection = wordCounts.Collect();
wordCounts.SaveAsTextFile(@"hdfs://path/to/wordcount.txt");
```
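
`Collect()` brings the results back to the driver as regular .NET objects. A small follow-up sketch continuing the sample above (not part of the original), printing each word count:

```c#
// Iterate over the collected KeyValuePair<string, int> results on the driver
foreach (var wordCount in wordCountCollection)
{
    Console.WriteLine("{0}: {1}", wordCount.Key, wordCount.Value);
}
```
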
A simple DataFrame application using TempTable may look like the following:

```c#
var requestDataFrame = sqlContext.TextFile(@"hdfs://path/to/requests.csv");
var metricsDataFrame = sqlContext.TextFile(@"hdfs://path/to/metrics.csv");
requestDataFrame.RegisterTempTable("requests");
metricsDataFrame.RegisterTempTable("metrics");
// C0 - guid in requests DF, C3 - guid in metrics DF
var join = sqlContext.Sql(
    "SELECT joinedtable.datacenter" +
    ", MAX(joinedtable.latency) maxlatency" +
    ", AVG(joinedtable.latency) avglatency " +
    "FROM (" +
    "SELECT a.C1 as datacenter, b.C6 as latency " +
    "FROM requests a JOIN metrics b ON a.C0 = b.C3) joinedtable " +
    "GROUP BY datacenter");
join.ShowSchema();
join.Show();
```
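
Besides `Show()`, the joined DataFrame can be consumed programmatically. A hedged continuation of the sample above, assuming `Count()` is exposed on DataFrame as in Spark:

```c#
// Number of aggregated rows, i.e. one per datacenter
var datacenterCount = join.Count();
Console.WriteLine("Number of datacenters: {0}", datacenterCount);
```
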
A simple DataFrame application using the DataFrame DSL may look like the following:

```c#
// C0 - guid, C1 - datacenter
var requestDataFrame = sqlContext.TextFile(@"hdfs://path/to/requests.csv")
    .Select("C0", "C1");
// C3 - guid, C6 - latency
var metricsDataFrame = sqlContext.TextFile(@"hdfs://path/to/metrics.csv", ",", false, true)
    .Select("C3", "C6"); // override default delimiter, hasHeader & inferSchema
var joinDataFrame = requestDataFrame.Join(metricsDataFrame, requestDataFrame["C0"] == metricsDataFrame["C3"])
    .GroupBy("C1");
var maxLatencyByDcDataFrame = joinDataFrame.Agg(new Dictionary<string, string> { { "C6", "max" } });
maxLatencyByDcDataFrame.ShowSchema();
maxLatencyByDcDataFrame.Show();
```
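
The dictionary argument to `Agg` maps a column name to the aggregate function to apply. A hedged variation of the sample above, assuming the usual Spark aggregate names are accepted:

```c#
// Average latency per datacenter instead of the maximum
var avgLatencyByDcDataFrame = joinDataFrame.Agg(new Dictionary<string, string> { { "C6", "avg" } });
avgLatencyByDcDataFrame.Show();
```
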
Refer to the SparkCLR\csharp\Samples directory for complete samples.
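
The snippets above assume `sparkContext` and `sqlContext` are already initialized. A minimal driver skeleton is sketched below; it follows the standard Spark pattern, but the namespaces and constructor signatures are assumptions to verify against the samples:

```c#
using Microsoft.Spark.CSharp.Core; // assumed namespaces, per SparkCLR conventions
using Microsoft.Spark.CSharp.Sql;

class SampleDriver
{
    static void Main(string[] args)
    {
        // Create the contexts used by the snippets above
        var sparkConf = new SparkConf().SetAppName("SparkCLRSample");
        var sparkContext = new SparkContext(sparkConf);
        var sqlContext = new SqlContext(sparkContext);

        // ... RDD / DataFrame operations go here ...

        sparkContext.Stop();
    }
}
```
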
Refer to the docs @ https://github.com/Microsoft/SparkCLR/tree/master/docs

### Instructions

* Navigate to the SparkCLR\scala directory and run the following command to build spark-clr*.jar
```Batchfile
mvn package
```
* Start Developer Command Prompt for Visual Studio and navigate to the SparkCLR\csharp directory. Run the following command to add nuget.exe to the path
```Batchfile
set PATH=<fullpath to nuget.exe>;%PATH%
```
and then build the rest of the .NET binaries
```Batchfile
build.cmd
```
* Under the SparkCLR\csharp directory, run the following command to clean the .NET binaries built above
```Batchfile
clean.cmd
```

## Running Samples

### Prerequisites

Set the following environment variables

The directory pointed to by ```SPARKCLR_HOME``` should have the following directories and files

### Running in Local mode

Set ```CSharpWorkerPath``` in SparkCLRSamples.exe.config and run the following command. Note that the SparkCLR jar version (**1.4.1**) should be aligned with the Apache Spark version.

```Batchfile
sparkclr-submit.cmd --verbose D:\SparkCLRHome\lib\spark-clr-1.4.1-SNAPSHOT.jar D:\SparkCLRHome\SparkCLRSamples.exe spark.local.dir D:\temp\SparkCLRTemp sparkclr.sampledata.loc D:\SparkCLRHome\data
```

Setting the ```spark.local.dir``` parameter is optional. It is useful when the local Spark setup uses the %TEMP% directory on Windows, where placing the SparkCLR driver executable may cause problems (antivirus programs might automatically delete executables placed in such directories).

### Running in Standalone cluster mode

```Batchfile
sparkclr-submit.cmd --verbose D:\SparkCLRHome\lib\spark-clr-1.4.1-SNAPSHOT.jar D:\SparkCLRHome\SparkCLRSamples.exe sparkclr.sampledata.loc hdfs://path/to/sparkclr/sampledata
```

### Running in YARN mode

To be added

## Running Unit Tests

* In Visual Studio: "Test" -> "Run" -> "All Tests"
* In Developer Command Prompt for VS, navigate to SparkCLR\csharp and run the following command
```Batchfile
test.cmd
```

## Debugging Tips
CSharpBackend and the C# driver are launched separately when debugging the SparkCLR adapter or the driver code.