fix format in readme, incorporate feedback

Daniel Li 2015-11-03 09:42:48 -08:00
Parent b14977f751
Commit 431892a33d
1 changed file with 55 additions and 54 deletions

README.md (109 changed lines)

@@ -2,41 +2,44 @@
SparkCLR (pronounced sparkler) adds C# language binding to Apache Spark enabling the implementation of Spark driver code and data processing operations in C#.
For example, the word count sample in Apache Spark can be implemented in C# as follows
```c#
var lines = sparkContext.TextFile(@"hdfs://path/to/input.txt");
var words = lines.FlatMap(s => s.Split(new[] { " " }, StringSplitOptions.None));
var wordCounts = words.Map(w => new KeyValuePair<string, int>(w.Trim(), 1))
                      .ReduceByKey((x, y) => x + y);
var wordCountCollection = wordCounts.Collect();
wordCounts.SaveAsTextFile(@"hdfs://path/to/wordcount.txt");
```
A simple DataFrame application using TempTable may look like the following
```c#
-var requestsDataFrame = sqlContext.TextFile(@"hdfs://path/to/requests.csv");
-var metricsDateFrame = sqlContext.TextFile(@"hdfs://path/to/metrics.csv");
-requestsDataFrame.RegisterTempTable("requests");
-metricsDateFrame.RegisterTempTable("metrics");
-// C0 - guid in requests DF, C3 - guid in metrics DF
-var join = GetSqlContext().Sql(
-    "SELECT joinedtable.datacenter, MAX(joinedtable.latency) maxlatency, AVG(joinedtable.latency) avglatency " +
-    "FROM (SELECT a.C1 as datacenter, b.C6 as latency " +
-    "FROM requests a JOIN metrics b ON a.C0 = b.C3) joinedtable " +
-    "GROUP BY datacenter");
-join.ShowSchema();
-join.Show();
+var requestDataFrame = sqlContext.TextFile(@"hdfs://path/to/requests.csv");
+var metricsDateFrame = sqlContext.TextFile(@"hdfs://path/to/metrics.csv");
+requestDataFrame.RegisterTempTable("requests");
+metricsDateFrame.RegisterTempTable("metrics");
+// C0 - guid in requests DF, C3 - guid in metrics DF
+var join = GetSqlContext().Sql(
+    "SELECT joinedtable.datacenter" +
+    ", MAX(joinedtable.latency) maxlatency" +
+    ", AVG(joinedtable.latency) avglatency " +
+    "FROM (" +
+    "SELECT a.C1 as datacenter, b.C6 as latency " +
+    "FROM requests a JOIN metrics b ON a.C0 = b.C3) joinedtable " +
+    "GROUP BY datacenter");
+join.ShowSchema();
+join.Show();
```
A simple DataFrame application using DataFrame DSL may look like the following
```c#
-// C0 - guid, C1 - datacenter
-var requestsDataFrame = sqlContext.TextFile(@"hdfs://path/to/requests.csv")
-    .Select("C0", "C1");
-// C3 - guid, C6 - latency
-var metricsDateFrame = sqlContext.TextFile(@"hdfs://path/to/metrics.csv", ",", false, true)
-    .Select("C3", "C6"); //override delimiter, hasHeader & inferSchema
-var joinDataFrame = requestsDataFrame.Join(metricsDateFrame, requestsDataFrame["C0"] == metricsDateFrame["C3"])
-    .GroupBy("C1");
-var maxLatencyByDcDataFrame = joinDataFrame.Agg(new Dictionary<string, string> { { "C6", "max" } });
-maxLatencyByDcDataFrame.ShowSchema();
-maxLatencyByDcDataFrame.Show();
+// C0 - guid, C1 - datacenter
+var requestDataFrame = sqlContext.TextFile(@"hdfs://path/to/requests.csv")
+    .Select("C0", "C1");
+// C3 - guid, C6 - latency
+var metricsDateFrame = sqlContext.TextFile(@"hdfs://path/to/metrics.csv", ",", false, true)
+    .Select("C3", "C6"); //override delimiter, hasHeader & inferSchema
+var joinDataFrame = requestDataFrame.Join(metricsDateFrame, requestDataFrame["C0"] == metricsDateFrame["C3"])
+    .GroupBy("C1");
+var maxLatencyByDcDataFrame = joinDataFrame.Agg(new Dictionary<string, string> { { "C6", "max" } });
+maxLatencyByDcDataFrame.ShowSchema();
+maxLatencyByDcDataFrame.Show();
```
Refer to the SparkCLR\csharp\Samples directory for complete samples.
@@ -52,18 +55,22 @@ Refer to the docs @ https://github.com/Microsoft/SparkCLR/tree/master/docs
### Instructions
* Navigate to SparkCLR\scala directory and run the following command to build spark-clr*.jar
-```Batchfile
-mvn package
-```
-* Start Developer Command Prompt for Visual Studio, navigate to SparkCLR\csharp directory, run the following commands to add nuget.exe to the path and build the rest of .Net binaries
-```Batchfile
-set PATH=<fullpath to nuget.exe>;%PATH%
-build.cmd
-```
-* Under SparkCLR|csharp directory, run the following command to clean the .NET binaries built above
-```Batchfile
-clean.cmd
-```
+```
+mvn package
+```
+* Start Developer Command Prompt for Visual Studio, navigate to SparkCLR\csharp directory, run the following commands to add nuget.exe to the path
+```
+set PATH=<fullpath to nuget.exe>;%PATH%
+```
+And build the rest of .Net binaries
+```
+build.cmd
+```
+* Under SparkCLR\csharp directory, run the following command to clean the .NET binaries built above
+```
+clean.cmd
+```
## Running Samples
### Prerequisites
Set the following environment variables
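The variable list itself falls outside this hunk. As a minimal illustrative sketch, assuming the sample layout used later in this README (the value is an example, not prescribed by this commit; ```SPARKCLR_HOME``` is named in the hunk below):
```
REM Example value only; point SPARKCLR_HOME at your SparkCLR install directory
set SPARKCLR_HOME=D:\SparkCLRHome
```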
@@ -81,21 +88,15 @@ Directory pointed by ```SPARKCLR_HOME``` should have the following directories a
### Running in Local mode
Set ```CSharpWorkerPath``` in SparkCLRSamples.exe.config and run the following. Note that the SparkCLR jar version (**1.4.1**) should match the Apache Spark version.
-```Batchfile
-sparkclr-submit.cmd --verbose D:\SparkCLRHome\lib\spark-clr-1.4.1-SNAPSHOT.jar^
-D:\SparkCLRHome\SparkCLRSamples.exe^
-spark.local.dir D:\temp\SparkCLRTemp^
-sparkclr.sampledata.loc D:\SparkCLRHome\data
-```
+```
+sparkclr-submit.cmd --verbose D:\SparkCLRHome\lib\spark-clr-1.4.1-SNAPSHOT.jar D:\SparkCLRHome\SparkCLRSamples.exe spark.local.dir D:\temp\SparkCLRTemp sparkclr.sampledata.loc D:\SparkCLRHome\data
+```
Setting the ```spark.local.dir``` parameter is optional. It is useful when the local Spark setup uses the Windows %TEMP% directory, where placing the SparkCLR driver exe can cause problems (antivirus programs may automatically delete executables in such directories).
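Since the parameter is optional, the same sample can be submitted without it; a sketch assuming the same paths as the command above:
```
REM Without the optional spark.local.dir pair, Spark falls back to its default local directory
sparkclr-submit.cmd --verbose D:\SparkCLRHome\lib\spark-clr-1.4.1-SNAPSHOT.jar D:\SparkCLRHome\SparkCLRSamples.exe sparkclr.sampledata.loc D:\SparkCLRHome\data
```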
### Running in Standalone cluster mode
-```Batchfile
-sparkclr-submit.cmd --verbose D:\SparkCLRHome\lib\spark-clr-1.4.1-SNAPSHOT.jar^
-D:\SparkCLRHome\SparkCLRSamples.exe^
-sparkclr.sampledata.loc hdfs://path/to/sparkclr/sampledata
-```
+```
+sparkclr-submit.cmd --verbose D:\SparkCLRHome\lib\spark-clr-1.4.1-SNAPSHOT.jar D:\SparkCLRHome\SparkCLRSamples.exe sparkclr.sampledata.loc hdfs://path/to/sparkclr/sampledata
+```
### Running in YARN mode
@@ -104,9 +105,9 @@ To be added
## Running Unit Tests
* In Visual Studio: "Test" -> "Run" -> "All Tests"
* In Developer Command Prompt for VS, navigate to SparkCLR\csharp and run the following command
-```Batchfile
-test.cmd
-```
+```
+test.cmd
+```
## Debugging Tips
CSharpBackend and the C# driver are launched separately when debugging the SparkCLR Adapter or the driver