Smooth out some formatting wrinkles in README.

Author: jthelin
Date: 2015-11-05 23:58:18 -08:00
Parent: ca09c3ae7a
Commit: c5acaa99da
1 changed file with 87 additions and 46 deletions

README.md

# SparkCLR
[SparkCLR](https://github.com/Microsoft/SparkCLR) (pronounced Sparkler) adds C# language binding to [Apache Spark](https://spark.apache.org/), enabling the implementation of Spark driver code and data processing operations in C#.
For example, the word count sample in Apache Spark can be implemented in C# as follows:
```c#
var lines = sparkContext.TextFile(@"hdfs://path/to/input.txt");
var words = lines.FlatMap(s => s.Split(new[] { " " }, StringSplitOptions.None));
var wordCounts = words.Map(w => new KeyValuePair<string, int>(w.Trim(), 1))
                      .ReduceByKey((x, y) => x + y);
var wordCountCollection = wordCounts.Collect();
wordCounts.SaveAsTextFile(@"hdfs://path/to/wordcount.txt");
```
A simple DataFrame application using TempTable may look like the following:
```c#
var reqDataFrame = sqlContext.TextFile(@"hdfs://path/to/requests.csv");
var metricDataFrame = sqlContext.TextFile(@"hdfs://path/to/metrics.csv");
// ...
var joinDataFrame = GetSqlContext().Sql(
    /* ... */);
joinDataFrame.ShowSchema();
joinDataFrame.Show();
```
A simple DataFrame application using DataFrame DSL may look like the following:
```c#
// C0 - guid, C1 - datacenter
var reqDataFrame = sqlContext.TextFile(@"hdfs://path/to/requests.csv")
// ...
var maxLatencyByDcDataFrame = joinDataFrame.Agg(new Dictionary<string, string> {
    /* ... */ });
maxLatencyByDcDataFrame.ShowSchema();
maxLatencyByDcDataFrame.Show();
```
Refer to the `SparkCLR\csharp\Samples` directory for complete samples.
## Documents
Refer to the [docs folder](https://github.com/Microsoft/SparkCLR/tree/master/docs).
## Building SparkCLR
### Prerequisites
* [Apache Maven](http://maven.apache.org) for building the spark-clr project, which is implemented in Scala.
* MSBuild in [Visual Studio](https://www.visualstudio.com/) 2013 and above.
* .NET Framework 4.5 and above.
* [Nuget command-line utility](https://docs.nuget.org/release-notes) 3.2 and above.
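
As an optional sanity check, the prerequisites above can be verified from a Developer Command Prompt. The commands below only print version or help information and assume the tools are already on your PATH; this is a convenience sketch, not part of the official build steps.
```
REM Optional: confirm the build prerequisites are available
mvn -version
msbuild /version
nuget help
```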
### Instructions
* Navigate to `SparkCLR\scala` directory and run the following command to build spark-clr*.jar
```
mvn package
```
* Start Developer Command Prompt for Visual Studio, and navigate to `SparkCLR\csharp` directory.
- If `nuget.exe` is not already in your PATH, then run the following commands to add it.
```
set PATH=<fullpath to nuget.exe>;%PATH%
```
- Then build the rest of the .NET binaries
```
Build.cmd
```
* Optional: Under `SparkCLR\csharp` directory, run the following command to clean the .NET binaries built above
```
Clean.cmd
```
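
Putting the steps above together, an end-to-end build from a Developer Command Prompt might look like the sketch below. The repository location (`C:\SparkCLR`) and the `nuget.exe` folder (`C:\Tools\NuGet`) are illustrative assumptions; substitute your own paths.
```
REM Illustrative end-to-end build; adjust the paths to your machine
cd /d C:\SparkCLR\scala
mvn package

cd /d C:\SparkCLR\csharp
set PATH=C:\Tools\NuGet;%PATH%
Build.cmd
```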
## Running Samples
### Prerequisites
The DataFrame TextFile API uses the `spark-csv` package to load data from CSV files.
The latest [commons-csv-*.jar](http://commons.apache.org/proper/commons-csv/download_csv.cgi) and [spark-csv*.jar (Scala version: 2.10)](http://spark-packages.org/package/databricks/spark-csv) should be downloaded manually.
The following environment variables should be set properly (a consolidated example is shown after this list):
* ```JAVA_HOME```
* ```SCALA_HOME```
* ```SPARKCSV_JARS``` should include full paths to `commons-csv*.jar` and `spark-csv*.jar`.
For example:
```
set SPARKCSV_JARS=%SPARKCLR_HOME%\lib\commons-csv-1.2.jar;%SPARKCLR_HOME%\lib\spark-csv_2.10-1.2.0.jar
```
* ```SPARKCLR_HOME``` should point to a directory prepared with the following sub-directories:
* **lib** ( `spark-clr*.jar` )
* **bin** ( The contents of `SparkCLR\csharp\Samples\Microsoft.Spark.CSharp\bin\[Debug|Release]\*`, including `Microsoft.Spark.CSharp.Adapter.dll`, `CSharpWorker.exe`, `SparkCLRSamples.exe`, `SparkCLRSamples.exe.Config` etc. )
* **scripts** ( `sparkclr-submit.cmd` )
* **data** ( `SparkCLR\csharp\Samples\Microsoft.Spark.CSharp\data\*` )
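
For example, a consolidated environment setup might look like the following. Every path below is an illustrative assumption; point the variables at your own JDK, Scala installation and SparkCLR runtime directory.
```
REM Illustrative values; adjust the paths to your machine
set JAVA_HOME=C:\Program Files\Java\jdk1.8.0_65
set SCALA_HOME=C:\Program Files (x86)\scala
set SPARKCLR_HOME=C:\SparkCLR\runtime
set SPARKCSV_JARS=%SPARKCLR_HOME%\lib\commons-csv-1.2.jar;%SPARKCLR_HOME%\lib\spark-csv_2.10-1.2.0.jar
```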
### Running in Local mode
Set `CSharpWorkerPath` in `SparkCLRSamples.exe.config` and run the following command:
```
sparkclr-submit.cmd --verbose %SPARKCLR_HOME%\lib\spark-clr-1.4.1-SNAPSHOT.jar %SPARKCLR_HOME%\bin\SparkCLRSamples.exe spark.local.dir C:\temp\SparkCLRTemp sparkclr.sampledata.loc %SPARKCLR_HOME%\data
```
Note that the SparkCLR jar version (**1.4.1**) should be aligned with the Apache Spark version.
Setting the `spark.local.dir` parameter is important. When a local Spark instance distributes the SparkCLR driver executables to the Windows `%TEMP%` directory, anti-virus software may flag the executables that show up in `%TEMP%` as malware.
### Running in Standalone cluster mode
```
sparkclr-submit.cmd --verbose %SPARKCLR_HOME%\lib\spark-clr-1.4.1-SNAPSHOT.jar ...
```
### Running in YARN mode
To be added
## Running Unit Tests
* In Visual Studio: "Test" -> "Run" -> "All Tests"
* In Developer Command Prompt for VS, navigate to `SparkCLR\csharp` and run the following command:
```
Test.cmd
```
## Debugging Tips
The CSharpBackend and the C# driver are launched separately when debugging the SparkCLR Adapter or driver code.
For example, to debug SparkCLR samples:
* Launch CSharpBackend.exe using ```sparkclr-submit.cmd debug``` and note the port number displayed in the console.
* Navigate to `csharp/Samples/Microsoft.Spark.CSharp` and edit `App.Config` to use the port number from the previous step for the `CSharpBackendPortNumber` config value, and also set the `CSharpWorkerPath` config value.
* Run `SparkCLRSamples.exe` in Visual Studio.
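
As a sketch of this workflow (the port number below is illustrative, not actual output):
```
REM 1. Start CSharpBackend in debug mode and note the port number it prints
sparkclr-submit.cmd debug

REM 2. In csharp\Samples\Microsoft.Spark.CSharp\App.Config, set CSharpBackendPortNumber
REM    to the port printed above (e.g. 5567 - illustrative) and CSharpWorkerPath to the
REM    full path of CSharpWorker.exe.

REM 3. Start SparkCLRSamples.exe from Visual Studio to debug the driver code.
```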
## License
SparkCLR is licensed under the MIT license. See LICENSE file in the project root for full license information.