Smooth out some formatting wrinkles in README.
This commit is contained in:
Parent: ca09c3ae7a
Commit: c5acaa99da
Changed file: README.md (133 lines)
# SparkCLR

[SparkCLR](https://github.com/Microsoft/SparkCLR) (pronounced Sparkler) adds C# language binding to [Apache Spark](https://spark.apache.org/), enabling the implementation of Spark driver code and data processing operations in C#.

For example, the word count sample in Apache Spark can be implemented in C# as follows:

```c#
var lines = sparkContext.TextFile(@"hdfs://path/to/input.txt");
var words = lines.FlatMap(s => s.Split(new[] { " " }, StringSplitOptions.None));
var wordCounts = words.Map(w => new KeyValuePair<string, int>(w.Trim(), 1))
                      .ReduceByKey((x, y) => x + y);
var wordCountCollection = wordCounts.Collect();
wordCounts.SaveAsTextFile(@"hdfs://path/to/wordcount.txt");
```

A simple DataFrame application using TempTable may look like the following:

```c#
var reqDataFrame = sqlContext.TextFile(@"hdfs://path/to/requests.csv");
var metricDataFrame = sqlContext.TextFile(@"hdfs://path/to/metrics.csv");
reqDataFrame.RegisterTempTable("requests");   // temp table names are illustrative
metricDataFrame.RegisterTempTable("metrics");
var joinDataFrame = GetSqlContext().Sql(
    /* ... join query over the temp tables ... */);
joinDataFrame.ShowSchema();
joinDataFrame.Show();
```

A simple DataFrame application using DataFrame DSL may look like the following:

```c#
// C0 - guid, C1 - datacenter
var reqDataFrame = sqlContext.TextFile(@"hdfs://path/to/requests.csv")
                             /* ... */;
// ... (intermediate DataFrame DSL steps elided) ...
var maxLatencyByDcDataFrame = joinDataFrame.Agg(new Dictionary<string, string> {
    /* ... */ });
maxLatencyByDcDataFrame.ShowSchema();
maxLatencyByDcDataFrame.Show();
```

Refer to the `SparkCLR\csharp\Samples` directory for complete samples.

## Documents

Refer to the [docs folder](https://github.com/Microsoft/SparkCLR/tree/master/docs).

## Building SparkCLR
### Prerequisites
* [Apache Maven](http://maven.apache.org) for the spark-clr project implemented in Scala.
* MSBuild in [Visual Studio](https://www.visualstudio.com/) 2013 and above.
* .NET Framework 4.5 and above.
* [NuGet command-line utility](https://docs.nuget.org/release-notes) 3.2 and above.
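Before building, it can help to confirm each tool is reachable from a Developer Command Prompt; the following commands print version or usage information (exact output depends on the installed versions):

```
mvn -version
msbuild /version
nuget
```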
### Instructions
* Navigate to the `SparkCLR\scala` directory and run the following command to build `spark-clr*.jar`:

```
mvn package
```
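Maven places the built artifact in the module's `target` directory by default, so a successful build can be confirmed by listing the jar (the path assumes Maven's standard layout):

```
dir SparkCLR\scala\target\spark-clr*.jar
```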
* Start Developer Command Prompt for Visual Studio, and navigate to the `SparkCLR\csharp` directory.

- If `nuget.exe` is not already in your PATH, run the following command to add it:

```
set PATH=<fullpath to nuget.exe>;%PATH%
```

- Then build the rest of the .NET binaries:

```
Build.cmd
```

* Optional: Under the `SparkCLR\csharp` directory, run the following command to clean the .NET binaries built above:

```
Clean.cmd
```
## Running Samples
### Prerequisites
The DataFrame TextFile API uses the `spark-csv` package to load data from CSV files.
The latest [commons-csv-*.jar](http://commons.apache.org/proper/commons-csv/download_csv.cgi) and [spark-csv*.jar (Scala version: 2.10)](http://spark-packages.org/package/databricks/spark-csv) should be downloaded manually.
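Both jars are also published to Maven Central, so one possible way to fetch them from a command prompt is with PowerShell; the versions below match the `SPARKCSV_JARS` example later in this README:

```
powershell -Command "Invoke-WebRequest -OutFile commons-csv-1.2.jar https://repo1.maven.org/maven2/org/apache/commons/commons-csv/1.2/commons-csv-1.2.jar"
powershell -Command "Invoke-WebRequest -OutFile spark-csv_2.10-1.2.0.jar https://repo1.maven.org/maven2/com/databricks/spark-csv_2.10/1.2.0/spark-csv_2.10-1.2.0.jar"
```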
The following environment variables should be set properly:
* ```JAVA_HOME```
* ```SCALA_HOME```
* ```SPARKCSV_JARS``` should include full paths to `commons-csv*.jar` and `spark-csv*.jar`. For example:

```
set SPARKCSV_JARS=%SPARKCLR_HOME%\lib\commons-csv-1.2.jar;%SPARKCLR_HOME%\lib\spark-csv_2.10-1.2.0.jar
```
* ```SPARKCLR_HOME``` should point to a directory prepared with the following sub-directories:
  * **lib** (`spark-clr*.jar`)
  * **bin** (the contents of `SparkCLR\csharp\Samples\Microsoft.Spark.CSharp\bin\[Debug|Release]\*`, including `Microsoft.Spark.CSharp.Adapter.dll`, `CSharpWorker.exe`, `SparkCLRSamples.exe`, `SparkCLRSamples.exe.Config`, etc.)
  * **scripts** (`sparkclr-submit.cmd`)
  * **data** (`SparkCLR\csharp\Samples\Microsoft.Spark.CSharp\data\*`)
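One way to assemble this layout is the short batch sketch below; the destination path, the `Release` build flavor, and the in-repo location of `sparkclr-submit.cmd` are assumptions to adjust for your checkout:

```
:: illustrative layout preparation -- run from the repository root
set SPARKCLR_HOME=C:\SparkCLRHome
mkdir %SPARKCLR_HOME%\lib %SPARKCLR_HOME%\bin %SPARKCLR_HOME%\scripts %SPARKCLR_HOME%\data
copy SparkCLR\scala\target\spark-clr*.jar %SPARKCLR_HOME%\lib
xcopy /s /y SparkCLR\csharp\Samples\Microsoft.Spark.CSharp\bin\Release\* %SPARKCLR_HOME%\bin
copy SparkCLR\scripts\sparkclr-submit.cmd %SPARKCLR_HOME%\scripts
xcopy /s /y SparkCLR\csharp\Samples\Microsoft.Spark.CSharp\data\* %SPARKCLR_HOME%\data
```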
### Running in Local mode
Set `CSharpWorkerPath` in `SparkCLRSamples.exe.config` and run the following command:

```
sparkclr-submit.cmd --verbose %SPARKCLR_HOME%\lib\spark-clr-1.4.1-SNAPSHOT.jar %SPARKCLR_HOME%\bin\SparkCLRSamples.exe spark.local.dir C:\temp\SparkCLRTemp sparkclr.sampledata.loc %SPARKCLR_HOME%\data
```
Note that the SparkCLR jar version (**1.4.1**) should be aligned with the Apache Spark version.
Setting the `spark.local.dir` parameter is important: when the local Spark instance distributes SparkCLR driver executables to the Windows `%TEMP%` directory, anti-virus software may flag the executables that show up in `%TEMP%` as malware.
### Running in Standalone cluster mode

```
sparkclr-submit.cmd --verbose %SPARKCLR_HOME%\lib\spark-clr-1.4.1-SNAPSHOT.jar ...
```
### Running in YARN mode
To be added.
## Running Unit Tests
* In Visual Studio: "Test" -> "Run" -> "All Tests"
* In Developer Command Prompt for VS, navigate to `SparkCLR\csharp` and run the following command:
```
Test.cmd
```
## Debugging Tips
CSharpBackend and the C# driver are launched separately when debugging the SparkCLR Adapter or driver.
For example, to debug SparkCLR samples:
* Launch CSharpBackend.exe using ```sparkclr-submit.cmd debug``` and note the port number displayed in the console.
* Navigate to `csharp/Samples/Microsoft.Spark.CSharp` and edit `App.Config`: set `CSharpBackendPortNumber` to the port number from the previous step, and also set the `CSharpWorkerPath` value.
* Run `SparkCLRSamples.exe` in Visual Studio.
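For the `App.Config` edit in the second step above, the relevant entries are standard .NET `appSettings` keys; the values below are placeholders, not defaults:

```
<appSettings>
    <add key="CSharpBackendPortNumber" value="port-number-from-previous-step"/>
    <add key="CSharpWorkerPath" value="fullpath-to-CSharpWorker.exe"/>
</appSettings>
```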
## License
SparkCLR is licensed under the MIT license. See the LICENSE file in the project root for full license information.