updated doc and packaging step

This commit is contained in:
Kaarthik Sivashanmugam 2016-04-26 21:37:36 -07:00
Parent 763e4ef02a
Commit 7982c8e4a6
6 changed files: 108 additions and 124 deletions

View file

@ -195,10 +195,21 @@ if not defined ProjectVersion (
)
set SPARKCLR_NAME=spark-clr_2.10-%ProjectVersion%
@echo "%SPARKCLR_HOME%"
@rem copy samples to top-level folder before zipping
@echo move /Y "%SPARKCLR_HOME%\samples" "%CMDHOME%"
move /Y %SPARKCLR_HOME%\samples %CMDHOME%
@echo move /Y "%SPARKCLR_HOME%\data" "%CMDHOME%\samples"
move /Y %SPARKCLR_HOME%\data %CMDHOME%\samples
@rem copy release info
@echo copy /Y "%CMDHOME%\..\notes\mobius-release-info.md"
copy /Y "%CMDHOME%\..\notes\mobius-release-info.md"
@rem Create the zip file
@echo 7z a .\target\%SPARKCLR_NAME%.zip runtime localmode examples
7z a .\target\%SPARKCLR_NAME%.zip runtime localmode examples
@echo 7z a .\target\test.zip runtime examples samples mobius-release-info.md
7z a .\target\test.zip runtime examples samples mobius-release-info.md
:distdone
popd

View file

@ -282,16 +282,34 @@ function Download-BuildTools
function Download-ExternalDependencies
{
# Downloading spark-csv package and its dependency. These packages are required for DataFrame operations in Mobius
$readMeStream = [System.IO.StreamWriter] "$scriptDir\..\dependencies\ReadMe.txt"
$readMeStream.WriteLine("The files in this folder are dependencies of Mobius Project")
$readMeStream.WriteLine("Refer to the following download locations for details on the jars like POM file, license etc.")
$readMeStream.WriteLine("")
$readMeStream.WriteLine("------------ Dependencies for CSV parsing in Mobius DataFrame API -----------------------------")
# Downloading spark-csv package and its dependency. These packages are required for DataFrame operations in Mobius
$url = "http://search.maven.org/remotecontent?filepath=com/databricks/spark-csv_2.10/1.3.0/spark-csv_2.10-1.3.0.jar"
$output="$scriptDir\..\dependencies\spark-csv_2.10-1.3.0.jar"
Download-File $url $output
Write-Output "[downloadtools.Download-ExternalDependencies] Downloading $url to $scriptDir\..\dependencies"
$readMeStream.WriteLine("$url")
$url = "http://search.maven.org/remotecontent?filepath=org/apache/commons/commons-csv/1.1/commons-csv-1.1.jar"
$output="$scriptDir\..\dependencies\commons-csv-1.1.jar"
Download-File $url $output
Write-Output "[downloadtools.Download-ExternalDependencies] Downloading $url to $scriptDir\..\dependencies"
$readMeStream.WriteLine("$url")
$readMeStream.WriteLine("")
$readMeStream.WriteLine("------------ Dependencies for Kafka-based processing in Mobius Streaming API -----------------------------")
$url = "http://search.maven.org/remotecontent?filepath=org/apache/spark/spark-streaming-kafka-assembly_2.10/1.6.1/spark-streaming-kafka-assembly_2.10-1.6.1.jar"
$output="$scriptDir\..\dependencies\spark-streaming-kafka-assembly_2.10-1.6.1.jar"
Download-File $url $output
Write-Output "[downloadtools.Download-ExternalDependencies] Downloading $url to $scriptDir\..\dependencies"
$readMeStream.WriteLine("$url")
$readMeStream.close()
return
}

View file

@ -13,86 +13,9 @@ The following environment variables should be set properly:
* `JAVA_HOME`
## Instructions
* With `JAVA_HOME` set properly, navigate to [Mobius/build](../build) directory:
```
./build.sh
```
* Optional:
- Under [Mobius/scala](../scala) directory, run the following command to clean spark-clr*.jar built above:
```
mvn clean
```
- Under [Mobius/csharp](../csharp) directory, run the following command to clean the .NET binaries built above:
```
./clean.sh
```
[build.sh](../build/build.sh) prepares the following directories under `Mobius\build\runtime` after the build is done:
* **lib** ( `spark-clr*.jar` )
* **bin** ( `Microsoft.Spark.CSharp.Adapter.dll`, `CSharpWorker.exe`)
* **samples** ( The contents of `Mobius/csharp/Samples/Microsoft.Spark.CSharp/bin/Release/*`, including `Microsoft.Spark.CSharp.Adapter.dll`, `CSharpWorker.exe`, `SparkCLRSamples.exe`, `SparkCLRSamples.exe.Config` etc. )
* **scripts** ( `sparkclr-submit.sh` )
* **data** ( `Mobius/csharp/Samples/Microsoft.Spark.CSharp/data/*` )
# Running Samples
## Prerequisites
JDK is installed, and the following environment variables should be set properly:
* `JAVA_HOME`
## Running in Local mode
With `JAVA_HOME` set properly, navigate to [Mobius\build\localmode](../build/localmode) directory:
```
./run-samples.sh
```
It is **required** to run [build.sh](../build/build.sh) prior to running [run-samples.sh](../build/localmode/run-samples.sh).
[run-samples.sh](../build/localmode/run-samples.sh) downloads the version of Apache Spark referenced in the current branch, sets up `SPARK_HOME` environment variable, points `SPARKCLR_HOME` to `Mobius/build/runtime` directory created by [build.sh](../build/build.sh), and invokes [sparkclr-submit.sh](../scripts/sparkclr-submit.sh), with `spark.local.dir` set to `Mobius/build/runtime/Temp`.
A few more [run-samples.sh](../build/localmode/run-samples.sh) examples:
- To display all options supported by [run-samples.sh](../build/localmode/run-samples.sh):
```
run-samples.sh --help
```
- To run PiSample only:
```
run-samples.sh --torun pi*
```
- To run PiSample in verbose mode, with all logs displayed at console:
```
run-samples.sh --torun pi* --verbose
```
## Running in Standalone mode
```
sparkclr-submit.sh --verbose --master spark://host:port --exe SparkCLRSamples.exe $SPARKCLR_HOME/samples sparkclr.sampledata.loc hdfs://path/to/sparkclr/sampledata
```
- When option `--deploy-mode` is specified with `cluster`, option `--remote-sparkclr-jar` is required and needs to be specified with a valid file path of spark-clr*.jar on HDFS.
## Running in YARN mode
```
sparkclr-submit.sh --verbose --master yarn-cluster --exe SparkCLRSamples.exe $SPARKCLR_HOME/samples sparkclr.sampledata.loc hdfs://path/to/sparkclr/sampledata
```
Same as [instructions for Windows](windows-instructions.md#instructions) but use the following script files instead of .cmd files:
* build.sh
* clean.sh
# Running Unit Tests
@ -101,3 +24,9 @@ sparkclr-submit.sh --verbose --master yarn-cluster --exe SparkCLRSamples.exe $SP
./test.sh
```
# Running Samples
Same as [instructions for Windows](windows-instructions.md#running-samples) but using the following scripts instead of .cmd files:
* run-samples.sh
* sparkclr-submit.sh
Note that paths to files and syntax of the environment variables (like $SPARKCLR_HOME) will need to be updated for Linux when following the instructions for Windows.
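As a rough illustration of that note, the Windows local-mode sample command might translate to Linux along the following lines. This is only a sketch: `/opt/MobiusRelease` is a hypothetical unzip location, it assumes `sparkclr-submit.sh` accepts the same options as `sparkclr-submit.cmd`, and the command is printed (dry run) rather than executed.

```shell
# Dry-run sketch: print the Linux form of the Windows local-mode sample command.
# MOBIUS_RELEASE is a hypothetical unzip location, not a path defined by Mobius.
MOBIUS_RELEASE="/opt/MobiusRelease"
DEPS="$MOBIUS_RELEASE/dependencies"

# Windows: c:\MobiusRelease\dependencies\...  ->  Linux: forward slashes, no drive letter
JARS="$DEPS/spark-csv_2.10-1.3.0.jar,$DEPS/commons-csv-1.1.jar"

# Assumption: sparkclr-submit.sh takes the same options as sparkclr-submit.cmd.
SUBMIT_CMD="sparkclr-submit.sh --verbose --jars $JARS --exe SparkCLRSamples.exe $MOBIUS_RELEASE/samples sparkclr.sampledata.loc $MOBIUS_RELEASE/samples/data"

# Print the command instead of running it, so the sketch works without Mobius installed.
echo "$SUBMIT_CMD"
```

Environment variables follow the same pattern: `%SPARKCLR_HOME%\samples` on Windows becomes `$SPARKCLR_HOME/samples` on Linux.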

View file

@ -2,27 +2,31 @@
The [release in GitHub](https://github.com/Microsoft/Mobius/releases) is a zip file. When you unzip that file, you will see a directory layout as follows:
````
|-- examples
|-- Example Mobius applications
|-- localmode
|-- Scripts for running samples and examples in local mode
|-- mobius-release-info.md
|-- runtime
|-- bin
|-- .NET binaries and its dependencies used by Mobius applications
|-- data
|-- Data files used by the [samples](..\csharp\Samples\Microsoft.Spark.CSharp)
|-- examples
|-- C# Spark driver [examples](..\examples) implemented using Mobius
|-- dependencies
|-- jar files Mobius depends on for functionality like CSV parsing, Kafka message processing etc.
|-- lib
|-- Mobius jar file
|-- samples
|-- C# Spark driver [samples](..\csharp\Samples\Microsoft.Spark.CSharp) for Mobius API
|-- scripts
|-- Mobius job submission scripts
|-- examples
|-- Example Mobius applications
|-- samples
|-- C# Spark driver samples for Mobius API
|-- data
|-- Data files used by the samples
````
You can run all the samples locally by invoking `localmode\RunSamples.cmd`. The script automatically downloads the Apache Spark distribution and runs the samples on your local machine. Note: the Apache Spark distribution is a download of more than 200 MB; `RunSamples.cmd` downloads it only once.
[Mobius examples](..\examples) may have external dependencies and may need configuration settings to those dependencies before they can be run.
Instructions on running a Mobius app are available at https://github.com/skaarthik/Mobius/blob/master/notes/running-mobius-app.md
Mobius samples do not have any external dependencies. The dependent jar files and data files used by samples are included in the release. Instructions to run samples are available at
* https://github.com/skaarthik/Mobius/blob/master/notes/windows-instructions.md#running-samples for Windows
* https://github.com/skaarthik/Mobius/blob/master/notes/linx-instructions.md#running-samples for Linux
Mobius examples under "examples" folder may have external dependencies and may need configuration settings to those dependencies before they can be run. Refer to [Running Examples](https://github.com/Microsoft/Mobius/blob/master/notes/running-mobius-app.md#running-mobius-examples-in-local-mode) for details on how to run each example.
# NuGet Package
The packages published to [NuGet](https://www.nuget.org/packages/Microsoft.SparkCLR/) are primarily for reference when building Mobius applications. If Visual Studio is used for development, the reference to the NuGet package goes in the packages.config file.

View file

@ -8,6 +8,18 @@ The following software need to be installed and appropriate environment variable
|winutils.exe | see [Running Hadoop on Windows](https://wiki.apache.org/hadoop/WindowsProblems) for details |HADOOP_HOME |Spark in Windows needs this utility in `%HADOOP_HOME%\bin` directory. It can be copied over from any Hadoop distribution. Alternatively, if you used [`RunSamples.cmd`](../csharp/Samples/Microsoft.Spark.CSharp/samplesusage.md) to run Mobius samples, you can find the `tools\winutils` directory (under the [`build`](../build) directory) that can be used as HADOOP_HOME |
|Mobius |[v1.5.200](https://github.com/Microsoft/Mobius/releases) or v1.6.100-PREVIEW-1 | SPARKCLR_HOME |If you downloaded a [Mobius release](https://github.com/Microsoft/Mobius/releases), SPARKCLR_HOME should be set to the directory named `runtime` (for example, `D:\downloads\spark-clr_2.10-1.5.200\runtime`). Alternatively, if you used [`RunSamples.cmd`](../csharp/Samples/Microsoft.Spark.CSharp/samplesusage.md) to run Mobius samples, you can find `runtime` directory (under [`build`](../build) directory) that can be used as SPARKCLR_HOME. **Note** - setting SPARKCLR_HOME is _optional_ and it is set by sparkclr-submit.cmd if not set. |
## Dependencies
Some features in Mobius depend on classes outside of Spark and Mobius. A selected set of jar files that Mobius depends on is available in the Mobius release under the "runtime\dependencies" folder. These jar files are passed to Mobius (that is, sparkclr-submit.cmd) with the "--jars" parameter, and Mobius passes them on to Spark (spark-submit.cmd).
The following table lists the Mobius features and their dependencies. The version numbers in the jar file names below are included only for completeness; a different version of a jar file may also work with Mobius.
|Mobius Feature | Dependencies |
|----|-----|
|Using CSV files with DataFrame API | <ul><li>spark-csv_2.10-1.3.0.jar</li><li>commons-csv-1.1.jar</li></ul> |
|Kafka messages processing with DStream API | spark-streaming-kafka-assembly_2.10-1.6.1.jar |
Note that additional external jar files may need to be specified as dependencies for a Mobius application, depending on the Mobius features used (like EventHubs event processing or using Hive). These jars are not included in the Mobius release under the "dependencies" folder.
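Since the "--jars" parameter takes a comma-separated list, a small helper can assemble that list from a folder of jar files. The following is a minimal sketch for Linux or another POSIX shell, not something shipped with Mobius: `join_jars` and `DEPS_DIR` are hypothetical names, and the folder is assumed to be the `dependencies` directory under `SPARKCLR_HOME`.

```shell
# Sketch: build the comma-separated value for --jars from a folder of jar files.
# join_jars is a hypothetical helper, not part of Mobius.
join_jars() {
  # Print all *.jar files found in directory $1 as one comma-separated list.
  jars=""
  for jar in "$1"/*.jar; do
    [ -e "$jar" ] || continue      # skip the unexpanded pattern when no jars exist
    jars="${jars:+$jars,}$jar"     # prepend a comma only after the first entry
  done
  printf '%s\n' "$jars"
}

# Assumption: SPARKCLR_HOME points at the "runtime" folder of an unzipped release.
DEPS_DIR="${SPARKCLR_HOME:-.}/dependencies"
echo "--jars $(join_jars "$DEPS_DIR")"
```

Passing every jar in the folder is the simplest choice; an application that uses only one feature can list just the jars named in the table above.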
## Windows Instructions
### Local Mode
To use Mobius with Spark available locally in a machine, navigate to `%SPARKCLR_HOME%\scripts` directory and run the following command
@ -116,7 +128,7 @@ The instructions above cover running Mobius applications in Windows. With the fo
| Type | Examples |
| ------------- |--------------|
| Batch | <ul><li>[Pi](#pi-example-batch)</li><li>[Word Count](#wordcount-example-batch)</li></ul> |
| SQL | <ul><li>[JDBC](#jdbc-example-sql)</li><li>[Spark-XML](#spark-xml-example-sql)</li></ul> |
| SQL | <ul><li>[JDBC](#jdbc-example-sql)</li><li>[Spark-XML](#spark-xml-example-sql)</li><li>[Hive](#hive-example-sql)</li></ul> |
| Streaming | <ul><li>[Kafka](#kafka-example-streaming)</li><li>[EventHubs](#eventhubs-example-streaming)</li><li>[HDFS Word Count](#hdfswordcount-example-streaming)</li></ul> |
The following sample commands show how to run Mobius examples in local mode. Using the instruction above, the following sample commands can be tweaked to run in other modes
@ -142,6 +154,11 @@ The schema and row count of the table name provided as the commandline argument
Displays the number of XML elements in the input XML file provided as the first argument to SparkClrXml.exe and writes the modified XML to the file specified in the second commandline argument.
### Hive Example (Sql)
*
`sparkclr-submit.cmd --jars <jar files used for using Hive in Spark> --exe HiveDataFrame.exe C:\Git\Mobius\examples\Sql\HiveDataFrame\bin\Debug`
Reads data from a csv file, creates a Hive table and reads data from it
### EventHubs Example (Streaming)
* Get the following jar files
* qpid-amqp-1-0-client-0.32.jar

View file

@ -6,7 +6,7 @@
* Developer Command Prompt for [Visual Studio](https://www.visualstudio.com/) 2013 or above, which comes with .NET Framework 4.5 or above. Note: [Visual Studio 2015 Community Edition](https://www.visualstudio.com/en-us/products/visual-studio-community-vs.aspx) is **FREE**.
* 64-bit JDK 7u85 or above; or, 64-bit JDK 8u60 or above. OpenJDK for Windows can be downloaded from [http://www.azul.com/downloads/zulu/zulu-windows/](http://www.azul.com/downloads/zulu/zulu-windows/); Oracle JDK8 for Windows is available at Oracle website.
JDK should be downloaded manually, and the following environment variables should be set properly in the Developer Command Prompt for Visual Studio:
The following environment variables should be set properly in the Developer Command Prompt for Visual Studio:
* `JAVA_HOME`
@ -40,16 +40,43 @@ JDK should be downloaded manually, and the following environment variables shoul
* **scripts** ( `sparkclr-submit.cmd` )
* **data** ( `Mobius\csharp\Samples\Microsoft.Spark.CSharp\data\*` )
# Running Unit Tests
* In Visual Studio: Install NUnit3 Test Adapter. Run the tests through "Test" -> "Run" -> "All Tests"
* Install NUnit Runner 3.0 or above using NuGet (see [https://www.nuget.org/packages/NUnit.Runners/](https://www.nuget.org/packages/NUnit.Runners/)). In Developer Command Prompt for VS, set `NUNITCONSOLE` to the path to nunit console, and navigate to `Mobius\csharp` and run the following command:
```
Test.cmd
```
# Running Samples
Samples demonstrate comprehensive usage of the Mobius API and also serve as functional tests for the API. The following options are available for running samples:
* [Local mode](#running-in-local-mode)
* [Standalone cluster](#running-in-standalone-mode)
* [YARN cluster](#running-in-yarn-mode)
* [Local mode dev environment](#running-in-local-mode-dev-environment) (using artifacts built in the local Git repo)
## Prerequisites
JDK should be downloaded manually, and the following environment variables should be set properly in the Developer Command Prompt for Visual Studio:
* `JAVA_HOME`
The prerequisites for running Mobius samples are the same as those for running any other Mobius application. Refer to the [instructions](.\running-mobius-app.md#pre-requisites) for details. The [Local mode dev environment](#running-in-local-mode-dev-environment) option makes it easier to run samples in a dev environment because it downloads Spark for you.
## Running in Local mode
```
sparkclr-submit.cmd --verbose --jars c:\MobiusRelease\dependencies\spark-csv_2.10-1.3.0.jar,c:\MobiusRelease\dependencies\commons-csv-1.1.jar --exe SparkCLRSamples.exe c:\MobiusRelease\samples sparkclr.sampledata.loc c:\MobiusRelease\samples\data
```
## Running in Standalone mode
```
sparkclr-submit.cmd --verbose --master spark://host:port --jars <hdfs path to spark-csv_2.10-1.3.0.jar,commons-csv-1.1.jar> --exe SparkCLRSamples.exe %SPARKCLR_HOME%\samples sparkclr.sampledata.loc hdfs://path/to/mobius/sampledata
```
- When option `--deploy-mode` is specified with `cluster`, option `--remote-sparkclr-jar` is required and needs to be specified with a valid file path of spark-clr*.jar on HDFS.
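The rule above can be checked before submitting a job. The sketch below is a hypothetical pre-flight helper written for a POSIX shell (for use alongside `sparkclr-submit.sh`); `check_cluster_args` is not part of Mobius, and the `hdfs://path/to/spark-clr.jar` value is a placeholder.

```shell
# Sketch of a pre-flight check (not part of Mobius): fail early when
# "--deploy-mode cluster" is given without "--remote-sparkclr-jar".
check_cluster_args() {
  cluster=0
  remote_jar=0
  prev=""
  for arg in "$@"; do
    # "cluster" counts only when it directly follows --deploy-mode.
    if [ "$prev" = "--deploy-mode" ] && [ "$arg" = "cluster" ]; then
      cluster=1
    fi
    if [ "$arg" = "--remote-sparkclr-jar" ]; then
      remote_jar=1
    fi
    prev="$arg"
  done
  if [ "$cluster" -eq 1 ] && [ "$remote_jar" -eq 0 ]; then
    echo "--remote-sparkclr-jar is required with --deploy-mode cluster" >&2
    return 1
  fi
  return 0
}

# Usage: validate the argument list, then hand the same arguments to the submit script.
check_cluster_args --verbose --master spark://host:port --deploy-mode cluster \
  --remote-sparkclr-jar hdfs://path/to/spark-clr.jar --exe SparkCLRSamples.exe \
  && echo "arguments look consistent"
```

The helper inspects only the presence of the two options; it does not verify that the jar path actually exists on HDFS.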
## Running in YARN mode
```
sparkclr-submit.cmd --verbose --master yarn-cluster --jars <hdfs path to spark-csv_2.10-1.3.0.jar,commons-csv-1.1.jar> --exe SparkCLRSamples.exe %SPARKCLR_HOME%\samples sparkclr.sampledata.loc hdfs://path/to/mobius/sampledata
```
## Running in local mode dev environment
In the Developer Command Prompt for Visual Studio where `JAVA_HOME` is set properly, navigate to [Mobius\build](../build/) directory:
```
@ -78,25 +105,3 @@ A few more [RunSamples.cmd](../build/localmode/RunSamples.cmd) examples:
```
RunSamples.cmd --torun pi* --verbose
```
## Running in Standalone mode
```
sparkclr-submit.cmd --verbose --master spark://host:port --exe SparkCLRSamples.exe %SPARKCLR_HOME%\samples sparkclr.sampledata.loc hdfs://path/to/mobius/sampledata
```
- When option `--deploy-mode` is specified with `cluster`, option `--remote-sparkclr-jar` is required and needs to be specified with a valid file path of spark-clr*.jar on HDFS.
## Running in YARN mode
```
sparkclr-submit.cmd --verbose --master yarn-cluster --exe SparkCLRSamples.exe %SPARKCLR_HOME%\samples sparkclr.sampledata.loc hdfs://path/to/mobius/sampledata
```
# Running Unit Tests
* In Visual Studio: Install NUnit3 Test Adapter. Run the tests through "Test" -> "Run" -> "All Tests"
* Install NUnit Runner 3.0 or above using NuGet (see [https://www.nuget.org/packages/NUnit.Runners/](https://www.nuget.org/packages/NUnit.Runners/)). In Developer Command Prompt for VS, set `NUNITCONSOLE` to the path to nunit console, and navigate to `Mobius\csharp` and run the following command:
```
Test.cmd
```