Samples Revamp: Readme Changes (#322)

2019-12-11 09:08:50 -08:00 · 2019-12-11 09:08:50 -08:00 · 9a2903628a
--- a/examples/Microsoft.Spark.CSharp.Examples/MachineLearning/README.md
+++ b/examples/Microsoft.Spark.CSharp.Examples/MachineLearning/README.md
@ -0,0 +1,53 @@
+# .NET for Apache Spark C# Samples: Machine Learning
+
+[.NET for Apache Spark](https://dot.net/spark) is a free, open-source, and cross-platform big data analytics framework.
+
+In the **Machine Learning** folder, we provide C# samples which will help you incorporate machine learning into your big data apps.
+We typically incorporate machine learning with big data to scale the training and/or prediction of machine learning algorithms.
+
+We incorporate machine learning into our .NET for Apache Spark apps by using [ML.NET](https://dot.net/ml), 
+an open source and cross-platform machine learning framework for .NET developers.
+
+For each sample, we have a folder that contains a C# app and a README.md explaining the sample.
+
+<table>
+ <tr>
+   <td width="25%">
+      <h4><b>Sample Name</b></h4>
+  </td>
+  <td width="35%">
+      <h4><b>Description</b></h4>
+  </td>
+  <td>
+      <h4><b>Link</b></h4>
+  </td>
+ </tr>
+ <tr>
+   <td width="25%">
+      <h4>Batch Sentiment Analysis</h4>
+  </td>
+  <td width="35%">
+  Determine if a batch of online reviews are positive or negative, using ML.NET.
+  </td>
+    <td>
+      <h4><a href="Sentiment">Sentiment</a> &nbsp; &nbsp;</h4>
+  </td>
+ </tr>
+  <tr>
+   <td width="25%">
+      <h4>Streaming Sentiment Analysis</h4>
+  </td>
+  <td width="35%">
+  Determine if statements being produced live are positive or negative, using ML.NET.
+  </td>
+    <td>
+      <h4><a href="SentimentStream">SentimentStream</a> &nbsp; &nbsp;</h4>
+  </td>
+ </tr>
+ </table>
+ 
+ ## Additional Resources
+
+To learn more about combining .NET for Apache Spark with machine learning, check out [this video](https://channel9.msdn.com/Series/NET-for-Apache-Spark-101/Sentiment-Analysis-with-NET-for-Apache-Spark-and-MLNET-Part-1) from the .NET for Apache Spark 101 video series to see a demo coded and run live.
+
+You can also [check out the Spark + ML demos and explanation](https://youtu.be/ZWsYMQ0Sw1o?t=906) from the .NET for Apache Spark session at .NET Conf 2019!
--- a/examples/Microsoft.Spark.CSharp.Examples/MachineLearning/Sentiment/README.md
+++ b/examples/Microsoft.Spark.CSharp.Examples/MachineLearning/Sentiment/README.md
@ -202,3 +202,9 @@ Check out the [full coding example](./Program.cs). You can also view a live vide
 Rather than performing batch processing (analyzing data that's already been stored), we can adapt our Spark + ML.NET app to instead perform real-time processing with structured streaming.

 Check out [SentimentStream](../SentimentStream) to see the adapted version of the sentiment analysis program that will determine the sentiment of text live as it's typed into a terminal.
+
+## Citations
+
+**UCI Machine Learning Repository citation:** Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
+
+**Sentiment Labelled Sentences Data Set citation:** 'From Group to Individual Labels using Deep Features', Kotzias et al., KDD 2015
--- a/examples/Microsoft.Spark.CSharp.Examples/MachineLearning/SentimentStream/README.md
+++ b/examples/Microsoft.Spark.CSharp.Examples/MachineLearning/SentimentStream/README.md
@ -219,8 +219,14 @@ spark-submit --class org.apache.spark.deploy.dotnet.DotnetRunner --master local

 ## Next Steps

-Checkout the [full coding example](./Program.cs). You can also view a live video explanation of ML.NET + .NET for Spark in the [Bringing Big Data Analytics through Apache Spark to .NET](https://youtu.be/ZWsYMQ0Sw1o?t=1358) session from **.NET Conf 2019.**
+Check out the [full coding example](./Program.cs). You can also view a live video explanation of ML.NET + .NET for Spark in the [Bringing Big Data Analytics through Apache Spark to .NET](https://youtu.be/ZWsYMQ0Sw1o?t=1358) session from **.NET Conf 2019.**

 Rather than performing real-time processing, we can adapt our Spark + ML.NET app to instead perform batch processing (analyzing data that's already been stored).

 Check out [Sentiment](../Sentiment) to see the adapted version of the sentiment analysis program that will determine the sentiment of text from a batch of online reviews.
+
+## Citations
+
+**UCI Machine Learning Repository citation:** Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
+
+**Sentiment Labelled Sentences Data Set citation:** 'From Group to Individual Labels using Deep Features', Kotzias et al., KDD 2015
--- a/examples/Microsoft.Spark.CSharp.Examples/README.md
+++ b/examples/Microsoft.Spark.CSharp.Examples/README.md
@ -0,0 +1,74 @@
+# .NET for Apache Spark C# Samples
+
+[.NET for Apache Spark](https://dot.net/spark) is a free, open-source, and cross-platform big data analytics framework.
+
+In the **Microsoft.Spark.CSharp.Examples** folder, we provide C# samples which will help you get started with .NET for Apache Spark
+and demonstrate how to infuse big data analytics into existing and new .NET apps. 
+
+There are three main types of samples/apps in the repo:
+
+* **[SQL/Batch](Sql/Batch):** .NET for Apache Spark apps that analyze batch data, or data that has already been produced/stored.
+
+* **[SQL/Streaming](Sql/Streaming):** .NET for Apache Spark apps that analyze structured streaming data, or data that is currently being produced live.
+
+* **[Machine Learning](MachineLearning):** .NET for Apache Spark apps infused with Machine Learning models based on [ML.NET](http://dot.net/ml),
+an open source and cross-platform machine learning framework.
+
+<table >
+  <tr>
+    <td align="middle" colspan="2"><b>Batch Processing</b></td>
+  </tr>
+  <tr>
+    <td align="middle"><a href="Sql/Batch/Basic.cs"><b>Basic.cs</b></a><br>A simple example demonstrating basic Spark SQL features.<br></td>
+    <td align="middle"><a href="Sql/Batch/Datasource.cs"><b>Datasource.cs</b></a><br>Example demonstrating reading from various data sources.<br></td>
+  </tr>
+  <tr>
+    <td align="middle"><a href="Sql/Batch/GitHubProjects.cs"><b>GitHubProjects.cs</b></a><br>Example analyzing GitHub projects data.<br></td>
+    <td align="middle"><a href="Sql/Batch/Logging.cs"><b>Logging.cs</b></a><br>Example demonstrating log processing.<br></td>
+  </tr>
+  <tr>
+    <td align="middle"><a href="Sql/Batch/VectorUdfs.cs"><b>VectorUdfs.cs</b></a><br>Example using vectorized UDFs to improve query performance.<br></td>
+  </tr>
+</table>
+
+<br>
+
+<table >
+  <tr>
+    <td align="middle" colspan="2"><b>Structured Streaming</b></td>
+  </tr>
+  <tr>
+    <td align="middle"><a href="Sql/Streaming/StructuredNetworkWordCount.cs"><b>StructuredNetworkWordCount.cs</b></a><br>Simple word count app that connects to and analyzes a live data stream (like netcat).<br></td>
+    <td align="middle"><a href="Sql/Streaming/StructuredNetworkWordCountWindowed.cs"><b>StructuredNetworkWordCountWindowed.cs</b></a><br>Windowed word count app.<br></td>
+  </tr>
+  <tr>
+    <td align="middle"><a href="Sql/Streaming/StructuredKafkaWordCount.cs"><b>StructuredKafkaWordCount.cs</b></a><br>Word count on data from Kafka.<br></td>
+    <td align="middle"><a href="Sql/Streaming/StructuredNetworkCharacterCount.cs"><b>StructuredNetworkCharacterCount.cs</b></a><br>Count the number of characters in each string read from a stream, demonstrating the power of UDFs + stream processing.<br></td>
+  </tr>
+</table>
+
+<br>
+
+<table >
+  <tr>
+    <td align="middle" colspan="2"><b>Machine Learning</b></td>
+  </tr>
+  <tr>
+    <td align="middle"><a href="MachineLearning/Sentiment/Program.cs"><b>Batch Sentiment Analysis</b></a><br>Determine if a batch of online reviews are positive or negative, using ML.NET.<br></td>
+    <td align="middle"><a href="MachineLearning/SentimentStream/Program.cs"><b>Streaming Sentiment Analysis</b></a><br>Determine if statements being produced live are positive or negative, using ML.NET.<br></td>
+  </tr>
+</table>
+
+### Other Files in the Folder
+
+Beyond the sample apps, there are a few other files in the **Microsoft.Spark.CSharp.Examples** folder:
+
+* **IExample.cs:** A common interface each sample implements to help provide consistency when creating/running sample apps.
+> Note: When you create and run sample apps outside this repository's project, you do not need to use IExample.cs; it only provides consistency across the apps included in this repo.
+
+* **Microsoft.Spark.CSharp.Examples.csproj:** The C# project file necessary for building/running all sample apps. It includes target
+frameworks, assembly information, and references to other C# project files used by the sample apps.
+
+* **Program.cs:** A common entry point when running our sample apps (it contains the Main method). It helps us print error messages in cases such as a sample being run without the necessary arguments.
+
+* **README.md:** The doc you are currently reading.
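
To illustrate the roles of `IExample.cs` and `Program.cs` described in this README, here is a minimal C# sketch of a shared sample interface plus a single entry point that dispatches to a sample by name and reports missing arguments. Everything in it is an assumption made for illustration: the `ExamplesSketch` namespace, the `HelloExample` sample, the `ExampleRunner` helper, and the `Run(string[] args)` signature are hypothetical, and the repository's actual files may be shaped differently (for example, they may discover samples in another way).

```csharp
using System;
using System.Collections.Generic;

namespace ExamplesSketch
{
    // Hypothetical shape of the common interface every sample implements.
    public interface IExample
    {
        void Run(string[] args);
    }

    // Hypothetical sample, used only to show how a sample validates its arguments.
    public sealed class HelloExample : IExample
    {
        public void Run(string[] args)
        {
            if (args.Length != 1)
            {
                throw new ArgumentException("Usage: Hello <name>");
            }

            Console.WriteLine($"Hello, {args[0]}!");
        }
    }

    // Hypothetical dispatcher: maps a sample name to a factory for that sample.
    public static class ExampleRunner
    {
        private static readonly Dictionary<string, Func<IExample>> s_examples =
            new Dictionary<string, Func<IExample>>(StringComparer.OrdinalIgnoreCase)
            {
                ["Hello"] = () => new HelloExample(),
            };

        // Returns 0 when the sample ran, 1 when the name or arguments were invalid.
        public static int Run(string[] args)
        {
            if (args.Length == 0 ||
                !s_examples.TryGetValue(args[0], out Func<IExample> create))
            {
                Console.Error.WriteLine("Usage: <example name> [example args]");
                Console.Error.WriteLine(
                    "Available examples: " + string.Join(", ", s_examples.Keys));
                return 1;
            }

            // Pass everything after the sample name on to the sample itself.
            string[] exampleArgs = new string[args.Length - 1];
            Array.Copy(args, 1, exampleArgs, 0, exampleArgs.Length);

            try
            {
                create().Run(exampleArgs);
                return 0;
            }
            catch (ArgumentException e)
            {
                // The sample rejected its arguments; print its usage message.
                Console.Error.WriteLine(e.Message);
                return 1;
            }
        }
    }

    public static class Program
    {
        public static int Main(string[] args) => ExampleRunner.Run(args);
    }
}
```

In this sketch, running the app with `Hello Spark` executes the sample, while running it with no arguments, an unknown name, or `Hello` alone prints a usage message and returns a non-zero exit code. Keeping the interface small is what lets one entry point serve every sample.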
--- a/examples/Microsoft.Spark.CSharp.Examples/Sql/Batch/README.md
+++ b/examples/Microsoft.Spark.CSharp.Examples/Sql/Batch/README.md
@ -0,0 +1,85 @@
+# .NET for Apache Spark C# Samples: Batch
+
+[.NET for Apache Spark](https://dot.net/spark) is a free, open-source, and cross-platform big data analytics framework.
+
+In the **Batch** folder, we provide C# samples which will help you get started with one of the fundamental big data analytics scenarios:
+**batch processing.** Batch processing means we're analyzing data that has already been stored (such as in a database, csv, or text file).
+
+For each sample, we have a C# app and, for some of the more complex apps, a README.md explaining the sample.
+
+<table>
+ <tr>
+   <td width="25%">
+      <h4><b>Sample Name</b></h4>
+  </td>
+  <td width="35%">
+      <h4><b>Description</b></h4>
+  </td>
+  <td>
+      <h4><b>Links</b></h4>
+  </td>
+ </tr>
+ <tr>
+   <td width="25%">
+      <h4>Basic.cs</h4>
+  </td>
+  <td width="35%">
+  A simple example demonstrating basic Spark SQL features.
+  </td>
+    <td>
+      <h4><a href="Basic.cs">Basic.cs</a> &nbsp; &nbsp;</h4>
+  </td>
+ </tr>
+  <tr>
+   <td width="25%">
+      <h4>Datasource.cs</h4>
+  </td>
+  <td width="35%">
+  Example demonstrating reading from various data sources.
+  </td>
+    <td>
+      <h4><a href="Datasource.cs">Datasource.cs</a> &nbsp; &nbsp;</h4>
+  </td>
+ </tr>
+ <tr>
+   <td width="25%">
+      <h4>VectorUdfs.cs</h4>
+  </td>
+  <td width="35%">
+  Example using vectorized UDFs to improve query performance.
+  </td>
+    <td>
+      <h4><a href="VectorUdfs.cs">VectorUdfs.cs</a> &nbsp; &nbsp;</h4>
+  </td>
+ </tr>
+ <tr>
+   <td width="25%">
+      <h4>GitHubProjects.cs</h4>
+  </td>
+  <td width="35%">
+  Example analyzing GitHub projects data.
+  </td>
+    <td>
+      <h4><a href="readmes/GitHubProjectsReadme.md">ReadMe</a> &nbsp;&nbsp;&nbsp;
+      <a href="GitHubProjects.cs">GitHubProjects.cs</a> &nbsp; &nbsp;</h4>
+  </td>
+ </tr>
+  <tr>
+   <td width="25%">
+      <h4>Logging.cs</h4>
+  </td>
+  <td width="35%">
+  Example demonstrating log processing.
+  </td>
+    <td>
+      <h4><a href="readmes/LoggingReadme.md">ReadMe</a> &nbsp;&nbsp;&nbsp;
+      <a href="Logging.cs">Logging.cs</a> &nbsp; &nbsp;</h4>
+  </td>
+ </tr>
+ </table>
+
+## Additional Resources
+
+To learn more about batch processing with .NET for Apache Spark, check out [this video](https://channel9.msdn.com/Series/NET-for-Apache-Spark-101/Batch-Processing-with-NET-for-Apache-Spark) from the .NET for Apache Spark 101 video series to see the GitHub projects batch demo coded and run live.
+
+You can also [check out the demos and explanation](https://youtu.be/ZWsYMQ0Sw1o?t=304) from the .NET for Apache Spark session at .NET Conf 2019!
--- a/examples/Microsoft.Spark.CSharp.Examples/Sql/Batch/projects_smaller.csv
+++ b/examples/Microsoft.Spark.CSharp.Examples/Sql/Batch/projects_smaller.csv
--- a/examples/Microsoft.Spark.CSharp.Examples/Sql/Batch/readmes/GitHubProjectsReadme.md
+++ b/examples/Microsoft.Spark.CSharp.Examples/Sql/Batch/readmes/GitHubProjectsReadme.md
@ -16,7 +16,9 @@ or changing.

 The data used in this example was generated from [GHTorrent](http://ghtorrent.org/), which monitors all public GitHub events (such as info about projects, commits, and watchers), stores the events and their structure in databases, and then releases data collected over different time periods as downloadable archives. 

-The dataset used when creating this sample was [downloaded from the GHTorrent archives](http://ghtorrent.org/downloads.html). Specifically, the **projects.csv** file was extracted from one of the latest MySQL dumps. For analysis that only takes a few seconds in demos, projects.csv was shortened to only a few GB, and thus the dataset is called **projects_smaller.csv** throughout this sample.
+The dataset used when creating this sample was [downloaded from the GHTorrent archives](http://ghtorrent.org/downloads.html). Specifically, the **projects.csv** file was extracted from one of the latest MySQL dumps. So that analysis takes only a few seconds in demos, projects.csv was shortened, and the resulting dataset is called **[projects_smaller.csv](../projects_smaller.csv)** throughout this sample.
+
+The GHTorrent dataset is distributed under a dual licensing scheme ([Creative Commons +](https://wiki.creativecommons.org/wiki/CCPlus)). For non-commercial uses (including, but not limited to, educational, research or personal uses), the dataset is distributed under the [CC-BY-SA license](https://creativecommons.org/licenses/by-sa/4.0/).

 ## Solution

@ -72,3 +74,5 @@ spark-submit --class org.apache.spark.deploy.dotnet.DotnetRunner --master local
 ## Next Steps

 View the [full coding example](../GitHubProjects.cs) to see an example of prepping and analyzing GitHub data.
+
+You can also view a live video explanation of this app and batch processing overall in the [.NET for Apache Spark 101 video series](https://www.youtube.com/watch?v=i_NvL8p_KZg&list=PLdo4fOcmZ0oXklB5hhg1G1ZwOJTEjcQ5z&index=4&t=3s).
--- a/examples/Microsoft.Spark.CSharp.Examples/Sql/README.md
+++ b/examples/Microsoft.Spark.CSharp.Examples/Sql/README.md
@ -0,0 +1,12 @@
+# .NET for Apache Spark C# Samples: SQL
+
+[.NET for Apache Spark](https://dot.net/spark) is a free, open-source, and cross-platform big data analytics framework.
+
+In the **Sql** folder, we provide samples focusing on Spark SQL, which allows us to work with structured data. We can store and analyze data using
+the `DataFrame` API and SQL queries.
+
+There are two categories of .NET for Apache Spark SQL samples:
+
+* **[Batch](Batch):** .NET for Apache Spark apps that analyze batch data, or data that has already been produced/stored.
+
+* **[Streaming](Streaming):** .NET for Apache Spark apps that analyze structured streaming data, or data that is currently being produced live.
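
The Sql README above says structured data can be analyzed with both the `DataFrame` API and SQL queries. The following is a minimal sketch of the same query written both ways, not code taken from the samples. It assumes the `Microsoft.Spark` NuGet package is referenced, that the app is launched through `spark-submit`, and that the first argument is the path to a JSON file whose records have `name` and `age` fields; the `SqlSketch` namespace, the `people` view name, and the input file are hypothetical.

```csharp
using System;
using Microsoft.Spark.Sql;

namespace SqlSketch
{
    public static class Program
    {
        public static void Main(string[] args)
        {
            // Assumption: args[0] points to a JSON file with "name" and "age" fields.
            if (args.Length != 1)
            {
                Console.Error.WriteLine("Usage: SqlSketch <path to JSON file>");
                Environment.Exit(1);
            }

            SparkSession spark = SparkSession
                .Builder()
                .AppName("Spark SQL sketch")
                .GetOrCreate();

            // Load the stored (batch) data into a DataFrame.
            DataFrame people = spark.Read().Json(args[0]);

            // Option 1: express the query with DataFrame API calls.
            DataFrame adultsViaApi = people
                .Filter(people["age"] > 21)
                .Select("name", "age");
            adultsViaApi.Show();

            // Option 2: register the DataFrame as a temporary view and use a SQL query.
            people.CreateOrReplaceTempView("people");
            DataFrame adultsViaSql =
                spark.Sql("SELECT name, age FROM people WHERE age > 21");
            adultsViaSql.Show();

            spark.Stop();
        }
    }
}
```

Both options should select the same rows; which one to use is mostly a matter of whether the query is easier to read as method calls or as SQL text.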
--- a/examples/README.md
+++ b/examples/README.md
@ -0,0 +1,14 @@
+# .NET for Apache Spark Samples
+
+[.NET for Apache Spark](https://dot.net/spark) is a free, open-source, and cross-platform big data analytics framework.
+
+In the **examples** folder, we provide samples which will help you get started with .NET for Apache Spark
+and demonstrate how to infuse big data analytics into existing and new .NET apps. 
+
+There are two broad categories of .NET for Apache Spark samples:
+
+* **[Microsoft.Spark.CSharp.Examples](Microsoft.Spark.CSharp.Examples):** Sample C# .NET for Apache Spark apps.
+
+* **[Microsoft.Spark.FSharp.Examples](Microsoft.Spark.FSharp.Examples):** Sample F# .NET for Apache Spark apps.
+
+**Note:** The samples in each of these folders fall under additional sub-categories, such as batch, streaming, and machine learning.