Samples Revamp: Readme Changes (#322)

Brigit Murtaugh 2019-12-11 09:08:50 -08:00 committed by Terry Kim
Parent c49875a4ca
Commit 9a2903628a
9 changed files: 1257 additions and 2 deletions

View file

@ -0,0 +1,53 @@
# .NET for Apache Spark C# Samples: Machine Learning
[.NET for Apache Spark](https://dot.net/spark) is a free, open-source, and cross-platform big data analytics framework.
In the **Machine Learning** folder, we provide C# samples which will help you incorporate machine learning into your big data apps.
We typically incorporate machine learning with big data to scale the training and/or prediction of machine learning algorithms.
We incorporate machine learning into our .NET for Apache Spark apps by using [ML.NET](https://dot.net/ml),
an open source and cross-platform machine learning framework for .NET developers.
For each sample, we have a folder that contains a C# app and a README.md explaining the sample.
<table>
<tr>
<td width="25%">
<h4><b>Sample Name</b></h4>
</td>
<td>
<h4 width="35%"><b>Description</b></h4>
</td>
<td>
<h4><b>Link</b></h4>
</td>
</tr>
<tr>
<td width="25%">
<h4>Batch Sentiment Analysis</h4>
</td>
<td width="35%">
Determine if a batch of online reviews are positive or negative, using ML.NET.
</td>
<td>
<h4><a href="Sentiment">Sentiment</a> &nbsp; &nbsp;</h4>
</td>
</tr>
<tr>
<td width="25%">
<h4>Streaming Sentiment Analysis</h4>
</td>
<td width="35%">
Determine if statements being produced live are positive or negative, using ML.NET.
</td>
<td>
<h4><a href="SentimentStream">SentimentStream</a> &nbsp; &nbsp;</h4>
</td>
</tr>
</table>
## Additional Resources
To learn more about combining .NET for Apache Spark with machine learning, check out [this video](https://channel9.msdn.com/Series/NET-for-Apache-Spark-101/Sentiment-Analysis-with-NET-for-Apache-Spark-and-MLNET-Part-1) from the .NET for Apache Spark 101 video series to see a demo coded and run live.
You can also [check out the Spark + ML demos and explanation](https://youtu.be/ZWsYMQ0Sw1o?t=906) from the .NET for Apache Spark session at .NET Conf 2019!

View file

@ -202,3 +202,9 @@ Check out the [full coding example](./Program.cs). You can also view a live vide
Rather than performing batch processing (analyzing data that's already been stored), we can adapt our Spark + ML.NET app to instead perform real-time processing with structured streaming.
Check out [SentimentStream](../SentimentStream) to see the adapted version of the sentiment analysis program that will determine the sentiment of text live as it's typed into a terminal.
## Citations
**UCI Machine Learning Repository citation:** Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
**Sentiment Labelled Sentences Data Set citation:** 'From Group to Individual Labels using Deep Features', Kotzias et al., KDD 2015

View file

@ -219,8 +219,14 @@ spark-submit --class org.apache.spark.deploy.dotnet.DotnetRunner --master local
## Next Steps
Checkout the [full coding example](./Program.cs). You can also view a live video explanation of ML.NET + .NET for Spark in the [Bringing Big Data Analytics through Apache Spark to .NET](https://youtu.be/ZWsYMQ0Sw1o?t=1358) session from **.NET Conf 2019.**
Check out the [full coding example](./Program.cs). You can also view a live video explanation of ML.NET + .NET for Spark in the [Bringing Big Data Analytics through Apache Spark to .NET](https://youtu.be/ZWsYMQ0Sw1o?t=1358) session from **.NET Conf 2019.**
Rather than performing real-time processing, we can adapt our Spark + ML.NET app to instead perform batch processing (analyzing data that's already been stored).
Check out [Sentiment](../Sentiment) to see the adapted version of the sentiment analysis program that will determine the sentiment of text from a batch of online reviews.
## Citations
**UCI Machine Learning Repository citation:** Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
**Sentiment Labelled Sentences Data Set citation:** 'From Group to Individual Labels using Deep Features', Kotzias et al., KDD 2015

View file

@ -0,0 +1,74 @@
# .NET for Apache Spark C# Samples
[.NET for Apache Spark](https://dot.net/spark) is a free, open-source, and cross-platform big data analytics framework.
In the **Microsoft.Spark.CSharp.Examples** folder, we provide C# samples which will help you get started with .NET for Apache Spark
and demonstrate how to infuse big data analytics into existing and new .NET apps.
There are three main types of samples/apps in the repo:
* **[SQL/Batch](Sql/Batch):** .NET for Apache Spark apps that analyze batch data, or data that has already been produced/stored.
* **[SQL/Streaming](Sql/Streaming):** .NET for Apache Spark apps that analyze structured streaming data, or data that is currently being produced live.
* **[Machine Learning](MachineLearning):** .NET for Apache Spark apps infused with Machine Learning models based on [ML.NET](http://dot.net/ml),
an open source and cross-platform machine learning framework.
<table >
<tr>
<td align="middle" colspan="2"><b>Batch Processing</b></td>
</tr>
<tr>
<td align="middle"><a href="Sql/Batch/Basic.cs"><b>Basic.cs</b></a><br>A simple example demonstrating basic Spark SQL features.<br></td>
<td align="middle"><a href="Sql/Batch/Datasource.cs"><b>Datasource.cs</b></a><br>Example demonstrating reading from various data sources.<br></td>
</tr>
<tr>
<td align="middle"><a href="Sql/Batch/GitHubProjects.cs"><b>GitHubProjects.cs</b></a><br>Example analyzing GitHub projects data.<br></td>
<td align="middle"><a href="Sql/Batch/Logging.cs"><b>Logging.cs</b></a><br>Example demonstrating log processing.<br></td>
</tr>
<tr>
<td align="middle"><a href="Sql/Batch/VectorUdfs.cs"><b>VectorUdfs.cs</b></a><br>Example using vectorized UDFs to improve query performance.<br></td>
</tr>
</table>
<br>
<table >
<tr>
<td align="middle" colspan="2"><b>Structured Streaming</b></td>
</tr>
<tr>
<td align="middle"><a href="Sql/Streaming/StructuredNetworkWordCount.cs"><b>StructuredNetworkWordCount.cs</b></a><br>Simple word count app that connects to and analyzes a live data stream (like netcat).<br></td>
<td align="middle"><a href="Sql/Streaming/StructuredNetworkWordCountWindowed.cs"><b>StructuredNetworkWordCountWindowed.cs</b></a><br>Windowed word count app.<br></td>
</tr>
<tr>
<td align="middle"><a href="Sql/Streaming/StructuredKafkaWordCount.cs"><b>StructuredKafkaWordCount.cs</b></a><br>Word count on data from Kafka.<br></td>
<td align="middle"><a href="Sql/Streaming/StructuredNetworkCharacterCount.cs"><b>StructuredNetworkCharacterCount.cs</b></a><br>Count the number of characters in each string read from a stream, demonstrating the power of UDFs + stream processing.<br></td>
</tr>
</table>
<br>
<table >
<tr>
<td align="middle" colspan="2"><b>Machine Learning</b></td>
</tr>
<tr>
<td align="middle"><a href="MachineLearning/Sentiment/Program.cs"><b>Batch Sentiment Analysis</b></a><br>Determine if a batch of online reviews are positive or negative, using ML.NET.<br></td>
<td align="middle"><a href="MachineLearning/SentimentStream/Program.cs"><b>Streaming Sentiment Analysis</b></a><br>Determine if statements being produced live are positive or negative, using ML.NET.<br></td>
</tr>
</table>
### Other Files in the Folder
Beyond the sample apps, there are a few other files in the **Microsoft.Spark.CSharp.Examples** folder:
* **IExample.cs:** A common interface each sample implements to help provide consistency when creating/running sample apps.
> Note: When you create and run sample apps beyond this repository's project, you do not need to use IExample.cs - it just provides consistency for all the apps included in this repo.
* **Microsoft.Spark.CSharp.Examples.csproj:** The C# project file necessary for building/running all sample apps. It includes target
frameworks, assembly information, and references to other C# project files referenced in the sample apps.
* **Program.cs:** A common entry-point when running our sample apps (it contains the Main method). Helps us print error messages in cases such as a project lacking the necessary arguments.
* **README.md:** The doc you are currently reading.
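
The roles described above for **IExample.cs** and **Program.cs** can be pictured with a minimal, dependency-free sketch. Everything here is an assumption made for illustration: the `Run(string[] args)` signature, the `HelloBatch` sample, and the dictionary-based lookup in `ExampleRunner` are hypothetical, and the actual files in this folder may be organized differently.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for IExample.cs; the real interface may differ.
public interface IExample
{
    void Run(string[] args);
}

// Toy sample (hypothetical) that only reports how many arguments it received.
public sealed class HelloBatch : IExample
{
    public void Run(string[] args) =>
        Console.WriteLine($"HelloBatch received {args.Length} argument(s).");
}

public static class ExampleRunner
{
    // Maps a sample name to a factory for it. This lookup table is an assumption;
    // the real Program.cs may locate samples another way.
    private static readonly Dictionary<string, Func<IExample>> s_examples =
        new Dictionary<string, Func<IExample>>(StringComparer.OrdinalIgnoreCase)
        {
            ["HelloBatch"] = () => new HelloBatch(),
        };

    // Returns true if a sample was found and run; prints usage and returns false otherwise.
    public static bool TryRun(string[] args)
    {
        if (args.Length == 0 || !s_examples.TryGetValue(args[0], out Func<IExample> factory))
        {
            Console.Error.WriteLine("Usage: <example-name> [example args]");
            Console.Error.WriteLine("Available examples: " + string.Join(", ", s_examples.Keys));
            return false;
        }

        // Everything after the sample name is passed through to the sample itself.
        string[] rest = new string[args.Length - 1];
        Array.Copy(args, 1, rest, 0, rest.Length);
        factory().Run(rest);
        return true;
    }
}

public static class Program
{
    // Common entry point: exit code 0 on success, 1 when the arguments are missing or unknown.
    public static int Main(string[] args) => ExampleRunner.TryRun(args) ? 0 : 1;
}
```

Because every sample implements one shared interface, the entry point can handle argument checks and error messages in one place, which is the consistency benefit the list above describes.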

View file

@ -0,0 +1,85 @@
# .NET for Apache Spark C# Samples: Batch
[.NET for Apache Spark](https://dot.net/spark) is a free, open-source, and cross-platform big data analytics framework.
In the **Batch** folder, we provide C# samples which will help you get started with one of the fundamental big data analytics scenarios:
**batch processing.** Batch processing means we're analyzing data that has already been stored (such as in a database, csv, or text file).
For each sample, we have a C# app and, for some of the more complex apps, a README.md explaining the sample.
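
As a rough illustration of batch processing, the sketch below reads an already-stored CSV file into a `DataFrame` and inspects it. It is not one of the samples in this folder: the app name and the expectation that the file has a header row are assumptions, and the code needs the `Microsoft.Spark` NuGet package and must be launched through `spark-submit` like the other samples.

```csharp
using System;
using Microsoft.Spark.Sql;

namespace BatchSketch
{
    public static class Program
    {
        public static void Main(string[] args)
        {
            if (args.Length != 1)
            {
                Console.Error.WriteLine("Usage: BatchSketch <path to csv file>");
                Environment.Exit(1);
            }

            // "batch-readme-sketch" is a hypothetical app name.
            SparkSession spark = SparkSession
                .Builder()
                .AppName("batch-readme-sketch")
                .GetOrCreate();

            // Batch read: the whole file already exists before the job starts.
            // Assumes the first row of the file holds column names.
            DataFrame df = spark.Read()
                .Option("header", true)
                .Option("inferSchema", true)
                .Csv(args[0]);

            // Inspect the inferred columns, the number of rows, and the first few rows.
            df.PrintSchema();
            Console.WriteLine($"Row count: {df.Count()}");
            df.Show(5);

            spark.Stop();
        }
    }
}
```

The samples in the table below follow the same overall shape (create a session, load stored data, transform it, show or save the result), with more involved analysis in the middle.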
<table>
<tr>
<td width="25%">
<h4><b>Sample Name</b></h4>
</td>
<td>
<h4 width="35%"><b>Description</b></h4>
</td>
<td>
<h4><b>Links</b></h4>
</td>
</tr>
<tr>
<td width="25%">
<h4>Basic.cs</h4>
</td>
<td width="35%">
A simple example demonstrating basic Spark SQL features.
</td>
<td>
<h4><a href="Basic.cs">Basic.cs</a> &nbsp; &nbsp;</h4>
</td>
</tr>
<tr>
<td width="25%">
<h4>Datasource.cs</h4>
</td>
<td width="35%">
Example demonstrating reading from various data sources.
</td>
<td>
<h4><a href="Datasource.cs">Datasource.cs</a> &nbsp; &nbsp;</h4>
</td>
</tr>
<tr>
<td width="25%">
<h4>VectorUdfs.cs</h4>
</td>
<td width="35%">
Example using vectorized UDFs to improve query performance.
</td>
<td>
<h4><a href="VectorUdfs.cs">VectorUdfs.cs</a> &nbsp; &nbsp;</h4>
</td>
</tr>
<tr>
<td width="25%">
<h4>GitHubProjects.cs</h4>
</td>
<td width="35%">
Example analyzing GitHub projects data.
</td>
<td>
<h4><a href="readmes/GitHubProjectsReadme.md">ReadMe</a> &nbsp;&nbsp;&nbsp;
<a href="GitHubProjects.cs">GitHubProjects.cs</a> &nbsp; &nbsp;</h4>
</td>
</tr>
<tr>
<td width="25%">
<h4>Logging.cs</h4>
</td>
<td width="35%">
Example demonstrating log processing.
</td>
<td>
<h4><a href="readmes/LoggingReadme.md">ReadMe</a> &nbsp;&nbsp;&nbsp;
<a href="Logging.cs">Logging.cs</a> &nbsp; &nbsp;</h4>
</td>
</tr>
</table>
## Additional Resources
To learn more about batch processing with .NET for Apache Spark, check out [this video](https://channel9.msdn.com/Series/NET-for-Apache-Spark-101/Batch-Processing-with-NET-for-Apache-Spark) from the .NET for Apache Spark 101 video series to see the GitHub projects batch demo coded and run live.
You can also [check out the demos and explanation](https://youtu.be/ZWsYMQ0Sw1o?t=304) from the .NET for Apache Spark session at .NET Conf 2019!

File diff not shown because it is too large.

View file

@ -16,7 +16,9 @@ or changing.
The data used in this example was generated from [GHTorrent](http://ghtorrent.org/), which monitors all public GitHub events (such as info about projects, commits, and watchers), stores the events and their structure in databases, and then releases data collected over different time periods as downloadable archives.
The dataset used when creating this sample was [downloaded from the GHTorrent archives](http://ghtorrent.org/downloads.html). Specifically, the **projects.csv** file was extracted from one of the latest MySQL dumps. For analysis that only takes a few seconds in demos, projects.csv was shortened to only a few GB, and thus the dataset is called **projects_smaller.csv** throughout this sample.
The dataset used when creating this sample was [downloaded from the GHTorrent archives](http://ghtorrent.org/downloads.html). Specifically, the **projects.csv** file was extracted from one of the latest MySQL dumps. For analysis that only takes a few seconds in demos, projects.csv was shortened and thus the dataset is called **[projects_smaller.csv](../projects_smaller.csv)** throughout this sample.
The GHTorrent dataset is distributed under a dual licensing scheme ([Creative Commons +](https://wiki.creativecommons.org/wiki/CCPlus)). For non-commercial uses (including, but not limited to, educational, research or personal uses), the dataset is distributed under the [CC-BY-SA license](https://creativecommons.org/licenses/by-sa/4.0/).
## Solution
@ -72,3 +74,5 @@ spark-submit --class org.apache.spark.deploy.dotnet.DotnetRunner --master local
## Next Steps
View the [full coding example](../GitHubProjects.cs) to see an example of prepping and analyzing GitHub data.
You can also view a live video explanation of this app and batch processing overall in the [.NET for Apache Spark 101 video series](https://www.youtube.com/watch?v=i_NvL8p_KZg&list=PLdo4fOcmZ0oXklB5hhg1G1ZwOJTEjcQ5z&index=4&t=3s).

View file

@ -0,0 +1,12 @@
# .NET for Apache Spark C# Samples: SQL
[.NET for Apache Spark](https://dot.net/spark) is a free, open-source, and cross-platform big data analytics framework.
In the **Sql** folder, we provide samples focusing on Spark SQL, which allows us to work with structured data. We can store and analyze data using
the `DataFrame` API and SQL queries.
There are two categories of .NET for Apache Spark Sql samples:
* **[Batch](Batch):** .NET for Apache Spark apps that analyze batch data, or data that has already been produced/stored.
* **[Streaming](Streaming):** .NET for Apache Spark apps that analyze structured streaming data, or data that is currently being produced live.
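
To make the `DataFrame` API and SQL queries mentioned above concrete, here is a small sketch that runs the same filter both ways. It is illustrative only: the app name, the `numbers` view name, and the choice of `spark.Range` as a data source are assumptions picked so the example needs no input file. It requires the `Microsoft.Spark` NuGet package and a Spark installation to run.

```csharp
using Microsoft.Spark.Sql;

namespace SqlSketch
{
    public static class Program
    {
        public static void Main(string[] args)
        {
            // "sql-readme-sketch" is a hypothetical app name.
            SparkSession spark = SparkSession
                .Builder()
                .AppName("sql-readme-sketch")
                .GetOrCreate();

            // A tiny DataFrame with a single column named "id" holding 0 through 9.
            DataFrame numbers = spark.Range(0, 10);

            // DataFrame API: keep only the even ids.
            DataFrame evensViaApi = numbers.Filter("id % 2 = 0");
            evensViaApi.Show();

            // SQL: register the DataFrame as a temporary view, then query it by name.
            numbers.CreateOrReplaceTempView("numbers");
            DataFrame evensViaSql = spark.Sql("SELECT id FROM numbers WHERE id % 2 = 0");
            evensViaSql.Show();

            spark.Stop();
        }
    }
}
```

Both calls return a `DataFrame`, so results from SQL queries and from API calls can be mixed freely within the same app.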

14
examples/README.md Normal file
View file

@ -0,0 +1,14 @@
# .NET for Apache Spark Samples
[.NET for Apache Spark](https://dot.net/spark) is a free, open-source, and cross-platform big data analytics framework.
In the **examples** folder, we provide samples which will help you get started with .NET for Apache Spark
and demonstrate how to infuse big data analytics into existing and new .NET apps.
There are two broad categories of .NET for Apache Spark samples:
* **[Microsoft.Spark.CSharp.Examples](Microsoft.Spark.CSharp.Examples):** Sample C# .NET for Apache Spark apps.
* **[Microsoft.Spark.FSharp.Examples](Microsoft.Spark.FSharp.Examples):** Sample F# .NET for Apache Spark apps.
**Note:** The samples in each of these folders fall under additional sub-categories, such as batch, streaming, and machine learning.