
# Developer Guide

## Table of Contents

- [How to Do Local Debugging](#how-to-do-local-debugging)
- [How to Support New Spark Releases](#how-to-support-new-spark-releases)

## How to Do Local Debugging

### Debugging Spark .NET Application

Open a new command prompt window and run the following:

```bash
spark-submit \
  --class org.apache.spark.deploy.dotnet.DotnetRunner \
  --master local \
  <path-to-microsoft-spark-jar> \
  debug
```

and you will see the following output:

```
***********************************************************************
* .NET Backend running debug mode. Press enter to exit *
***********************************************************************
```

In this debug mode, `DotnetRunner` does not launch the .NET application; instead, it waits for the application to connect to it. Leave this command prompt window open.

Now you can start your .NET application with a C# debugger (Visual Studio Debugger for Windows/macOS, or the C# Debugger Extension in Visual Studio Code) to debug your application.
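
For reference, a minimal application to run against the waiting backend might look like the following sketch. It assumes the Microsoft.Spark NuGet package is referenced; the namespace and app name are placeholders.

```csharp
using Microsoft.Spark.Sql;

namespace DebugApp // hypothetical project name
{
    class Program
    {
        static void Main(string[] args)
        {
            // Connects to the DotnetRunner backend started in debug mode above.
            SparkSession spark = SparkSession
                .Builder()
                .AppName("debug-app")
                .GetOrCreate();

            // Any simple action works for stepping through with the debugger.
            DataFrame df = spark.Range(0, 10);
            df.Show();

            spark.Stop();
        }
    }
}
```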

### Debugging User Defined Function (UDF)

Note that this is currently supported only on Windows with Visual Studio Debugger.

Before running spark-submit, set the following environment variable:

```bat
set DOTNET_WORKER_DEBUG=1
```

Now, when you run your Spark application, a Choose Just-In-Time Debugger window will pop up. Choose a Visual Studio debugger.

The debugger will break at the following location in TaskRunner.cs:

```csharp
if (EnvironmentUtils.GetEnvironmentVariableAsBool("DOTNET_WORKER_DEBUG"))
{
    Debugger.Launch(); // <-- The debugger will break here.
}
```

Now, navigate to the .cs file that contains the UDF that you plan to debug and set a breakpoint. (The breakpoint will show the warning *The breakpoint will not currently be hit* because the worker hasn't loaded the assembly that contains the UDF yet.)
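
As an illustrative sketch (the UDF and its logic here are made up), the breakpoint goes inside the UDF body:

```csharp
using System;
using Microsoft.Spark.Sql;
using static Microsoft.Spark.Sql.Functions;

// 'spark' is the SparkSession created by the application (see the earlier sketch).

// Hypothetical UDF used only for illustration.
Func<Column, Column> addOne = Udf<int, int>(x =>
{
    return x + 1; // <-- Set the breakpoint here; it is hit once the worker loads this assembly.
});

// Applying the UDF makes the worker load the assembly and execute the lambda.
DataFrame df = spark.Range(0, 5);
df.Select(addOne(df["id"])).Show();
```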

Hit F5 to continue your application and the breakpoint will eventually be hit.

Note that the Choose Just-In-Time Debugger window will pop up for each task. Therefore, make sure to set the number of tasks to a low number. For example, you can use the `--master local[1]` option for `spark-submit` to set the number of tasks to 1, and hence launch a single debugger instance.
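
Following the same pattern as the submission commands above, such a run might look like this (paths and arguments are placeholders):

```bash
spark-submit \
  --class org.apache.spark.deploy.dotnet.DotnetRunner \
  --master local[1] \
  <path-to-microsoft-spark-jar> \
  <path-to-your-app-exe> <argument(s)-to-your-app>
```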

### Debugging Scala code

If you need to debug the Scala-side code (`DotnetRunner`, `DotnetBackendHandler`, etc.), you can use the following command and attach a debugger to the running process using IntelliJ:

```bash
spark-submit \
  --driver-java-options -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005 \
  --class org.apache.spark.deploy.dotnet.DotnetRunner \
  --master local \
  <path-to-microsoft-spark-jar> \
  <path-to-your-app-exe> <argument(s)-to-your-app>
```

## How to Support New Spark Releases

We encourage developers to first read Apache Spark's Versioning Policy and Semantic Versioning to get the most out of the instructions below.

At a high level, Spark's versions take the form [MAJOR].[FEATURE].[MAINTENANCE]. We cover the upgrade path for each type of version separately below, in increasing order of effort required.

### [MAINTENANCE]: Upgrading for a Patch Release Version

Since Apache Spark's [MAINTENANCE] releases involve only internal changes (e.g., bug fixes), it is straightforward to upgrade the code base to support a [MAINTENANCE] release. The steps are below:

1. In the corresponding `pom.xml`, update the `spark.version` value to the newly released version.
2. Update `DotnetRunner.supportedSparkVersions` to include the newly released version.
3. Update the `azure-pipelines.yml` to include E2E testing for the newly released version.

Refer to this commit for an example.

### [FEATURE]: Upgrading for a Minor Release Version

WIP

### [MAJOR]: Upgrading for a Major Release Version

WIP