If you already have all the pre-requisites, skip to the build steps below.
- Download and install .NET Core 3.1 SDK. Installing the SDK will add the dotnet toolchain to your path.
- Install OpenJDK 8
  - You can use the following command:
    sudo apt install openjdk-8-jdk
  - Verify you are able to run java from your command line. Sample java -version output:
    openjdk version "1.8.0_191"
    OpenJDK Runtime Environment (build 1.8.0_191-8u191-b12-2ubuntu0.18.04.1-b12)
    OpenJDK 64-Bit Server VM (build 25.191-b12, mixed mode)
  - If you already have multiple OpenJDK versions installed and want to select OpenJDK 8, use the following command:
    sudo update-alternatives --config java
- Install Apache Maven 3.6.3+
  - Run the following commands:
    mkdir -p ~/bin/maven
    cd ~/bin/maven
    wget https://www-us.apache.org/dist/maven/maven-3/3.6.3/binaries/apache-maven-3.6.3-bin.tar.gz
    tar -xvzf apache-maven-3.6.3-bin.tar.gz
    ln -s apache-maven-3.6.3 current
    export M2_HOME=~/bin/maven/current
    export PATH=${M2_HOME}/bin:${PATH}
    source ~/.bashrc
    Note that these environment variables will be lost when you close your terminal. If you want the changes to be permanent, add the export lines to your ~/.bashrc file.
  - Verify you are able to run mvn from your command line. Sample mvn -version output:
    Apache Maven 3.6.3 (cecedd343002696d0abb50b32b541b8a6ba2883f)
    Maven home: ~/bin/apache-maven-3.6.3
    Java version: 1.8.0_242, vendor: Oracle Corporation, runtime: /usr/lib/jvm/java-8-openjdk-amd64/jre
    Default locale: en_US, platform encoding: ANSI_X3.4-1968
    OS name: "linux", version: "4.4.0-142-generic", arch: "amd64", family: "unix"
- Install Apache Spark 2.3+
  - Download Apache Spark 2.3+ and extract it into a local folder (e.g., ~/bin/spark-2.3.2-bin-hadoop2.7)
  - Add the necessary environment variable SPARK_HOME (e.g., ~/bin/spark-2.3.2-bin-hadoop2.7/):
    export SPARK_HOME=~/bin/spark-2.3.2-bin-hadoop2.7
    export PATH="$SPARK_HOME/bin:$PATH"
    source ~/.bashrc
    Note that these environment variables will be lost when you close your terminal. If you want the changes to be permanent, add the export lines to your ~/.bashrc file.
  - Verify you are able to run spark-shell from your command line. Sample console output:
    Welcome to
          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/  '_/
       /___/ .__/\_,_/_/ /_/\_\   version 2.3.2
          /_/
    Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_201)
    Type in expressions to have them evaluated.
    Type :help for more information.
    scala> sc
    res0: org.apache.spark.SparkContext = org.apache.spark.SparkContext@6eaa6b0c
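The notes above about adding the export lines to ~/.bashrc can be sketched as a small helper. This is a minimal sketch and not part of Spark .NET: the function name persist_line is hypothetical, and it assumes a POSIX shell with grep available. It appends a line only when an identical line is not already in the file, so running it twice does not create duplicates.

```shell
# Hypothetical helper (not part of Spark .NET): append a line to a shell
# profile only if an identical line is not already in the file.
persist_line() {
    line=$1
    file=${2:-"$HOME/.bashrc"}   # target file, defaults to ~/.bashrc
    touch "$file"
    # -x: match the whole line, -F: treat the text literally, -q: no output
    grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Example calls (commented out so that nothing is modified by accident):
# persist_line 'export M2_HOME=~/bin/maven/current'
# persist_line 'export PATH=${M2_HOME}/bin:${PATH}'
# persist_line 'export SPARK_HOME=~/bin/spark-2.3.2-bin-hadoop2.7'
# persist_line 'export PATH="$SPARK_HOME/bin:$PATH"'
```

The single quotes keep the variables unexpanded, so the profile stores the export lines exactly as typed and they are evaluated each time a new terminal starts.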
Please make sure you are able to run dotnet, java, mvn, and spark-shell from your command line before you move to the next section. Feel there is a better way? Please open an issue and feel free to contribute.
For the rest of the section, it is assumed that you have cloned the Spark .NET repo onto your machine (e.g., ~/dotnet.spark/):
git clone https://github.com/dotnet/spark.git ~/dotnet.spark
When you submit a .NET application, Spark .NET has the necessary logic written in Scala that informs Apache Spark how to handle your requests (e.g., a request to create a new Spark Session, a request to transfer data from the .NET side to the JVM side, etc.). This logic can be found in the Spark .NET Scala Source Code.
Let us now build the Spark .NET Scala extension layer. This is easy to do:
cd src/scala
mvn clean package
You should see JARs created for the supported Spark versions:
microsoft-spark-2-3/target/microsoft-spark-2-3_2.11-<version>.jar
microsoft-spark-2-4/target/microsoft-spark-2-4_2.11-<version>.jar
microsoft-spark-3-0/target/microsoft-spark-3-0_2.12-<version>.jar
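The JAR you pass to spark-submit later should match your installed Spark version. As a rough sketch, the naming scheme in the list above can be turned into a lookup. The function name spark_jar_glob is hypothetical, and the mapping covers only the three Spark lines listed above; the trailing `*` stands in for `<version>`, which is not filled in here.

```shell
# Hypothetical helper: print a glob, relative to src/scala, for the
# microsoft-spark JAR that matches a given Spark version.
spark_jar_glob() {
    case $1 in
        2.3|2.3.*) printf '%s\n' 'microsoft-spark-2-3/target/microsoft-spark-2-3_2.11-*.jar' ;;
        2.4|2.4.*) printf '%s\n' 'microsoft-spark-2-4/target/microsoft-spark-2-4_2.11-*.jar' ;;
        3.0|3.0.*) printf '%s\n' 'microsoft-spark-3-0/target/microsoft-spark-3-0_2.12-*.jar' ;;
        *)
            printf 'unsupported Spark version: %s\n' "$1" >&2
            return 1
            ;;
    esac
}

spark_jar_glob 2.3.2
# prints: microsoft-spark-2-3/target/microsoft-spark-2-3_2.11-*.jar
```

From src/scala, something like `ls $(spark_jar_glob 2.3.2)` would then list the matching JAR once the build has finished.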
- Build the Worker
  cd ~/dotnet.spark/src/csharp/Microsoft.Spark.Worker/
  dotnet publish -f netcoreapp3.1 -r linux-x64
  Sample console output:
  user@machine:/home/user/dotnet.spark/src/csharp/Microsoft.Spark.Worker$ dotnet publish -f netcoreapp3.1 -r linux-x64
  Microsoft (R) Build Engine version 16.0.462+g62fb89029d for .NET Core
  Copyright (C) Microsoft Corporation. All rights reserved.
  Restore completed in 36.03 ms for /home/user/dotnet.spark/src/csharp/Microsoft.Spark.Worker/Microsoft.Spark.Worker.csproj.
  Restore completed in 35.94 ms for /home/user/dotnet.spark/src/csharp/Microsoft.Spark/Microsoft.Spark.csproj.
  Microsoft.Spark -> /home/user/dotnet.spark/artifacts/bin/Microsoft.Spark/Debug/netstandard2.0/Microsoft.Spark.dll
  Microsoft.Spark.Worker -> /home/user/dotnet.spark/artifacts/bin/Microsoft.Spark.Worker/Debug/netcoreapp3.1/linux-x64/Microsoft.Spark.Worker.dll
  Microsoft.Spark.Worker -> /home/user/dotnet.spark/artifacts/bin/Microsoft.Spark.Worker/Debug/netcoreapp3.1/linux-x64/publish/
- Build the Samples
  cd ~/dotnet.spark/examples/Microsoft.Spark.CSharp.Examples/
  dotnet publish -f netcoreapp3.1 -r linux-x64
  Sample console output:
  user@machine:/home/user/dotnet.spark/examples/Microsoft.Spark.CSharp.Examples$ dotnet publish -f netcoreapp3.1 -r linux-x64
  Microsoft (R) Build Engine version 16.0.462+g62fb89029d for .NET Core
  Copyright (C) Microsoft Corporation. All rights reserved.
  Restore completed in 37.11 ms for /home/user/dotnet.spark/src/csharp/Microsoft.Spark/Microsoft.Spark.csproj.
  Restore completed in 281.63 ms for /home/user/dotnet.spark/examples/Microsoft.Spark.CSharp.Examples/Microsoft.Spark.CSharp.Examples.csproj.
  Microsoft.Spark -> /home/user/dotnet.spark/artifacts/bin/Microsoft.Spark/Debug/netstandard2.0/Microsoft.Spark.dll
  Microsoft.Spark.CSharp.Examples -> /home/user/dotnet.spark/artifacts/bin/Microsoft.Spark.CSharp.Examples/Debug/netcoreapp3.1/linux-x64/Microsoft.Spark.CSharp.Examples.dll
  Microsoft.Spark.CSharp.Examples -> /home/user/dotnet.spark/artifacts/bin/Microsoft.Spark.CSharp.Examples/Debug/netcoreapp3.1/linux-x64/publish/
Once you build the samples, you can use spark-submit to submit your .NET Core apps. Make sure you have followed the pre-requisites section and installed Apache Spark.
- Set the DOTNET_WORKER_DIR or PATH environment variable to include the path where the Microsoft.Spark.Worker binary has been generated (e.g., ~/dotnet.spark/artifacts/bin/Microsoft.Spark.Worker/Debug/netcoreapp3.1/linux-x64/publish)
- Open a terminal and go to the directory where your app binary has been generated (e.g., ~/dotnet.spark/artifacts/bin/Microsoft.Spark.CSharp.Examples/Debug/netcoreapp3.1/linux-x64/publish)
- Running your app follows the basic structure:
spark-submit \
  [--jars <any-jars-your-app-is-dependent-on>] \
  --class org.apache.spark.deploy.dotnet.DotnetRunner \
  --master local \
  <path-to-microsoft-spark-jar> \
  <path-to-your-app-binary> <argument(s)-to-your-app>
Here are some examples you can run:
- Microsoft.Spark.Examples.Sql.Batch.Basic
spark-submit \
  --class org.apache.spark.deploy.dotnet.DotnetRunner \
  --master local \
  ~/dotnet.spark/src/scala/microsoft-spark-<version>/target/microsoft-spark-<version>.jar \
  ./Microsoft.Spark.CSharp.Examples Sql.Batch.Basic $SPARK_HOME/examples/src/main/resources/people.json
- Microsoft.Spark.Examples.Sql.Streaming.StructuredNetworkWordCount
spark-submit \
  --class org.apache.spark.deploy.dotnet.DotnetRunner \
  --master local \
  ~/dotnet.spark/src/scala/microsoft-spark-<version>/target/microsoft-spark-<version>.jar \
  ./Microsoft.Spark.CSharp.Examples Sql.Streaming.StructuredNetworkWordCount localhost 9999
- Microsoft.Spark.Examples.Sql.Streaming.StructuredKafkaWordCount (maven accessible)
spark-submit \
  --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.2 \
  --class org.apache.spark.deploy.dotnet.DotnetRunner \
  --master local \
  ~/dotnet.spark/src/scala/microsoft-spark-<version>/target/microsoft-spark-<version>.jar \
  ./Microsoft.Spark.CSharp.Examples Sql.Streaming.StructuredKafkaWordCount localhost:9092 subscribe test
- Microsoft.Spark.Examples.Sql.Streaming.StructuredKafkaWordCount (jars provided)
spark-submit \
  --jars path/to/net.jpountz.lz4/lz4-1.3.0.jar,path/to/org.apache.kafka/kafka-clients-0.10.0.1.jar,path/to/org.apache.spark/spark-sql-kafka-0-10_2.11-2.3.2.jar,path/to/org.slf4j/slf4j-api-1.7.6.jar,path/to/org.spark-project.spark/unused-1.0.0.jar,path/to/org.xerial.snappy/snappy-java-1.1.2.6.jar \
  --class org.apache.spark.deploy.dotnet.DotnetRunner \
  --master local \
  ~/dotnet.spark/src/scala/microsoft-spark-<version>/target/microsoft-spark-<version>.jar \
  ./Microsoft.Spark.CSharp.Examples Sql.Streaming.StructuredKafkaWordCount localhost:9092 subscribe test
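The examples above all share the same fixed part (the DotnetRunner class and a local master), so the basic structure can be sketched as a small dry-run helper. This is an illustration under stated assumptions, not a Spark .NET tool: the function name print_submit_cmd is hypothetical, and the DOTNET_WORKER_DIR value is the example publish path from the steps above. The helper only prints the command so you can review it before running it yourself.

```shell
# Point Spark .NET at the published worker (example path from the steps above).
export DOTNET_WORKER_DIR="$HOME/dotnet.spark/artifacts/bin/Microsoft.Spark.Worker/Debug/netcoreapp3.1/linux-x64/publish"

# Hypothetical helper: print (do not run) a spark-submit command line.
# Usage: print_submit_cmd <microsoft-spark-jar> <app-binary> [app args...]
print_submit_cmd() {
    if [ "$#" -lt 2 ]; then
        echo "usage: print_submit_cmd <jar> <app-binary> [args...]" >&2
        return 1
    fi
    jar=$1
    app=$2
    shift 2
    # Fixed part shared by every example, followed by the JAR and the app.
    printf 'spark-submit --class org.apache.spark.deploy.dotnet.DotnetRunner --master local %s %s' "$jar" "$app"
    # Remaining arguments are passed through to the .NET app.
    for arg in "$@"; do
        printf ' %s' "$arg"
    done
    printf '\n'
}

print_submit_cmd '<path-to-microsoft-spark-jar>' ./Microsoft.Spark.CSharp.Examples \
    Sql.Streaming.StructuredNetworkWordCount localhost 9999
```

The printed line is meant for reading, not for piping back into a shell: arguments containing spaces are not quoted, and options such as --jars or --packages would still need to be added by hand.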
Feel this experience is complicated? Help us by taking up Simplify User Experience for Running an App