Building Spark .NET on Ubuntu 18.04

Open Issues:

Building through Visual Studio Code

Pre-requisites:

If you already have all the pre-requisites, skip to the build steps below.

Download and install .NET Core 3.1 SDK - installing the SDK will add the dotnet toolchain to your path.
Install OpenJDK 8
- You can use the following command:
```
sudo apt install openjdk-8-jdk
```
- Verify you are able to run java from your command-line
  Sample java -version output:
```
openjdk version "1.8.0_191"
OpenJDK Runtime Environment (build 1.8.0_191-8u191-b12-2ubuntu0.18.04.1-b12)
OpenJDK 64-Bit Server VM (build 25.191-b12, mixed mode)
```
- If you already have multiple OpenJDK versions installed and want to select OpenJDK 8, use the following command:
```
sudo update-alternatives --config java
```
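
If you script your setup, the Java check above can be automated. The following is a minimal sketch: the helper names `java_version_string` and `is_java8` are made up for illustration, and the parsing assumes the quoted version format shown in the sample output.

```shell
# Hypothetical helper: extract the quoted version string from `java -version`
# text supplied on stdin (format as in the sample output above).
java_version_string() {
  sed -n 's/.* version "\([^"]*\)".*/\1/p' | head -n 1
}

# Hypothetical helper: return 0 when the version string denotes Java 8,
# which reports itself as 1.8.x.
is_java8() {
  case "$1" in
    1.8.*) return 0 ;;
    *)     return 1 ;;
  esac
}

# java -version writes to stderr, so redirect it into the pipe.
if command -v java >/dev/null 2>&1; then
  version="$(java -version 2>&1 | java_version_string)"
  if is_java8 "$version"; then
    echo "Java 8 is active: $version"
  else
    echo "Active Java is '$version'; run: sudo update-alternatives --config java"
  fi
fi
```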

Install Apache Maven 3.6.3+

Run the following commands:

```
mkdir -p ~/bin/maven
cd ~/bin/maven
wget https://www-us.apache.org/dist/maven/maven-3/3.6.3/binaries/apache-maven-3.6.3-bin.tar.gz
tar -xvzf apache-maven-3.6.3-bin.tar.gz
ln -s apache-maven-3.6.3 current
export M2_HOME=~/bin/maven/current
export PATH=${M2_HOME}/bin:${PATH}
source ~/.bashrc
```

Note that these environment variables will be lost when you close your terminal. If you want the changes to be permanent, add the export lines to your ~/.bashrc file.
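
One possible way to make the Maven variables permanent without duplicating lines on repeated runs is sketched below. The helper names `append_once` and `persist_maven_env` are hypothetical, and the sketch assumes Maven was linked at ~/bin/maven/current as in the commands above.

```shell
# Hypothetical helper: append a line ($1) to a file ($2) unless that exact
# line is already present.
append_once() {
  touch "$2"
  grep -qxF -e "$1" "$2" || printf '%s\n' "$1" >> "$2"
}

# Hypothetical helper: add the two Maven export lines to the given startup file.
# Single quotes keep ${M2_HOME} and ${PATH} unexpanded until the file is read.
persist_maven_env() {
  append_once 'export M2_HOME=~/bin/maven/current' "$1"
  append_once 'export PATH=${M2_HOME}/bin:${PATH}' "$1"
}
```

Calling `persist_maven_env ~/.bashrc` once and then opening a new terminal should make mvn available in every session.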

Verify you are able to run mvn from your command-line

Sample mvn -version output:

```
Apache Maven 3.6.3 (cecedd343002696d0abb50b32b541b8a6ba2883f)
Maven home: ~/bin/apache-maven-3.6.3
Java version: 1.8.0_242, vendor: Oracle Corporation, runtime: /usr/lib/jvm/java-8-openjdk-amd64/jre
Default locale: en_US, platform encoding: ANSI_X3.4-1968
OS name: "linux", version: "4.4.0-142-generic", arch: "amd64", family: "unix"
```

Install Apache Spark 2.3+

Download Apache Spark 2.3+ and extract it into a local folder (e.g., ~/bin/spark-2.3.2-bin-hadoop2.7)
Set the SPARK_HOME environment variable to that folder (e.g., ~/bin/spark-2.3.2-bin-hadoop2.7) and add its bin directory to your PATH:
```
export SPARK_HOME=~/bin/spark-2.3.2-bin-hadoop2.7
export PATH="$SPARK_HOME/bin:$PATH"
source ~/.bashrc
```
Note that these environment variables will be lost when you close your terminal. If you want the changes to be permanent, add the export lines to your ~/.bashrc file.
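
A mistyped folder name in SPARK_HOME is easy to miss until spark-shell fails to start. The sketch below checks that the variable points at something that looks like a Spark installation. The helper name `check_spark_home` is hypothetical, and the check only assumes that a Spark distribution contains an executable bin/spark-submit.

```shell
# Hypothetical helper: succeed only if the given directory looks like a
# Spark installation (it contains an executable bin/spark-submit).
check_spark_home() {
  if [ -z "$1" ]; then
    echo "SPARK_HOME is not set" >&2
    return 1
  fi
  if [ ! -x "$1/bin/spark-submit" ]; then
    echo "no executable bin/spark-submit under $1" >&2
    return 1
  fi
  echo "SPARK_HOME looks valid: $1"
}

# Check the current value; ${SPARK_HOME:-} avoids an error when it is unset.
check_spark_home "${SPARK_HOME:-}" || echo "fix SPARK_HOME before continuing"
```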

Verify you are able to run spark-shell from your command-line

Sample console output:

```
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.2
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_201)
Type in expressions to have them evaluated.
Type :help for more information.

scala> sc
res0: org.apache.spark.SparkContext = org.apache.spark.SparkContext@6eaa6b0c
```

Please make sure you are able to run dotnet, java, mvn, spark-shell from your command-line before you move to the next section. Feel there is a better way? Please open an issue and feel free to contribute.
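
A quick way to confirm all four tools are reachable in one pass is sketched below. The function name `check_tools` is made up for illustration; it relies only on the standard `command -v` lookup.

```shell
# Hypothetical helper: report which of the given commands are on the PATH.
# The return code is the number of missing commands (0 means all found).
check_tools() {
  missing=0
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "found:   $tool -> $(command -v "$tool")"
    else
      echo "missing: $tool"
      missing=$((missing + 1))
    fi
  done
  return "$missing"
}

# Check the four pre-requisites named above.
check_tools dotnet java mvn spark-shell || echo "install the missing tools before building"
```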

Building

The rest of this section assumes that you have cloned the Spark .NET repo onto your machine (e.g., into ~/dotnet.spark/):

```
git clone https://github.com/dotnet/spark.git ~/dotnet.spark
```

Building Spark .NET Scala Extensions Layer

When you submit a .NET application, Spark .NET relies on logic written in Scala that informs Apache Spark how to handle your requests (e.g., a request to create a new Spark Session, or a request to transfer data from the .NET side to the JVM side). This logic can be found in the Spark .NET Scala Source Code.

Let us now build the Spark .NET Scala extension layer. This is easy to do:

```
cd ~/dotnet.spark/src/scala
mvn clean package
```

You should see JARs created for the supported Spark versions:

```
microsoft-spark-2-3/target/microsoft-spark-2-3_2.11-<version>.jar
microsoft-spark-2-4/target/microsoft-spark-2-4_2.11-<version>.jar
microsoft-spark-3-0/target/microsoft-spark-3-0_2.12-<version>.jar
```
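
Because the JAR name embeds both the Scala version and the build version, it can be convenient to look the file up rather than type it. The following is a rough sketch: the function name `find_microsoft_spark_jar` is hypothetical, and it assumes the directory layout listed above and simply returns the first match.

```shell
# Hypothetical helper: print the path of the built JAR for one Spark version.
#   $1: the Scala source directory, e.g. ~/dotnet.spark/src/scala
#   $2: the Spark version as used in the folder names, e.g. 2-3, 2-4 or 3-0
find_microsoft_spark_jar() {
  for jar in "$1"/microsoft-spark-"$2"/target/microsoft-spark-"$2"_*.jar; do
    # An unmatched glob stays literal, so confirm the file really exists.
    if [ -f "$jar" ]; then
      echo "$jar"
      return 0
    fi
  done
  echo "no JAR found for Spark $2 under $1" >&2
  return 1
}
```

For example, `find_microsoft_spark_jar ~/dotnet.spark/src/scala 2-4` should print the path of the Spark 2.4 JAR once the Maven build has finished.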

Building .NET Sample Applications using .NET Core CLI

Build the Worker

```
cd ~/dotnet.spark/src/csharp/Microsoft.Spark.Worker/
dotnet publish -f netcoreapp3.1 -r linux-x64
```

Sample console output:

```
user@machine:/home/user/dotnet.spark/src/csharp/Microsoft.Spark.Worker$ dotnet publish -f netcoreapp3.1 -r linux-x64
Microsoft (R) Build Engine version 16.0.462+g62fb89029d for .NET Core
Copyright (C) Microsoft Corporation. All rights reserved.

  Restore completed in 36.03 ms for /home/user/dotnet.spark/src/csharp/Microsoft.Spark.Worker/Microsoft.Spark.Worker.csproj.
  Restore completed in 35.94 ms for /home/user/dotnet.spark/src/csharp/Microsoft.Spark/Microsoft.Spark.csproj.
  Microsoft.Spark -> /home/user/dotnet.spark/artifacts/bin/Microsoft.Spark/Debug/netstandard2.0/Microsoft.Spark.dll
  Microsoft.Spark.Worker -> /home/user/dotnet.spark/artifacts/bin/Microsoft.Spark.Worker/Debug/netcoreapp3.1/linux-x64/Microsoft.Spark.Worker.dll
  Microsoft.Spark.Worker -> /home/user/dotnet.spark/artifacts/bin/Microsoft.Spark.Worker/Debug/netcoreapp3.1/linux-x64/publish/
```

Build the Samples

```
cd ~/dotnet.spark/examples/Microsoft.Spark.CSharp.Examples/
dotnet publish -f netcoreapp3.1 -r linux-x64
```

Sample console output:

```
user@machine:/home/user/dotnet.spark/examples/Microsoft.Spark.CSharp.Examples$ dotnet publish -f netcoreapp3.1 -r linux-x64
Microsoft (R) Build Engine version 16.0.462+g62fb89029d for .NET Core
Copyright (C) Microsoft Corporation. All rights reserved.

  Restore completed in 37.11 ms for /home/user/dotnet.spark/src/csharp/Microsoft.Spark/Microsoft.Spark.csproj.
  Restore completed in 281.63 ms for /home/user/dotnet.spark/examples/Microsoft.Spark.CSharp.Examples/Microsoft.Spark.CSharp.Examples.csproj.
  Microsoft.Spark -> /home/user/dotnet.spark/artifacts/bin/Microsoft.Spark/Debug/netstandard2.0/Microsoft.Spark.dll
  Microsoft.Spark.CSharp.Examples -> /home/user/dotnet.spark/artifacts/bin/Microsoft.Spark.CSharp.Examples/Debug/netcoreapp3.1/linux-x64/Microsoft.Spark.CSharp.Examples.dll
  Microsoft.Spark.CSharp.Examples -> /home/user/dotnet.spark/artifacts/bin/Microsoft.Spark.CSharp.Examples/Debug/netcoreapp3.1/linux-x64/publish/
```

Run Samples

Once you build the samples, you can use spark-submit to submit your .NET Core apps. Make sure you have followed the pre-requisites section and installed Apache Spark.

Set the DOTNET_WORKER_DIR or PATH environment variable to include the path where the Microsoft.Spark.Worker binary has been generated (e.g., ~/dotnet.spark/artifacts/bin/Microsoft.Spark.Worker/Debug/netcoreapp3.1/linux-x64/publish)
Open a terminal and go to the directory where your app binary has been generated (e.g., ~/dotnet.spark/artifacts/bin/Microsoft.Spark.CSharp.Examples/Debug/netcoreapp3.1/linux-x64/publish)
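
As a sketch of the first step, the worker location can be exported and sanity-checked as follows. The path is the example path from the worker build above and is an assumption about where you cloned and built the repo; adjust it if yours differs.

```shell
# Example path from the worker publish step; adjust if you built elsewhere.
export DOTNET_WORKER_DIR="$HOME/dotnet.spark/artifacts/bin/Microsoft.Spark.Worker/Debug/netcoreapp3.1/linux-x64/publish"

# Warn early if the worker binary is missing from that directory.
if [ -e "$DOTNET_WORKER_DIR/Microsoft.Spark.Worker" ]; then
  echo "worker found in $DOTNET_WORKER_DIR"
else
  echo "worker not found in $DOTNET_WORKER_DIR; build the worker first" >&2
fi
```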

Running your app follows the basic structure:

```
spark-submit \
  [--jars <any-jars-your-app-is-dependent-on>] \
  --class org.apache.spark.deploy.dotnet.DotnetRunner \
  --master local \
  <path-to-microsoft-spark-jar> \
  <path-to-your-app-binary> <argument(s)-to-your-app>
```
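
Since the class and master arguments are the same for every local run, the structure above can be wrapped in a small function. This is only an illustrative sketch: the function name `run_dotnet_app` and the `SPARK_SUBMIT` override variable are hypothetical and not part of Spark .NET. It does not cover the optional --jars or --packages arguments.

```shell
# Hypothetical wrapper around the basic spark-submit structure.
#   $1: path to the microsoft-spark JAR
#   $2: path to your app binary
#   remaining arguments: passed through to your app
# SPARK_SUBMIT (an assumption of this sketch) can be set to another command,
# such as echo, to preview the call instead of running it.
run_dotnet_app() {
  jar="$1"
  app="$2"
  shift 2
  "${SPARK_SUBMIT:-spark-submit}" \
    --class org.apache.spark.deploy.dotnet.DotnetRunner \
    --master local \
    "$jar" "$app" "$@"
}
```

Running `SPARK_SUBMIT=echo run_dotnet_app <path-to-microsoft-spark-jar> ./Microsoft.Spark.CSharp.Examples Sql.Batch.Basic people.json` prints the arguments that would be passed to spark-submit, which is a cheap way to check paths before a real run.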

Here are some examples you can run:

Microsoft.Spark.Examples.Sql.Batch.Basic

```
spark-submit \
--class org.apache.spark.deploy.dotnet.DotnetRunner \
--master local \
~/dotnet.spark/src/scala/microsoft-spark-<version>/target/microsoft-spark-<version>.jar \
./Microsoft.Spark.CSharp.Examples Sql.Batch.Basic $SPARK_HOME/examples/src/main/resources/people.json
```

Microsoft.Spark.Examples.Sql.Streaming.StructuredNetworkWordCount

```
spark-submit \
--class org.apache.spark.deploy.dotnet.DotnetRunner \
--master local \
~/dotnet.spark/src/scala/microsoft-spark-<version>/target/microsoft-spark-<version>.jar \
./Microsoft.Spark.CSharp.Examples Sql.Streaming.StructuredNetworkWordCount localhost 9999
```

Microsoft.Spark.Examples.Sql.Streaming.StructuredKafkaWordCount (maven accessible)

```
spark-submit \
--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.2 \
--class org.apache.spark.deploy.dotnet.DotnetRunner \
--master local \
~/dotnet.spark/src/scala/microsoft-spark-<version>/target/microsoft-spark-<version>.jar \
./Microsoft.Spark.CSharp.Examples Sql.Streaming.StructuredKafkaWordCount localhost:9092 subscribe test
```

Microsoft.Spark.Examples.Sql.Streaming.StructuredKafkaWordCount (jars provided)

```
spark-submit \
--jars path/to/net.jpountz.lz4/lz4-1.3.0.jar,path/to/org.apache.kafka/kafka-clients-0.10.0.1.jar,path/to/org.apache.spark/spark-sql-kafka-0-10_2.11-2.3.2.jar,path/to/org.slf4j/slf4j-api-1.7.6.jar,path/to/org.spark-project.spark/unused-1.0.0.jar,path/to/org.xerial.snappy/snappy-java-1.1.2.6.jar \
--class org.apache.spark.deploy.dotnet.DotnetRunner \
--master local \
~/dotnet.spark/src/scala/microsoft-spark-<version>/target/microsoft-spark-<version>.jar \
./Microsoft.Spark.CSharp.Examples Sql.Streaming.StructuredKafkaWordCount localhost:9092 subscribe test
```

Feel this experience is complicated? Help us by taking up Simplify User Experience for Running an App
