[SPARK-47152][SQL][BUILD] Provide CodeHaus Jackson dependencies via a new optional directory

### What changes were proposed in this pull request?

This PR aims to provide `Apache Hive`'s `CodeHaus Jackson` dependencies via a new optional directory, `hive-jackson`, instead of the standard `jars` directory of the Apache Spark binary distribution. Additionally, two internal configurations are added whose default value is `hive-jackson/*`; a usage sketch follows the list below.

  - `spark.driver.defaultExtraClassPath`
  - `spark.executor.defaultExtraClassPath`
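
For illustration, these defaults can be overridden or cleared at submit time like any other Spark configuration. A minimal sketch, assuming a typical `spark-submit` invocation; the trailing `...` stands for the usual application jar and arguments:
```
# Hypothetical: drop the Hive Jackson jars from both driver and executors.
$ bin/spark-submit \
    --driver-default-class-path "" \
    --conf spark.executor.defaultExtraClassPath="" \
    ...
```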

This follows existing practice: Apache Spark distributions have long provided the `spark-*-yarn-shuffle.jar` file under the `yarn` directory instead of `jars`.

**YARN SHUFFLE EXAMPLE**
```
$ ls -al yarn/*jar
-rw-r--r--  1 dongjoon  staff  77352048 Sep  8 19:08 yarn/spark-3.5.0-yarn-shuffle.jar
```

This PR changes `Apache Hive`'s `CodeHaus Jackson` dependencies in a similar way.

**BEFORE**
```
$ ls -al jars/*asl*
-rw-r--r--  1 dongjoon  staff  232248 Sep  8 19:08 jars/jackson-core-asl-1.9.13.jar
-rw-r--r--  1 dongjoon  staff  780664 Sep  8 19:08 jars/jackson-mapper-asl-1.9.13.jar
```

**AFTER**
```
$ ls -al jars/*asl*
zsh: no matches found: jars/*asl*

$ ls -al hive-jackson
total 1984
drwxr-xr-x   4 dongjoon  staff     128 Feb 23 15:37 .
drwxr-xr-x  16 dongjoon  staff     512 Feb 23 16:34 ..
-rw-r--r--   1 dongjoon  staff  232248 Feb 23 15:37 jackson-core-asl-1.9.13.jar
-rw-r--r--   1 dongjoon  staff  780664 Feb 23 15:37 jackson-mapper-asl-1.9.13.jar
```

### Why are the changes needed?

Since Apache Hadoop 3.3.5, only Apache Hive requires the old CodeHaus Jackson dependencies.

Apache Spark 3.5.0 tried to eliminate them completely, but the change was reverted to keep Hive UDF support.

  - apache#40893
  - apache#42446

SPARK-47119 added a way to exclude Apache Hive Jackson dependencies at the distribution building stage for Apache Spark 4.0.0.

  - apache#45201

This PR provides a way to exclude the Apache Hive Jackson dependencies at runtime for Apache Spark 4.0.0; see the examples below and the combination sketch after them.

- Spark Shell without Apache Hive Jackson dependencies.
```
$ bin/spark-shell --driver-default-class-path ""
```

- Spark SQL Shell without Apache Hive Jackson dependencies.
```
$ bin/spark-sql --driver-default-class-path ""
```

- Spark Thrift Server without Apache Hive Jackson dependencies.
```
$ sbin/start-thriftserver.sh --driver-default-class-path ""
```
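
The new `--driver-default-class-path` option composes with, rather than replaces, the user-facing `--driver-class-path`. A hedged sketch (the `/opt/extra/*` path is a made-up placeholder, not part of this PR):
```
# The user-supplied class path is kept, and the default ("hive-jackson/*"
# unless overridden) is appended after it by the launcher.
$ bin/spark-shell \
    --driver-class-path "/opt/extra/*" \
    --driver-default-class-path "hive-jackson/*"
```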

In addition, last but not least, this PR eliminates the `CodeHaus Jackson` dependencies from the following Apache Spark daemons (launched via `spark-daemon.sh start`), because they don't require Hive's `CodeHaus Jackson` dependencies. A hedged verification sketch follows the `grep` output below.

- Spark Master
- Spark Worker
- Spark History Server

```
$ grep 'spark-daemon.sh start' *
start-history-server.sh:exec "${SPARK_HOME}/sbin"/spark-daemon.sh start $CLASS 1 "$@"
start-master.sh:"${SPARK_HOME}/sbin"/spark-daemon.sh start $CLASS 1 \
start-worker.sh:  "${SPARK_HOME}/sbin"/spark-daemon.sh start $CLASS $WORKER_NUM \
```
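
One hypothetical way to spot-check this on a running daemon, assuming `jcmd` from the JDK is available and a single Master process runs on the host:
```
# The Master JVM's command line should contain no hive-jackson entry.
$ jcmd $(pgrep -f org.apache.spark.deploy.master.Master) VM.command_line | grep hive-jackson || echo "not found"
not found
```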

### Does this PR introduce _any_ user-facing change?

No. There is no user-facing change by default.

- For distributions built with the `hive-jackson-provided` profile, the `scope` of the Apache Hive Jackson dependencies is `provided`, and the `hive-jackson` directory is not created at all.
- For distributions built with the default settings, the `scope` of the Apache Hive Jackson dependencies is still `compile`. In addition, they appear on Apache Spark's built-in class path, as shown below.

![Screenshot 2024-02-23 at 16 48 08](https://github.com/apache/spark/assets/9700541/99ed0f02-2792-4666-ae19-ce4f4b7b8ff9)

- The following Spark daemons don't use the `CodeHaus Jackson` dependencies:
  - Spark Master
  - Spark Worker
  - Spark History Server

### How was this patch tested?

Pass the CIs, manually build a distribution, and check the class paths in the `Environment` tab. A complementary check with the `hive-jackson-provided` profile is sketched after the command below.

```
$ dev/make-distribution.sh -Phive,hive-thriftserver
```
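
In principle, the `hive-jackson-provided` profile from SPARK-47119 gives the complementary check. A hypothetical build in which no `hive-jackson` directory should be produced at all (assuming the default `dist/` output directory):
```
$ dev/make-distribution.sh -Phive,hive-thriftserver,hive-jackson-provided
# expected to fail: the directory is not created under the provided profile
$ ls dist/hive-jackson
```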

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#45237 from dongjoon-hyun/SPARK-47152.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
dongjoon-hyun committed Feb 24, 2024
1 parent ee312ec commit dd3f81c
Showing 6 changed files with 43 additions and 0 deletions.

17 changes: 17 additions & 0 deletions core/src/main/scala/org/apache/spark/internal/config/package.scala

```
@@ -17,6 +17,7 @@
 
 package org.apache.spark.internal
 
+import java.io.File
 import java.util.Locale
 import java.util.concurrent.TimeUnit
 
@@ -64,8 +65,16 @@ package object config {
       .stringConf
       .createOptional
 
+  private[spark] val DRIVER_DEFAULT_EXTRA_CLASS_PATH =
+    ConfigBuilder(SparkLauncher.DRIVER_DEFAULT_EXTRA_CLASS_PATH)
+      .internal()
+      .version("4.0.0")
+      .stringConf
+      .createWithDefault(SparkLauncher.DRIVER_DEFAULT_EXTRA_CLASS_PATH_VALUE)
+
   private[spark] val DRIVER_CLASS_PATH =
     ConfigBuilder(SparkLauncher.DRIVER_EXTRA_CLASSPATH)
+      .withPrepended(DRIVER_DEFAULT_EXTRA_CLASS_PATH.key, File.pathSeparator)
       .version("1.0.0")
       .stringConf
       .createOptional
@@ -254,8 +263,16 @@
   private[spark] val EXECUTOR_ID =
     ConfigBuilder("spark.executor.id").version("1.2.0").stringConf.createOptional
 
+  private[spark] val EXECUTOR_DEFAULT_EXTRA_CLASS_PATH =
+    ConfigBuilder(SparkLauncher.EXECUTOR_DEFAULT_EXTRA_CLASS_PATH)
+      .internal()
+      .version("4.0.0")
+      .stringConf
+      .createWithDefault(SparkLauncher.EXECUTOR_DEFAULT_EXTRA_CLASS_PATH_VALUE)
+
   private[spark] val EXECUTOR_CLASS_PATH =
     ConfigBuilder(SparkLauncher.EXECUTOR_EXTRA_CLASSPATH)
+      .withPrepended(EXECUTOR_DEFAULT_EXTRA_CLASS_PATH.key, File.pathSeparator)
       .version("1.0.0")
       .stringConf
       .createOptional
```

6 changes: 6 additions & 0 deletions dev/make-distribution.sh

```
@@ -189,6 +189,12 @@ echo "Build flags: $@" >> "$DISTDIR/RELEASE"
 # Copy jars
 cp "$SPARK_HOME"/assembly/target/scala*/jars/* "$DISTDIR/jars/"
 
+# Only create the hive-jackson directory if they exist.
+for f in "$DISTDIR"/jars/jackson-*-asl-*.jar; do
+  mkdir -p "$DISTDIR"/hive-jackson
+  mv $f "$DISTDIR"/hive-jackson/
+done
+
 # Only create the yarn directory if the yarn artifacts were built.
 if [ -f "$SPARK_HOME"/common/network-yarn/target/scala*/spark-*-yarn-shuffle.jar ]; then
   mkdir "$DISTDIR/yarn"
```

```
@@ -271,6 +271,8 @@ Map<String, String> getEffectiveConfig() throws IOException {
       Properties p = loadPropertiesFile();
       p.stringPropertyNames().forEach(key ->
         effectiveConfig.computeIfAbsent(key, p::getProperty));
+      effectiveConfig.putIfAbsent(SparkLauncher.DRIVER_DEFAULT_EXTRA_CLASS_PATH,
+        SparkLauncher.DRIVER_DEFAULT_EXTRA_CLASS_PATH_VALUE);
     }
     return effectiveConfig;
   }
```

```
@@ -54,6 +54,10 @@ public class SparkLauncher extends AbstractLauncher<SparkLauncher> {
 
   /** Configuration key for the driver memory. */
   public static final String DRIVER_MEMORY = "spark.driver.memory";
+  /** Configuration key for the driver default extra class path. */
+  public static final String DRIVER_DEFAULT_EXTRA_CLASS_PATH =
+    "spark.driver.defaultExtraClassPath";
+  public static final String DRIVER_DEFAULT_EXTRA_CLASS_PATH_VALUE = "hive-jackson/*";
   /** Configuration key for the driver class path. */
   public static final String DRIVER_EXTRA_CLASSPATH = "spark.driver.extraClassPath";
   /** Configuration key for the default driver VM options. */
@@ -65,6 +69,10 @@ public class SparkLauncher extends AbstractLauncher<SparkLauncher> {
 
   /** Configuration key for the executor memory. */
   public static final String EXECUTOR_MEMORY = "spark.executor.memory";
+  /** Configuration key for the executor default extra class path. */
+  public static final String EXECUTOR_DEFAULT_EXTRA_CLASS_PATH =
+    "spark.executor.defaultExtraClassPath";
+  public static final String EXECUTOR_DEFAULT_EXTRA_CLASS_PATH_VALUE = "hive-jackson/*";
   /** Configuration key for the executor class path. */
   public static final String EXECUTOR_EXTRA_CLASSPATH = "spark.executor.extraClassPath";
   /** Configuration key for the default executor VM options. */
```

```
@@ -267,6 +267,12 @@ private List<String> buildSparkSubmitCommand(Map<String, String> env)
     Map<String, String> config = getEffectiveConfig();
     boolean isClientMode = isClientMode(config);
     String extraClassPath = isClientMode ? config.get(SparkLauncher.DRIVER_EXTRA_CLASSPATH) : null;
+    String defaultExtraClassPath = config.get(SparkLauncher.DRIVER_DEFAULT_EXTRA_CLASS_PATH);
+    if (extraClassPath == null || extraClassPath.trim().isEmpty()) {
+      extraClassPath = defaultExtraClassPath;
+    } else {
+      extraClassPath += File.pathSeparator + defaultExtraClassPath;
+    }
 
     List<String> cmd = buildJavaCommand(extraClassPath);
     // Take Thrift/Connect Server as daemon
@@ -498,6 +504,8 @@ protected boolean handle(String opt, String value) {
       case DRIVER_MEMORY -> conf.put(SparkLauncher.DRIVER_MEMORY, value);
       case DRIVER_JAVA_OPTIONS -> conf.put(SparkLauncher.DRIVER_EXTRA_JAVA_OPTIONS, value);
       case DRIVER_LIBRARY_PATH -> conf.put(SparkLauncher.DRIVER_EXTRA_LIBRARY_PATH, value);
+      case DRIVER_DEFAULT_CLASS_PATH ->
+        conf.put(SparkLauncher.DRIVER_DEFAULT_EXTRA_CLASS_PATH, value);
       case DRIVER_CLASS_PATH -> conf.put(SparkLauncher.DRIVER_EXTRA_CLASSPATH, value);
       case CONF -> {
         checkArgument(value != null, "Missing argument to %s", CONF);
```
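
To summarize the merge logic above, a hedged trace of the resulting driver class path (POSIX `:` separator assumed; the `/opt/lib/*` value is a made-up example):
```
# --driver-class-path unset or blank  ->  hive-jackson/*
# --driver-class-path "/opt/lib/*"    ->  /opt/lib/*:hive-jackson/*
# --driver-default-class-path ""      ->  the Hive Jackson jars are dropped
```
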
```
@@ -40,6 +40,7 @@ class SparkSubmitOptionParser {
   protected final String CONF = "--conf";
   protected final String DEPLOY_MODE = "--deploy-mode";
   protected final String DRIVER_CLASS_PATH = "--driver-class-path";
+  protected final String DRIVER_DEFAULT_CLASS_PATH = "--driver-default-class-path";
   protected final String DRIVER_CORES = "--driver-cores";
   protected final String DRIVER_JAVA_OPTIONS = "--driver-java-options";
   protected final String DRIVER_LIBRARY_PATH = "--driver-library-path";
@@ -94,6 +95,7 @@ class SparkSubmitOptionParser {
     { DEPLOY_MODE },
     { DRIVER_CLASS_PATH },
     { DRIVER_CORES },
+    { DRIVER_DEFAULT_CLASS_PATH },
     { DRIVER_JAVA_OPTIONS },
     { DRIVER_LIBRARY_PATH },
     { DRIVER_MEMORY },
```
