Commit 7e4a3ad

Docs for Spark 2.1.0-2.2.0-1. Removed kerberos.md from file list, and fixed file path in merge script.
1 parent 696aa49 commit 7e4a3ad

File tree

29 files changed: +1260 −10 lines changed

pages/services/spark/2.1.0-2.2.0-1/index.md

+1 −1

@@ -2,7 +2,7 @@
 layout: layout.pug
 navigationTitle: Spark 2.1.0-2.2.0-1
 title: Spark 2.1.0-2.2.0-1
-menuWeight: 10
+menuWeight: 20
 excerpt:
 featureMaturity:
 enterprise: false
@@ -0,0 +1,31 @@
---
post_title: Custom Docker Images
menu_order: 95
feature_maturity: ""
enterprise: 'no'
---

<!-- This source repo for this topic is https://github.com/mesosphere/dcos-commons -->

**Note:** Customizing the Spark image Mesosphere provides is supported. However, customizations have the potential to adversely affect the integration between Spark and DC/OS. If Mesosphere support suspects that a customization is adversely impacting Spark with DC/OS, support may ask you to reproduce the issue with an unmodified Spark image.

You can customize the Docker image in which Spark runs by extending the standard Spark Docker image. In this way, you can install your own libraries, such as a custom Python library.

1. In your Dockerfile, extend from the standard Spark image and add your customizations:

    ```
    FROM mesosphere/spark:1.0.4-2.0.1
    RUN apt-get update && apt-get install -y python-pip
    RUN pip install requests
    ```

1. Build an image from your Dockerfile and push it to a registry:

        docker build -t username/image:tag .
        docker push username/image:tag

1. Reference your custom Docker image with the `--docker-image` option when running a Spark job:

        dcos spark run --docker-image=myusername/myimage:v1 --submit-args="http://external.website/mysparkapp.py 30"
@@ -0,0 +1,132 @@
---
post_title: Fault Tolerance
menu_order: 100
feature_maturity: ""
enterprise: 'no'
---

<!-- This source repo for this topic is https://github.com/mesosphere/dcos-commons -->

Failures such as host, network, JVM, or application failures can affect the behavior of three types of Spark components:

- DC/OS Apache Spark Service
- Batch Jobs
- Streaming Jobs

# DC/OS Apache Spark Service

The DC/OS Apache Spark service runs in Marathon and includes the Mesos Cluster Dispatcher and the Spark History Server. The Dispatcher manages jobs you submit via `dcos spark run`. Job data is persisted to ZooKeeper. The Spark History Server reads event logs from HDFS. If the service dies, Marathon will restart it, and it will reload data from these highly available stores.

# Batch Jobs

Batch jobs are resilient to executor failures, but not driver failures. The Dispatcher will restart a driver if you submit with `--supervise`.

## Driver

When the driver fails, executors are terminated, and the entire Spark application fails. If you submitted your job with `--supervise`, the Dispatcher will restart the job.
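
A minimal sketch of a supervised submission; apart from `--supervise`, the class name and artifact URL are illustrative:

    dcos spark run --submit-args="--supervise --class MySparkJob http://external.website/mysparkapp.jar"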

## Executors

Batch jobs are resilient to executor failure. Upon failure, cached data, shuffle files, and partially computed RDDs are lost. However, Spark RDDs are fault-tolerant, and Spark will start a new executor to recompute this data from the original data source, caches, or shuffle files. There is a performance cost as data is recomputed, but an executor failure will not cause a job to fail.

# Streaming Jobs

Whereas batch jobs run once and can usually be restarted upon failure, streaming jobs often need to run constantly. The application must survive driver failures, often with no data loss.

In summary, to experience no data loss, you must run with the WAL enabled. The one exception is that, if you're consuming from Kafka, you can use the Direct Kafka API.

For exactly-once processing semantics, you must use the Direct Kafka API. All other receivers provide at-least-once semantics.

## Failures

There are two types of failures:

- Driver
- Executor

## Job Features

There are a few variables that affect the reliability of your job:

- [WAL][1]
- [Receiver reliability][2]
- [Storage level][3]

## Reliability Features

The two reliability features of a job are data loss and processing semantics. Data loss occurs when the source sends data, but the job fails to process it. Processing semantics describe how many times a received message is processed by the job; this can be either "at least once" or "exactly once".
### Data loss

A Spark job loses data when delivered data does not get processed. The following is a list of configurations with increasing data preservation guarantees:

- Unreliable receivers

    Unreliable receivers do not ack data they receive from the source. This means that buffered data in the receiver will be lost upon executor failure.

    executor failure => **data loss**
    driver failure => **data loss**

- Reliable receivers, unreplicated storage level

    This is an unusual configuration. By default, Spark Streaming receivers run with a replicated storage level. But if you happen to reduce the storage level to be unreplicated, data stored on the receiver but not yet processed will not survive executor failure.

    executor failure => **data loss**
    driver failure => **data loss**

- Reliable receivers, replicated storage level

    This is the default configuration. Data stored in the receiver is replicated, and can thus survive a single executor failure. Driver failures, however, result in all executors failing, and therefore result in data loss.

    (single) executor failure => **no data loss**
    driver failure => **data loss**

- Reliable receivers, WAL

    With a WAL enabled, data stored in the receiver is written to a highly available store such as S3 or HDFS. This means that an app can recover from even a driver failure (see the configuration sketch after this list).

    executor failure => **no data loss**
    driver failure => **no data loss**

- Direct Kafka Consumer, no checkpointing

    Since Spark 1.3, the Spark+Kafka integration has supported an experimental Direct Consumer, which doesn't use traditional receivers. With the direct approach, RDDs read directly from Kafka, rather than buffering data in receivers.

    However, without checkpointing, driver restarts mean that the driver will start reading from the latest Kafka offset, rather than where the previous driver left off.

    executor failure => **no data loss**
    driver failure => **data loss**

- Direct Kafka Consumer, checkpointing

    With checkpointing enabled, Kafka offsets are stored in a reliable store such as HDFS or S3. This means that an application can restart exactly where it left off.

    executor failure => **no data loss**
    driver failure => **no data loss**
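
The following is a hedged sketch of the "reliable receivers, WAL" configuration above, assuming a Scala driver and an HDFS checkpoint path; the socket source, host, and app name are illustrative stand-ins:

```
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("ReceiverWithWAL") // illustrative name
  // Persist received blocks to the write-ahead log before acking the source
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")
val ssc = new StreamingContext(conf, Seconds(10))
// The WAL is written under the checkpoint directory, so it must be an HA store
ssc.checkpoint("hdfs://hdfs/checkpoint")
// With the WAL enabled, in-memory replication is redundant; a serialized,
// unreplicated storage level avoids storing the received data twice
val lines = ssc.socketTextStream("source.example.com", 9999, StorageLevel.MEMORY_AND_DISK_SER)
lines.count().print()
ssc.start()
ssc.awaitTermination()
```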

### Processing semantics

Processing semantics apply to how many times received messages get processed. With Spark Streaming, this can be either "at least once" or "exactly once".

The semantics below apply to Spark's receipt of the data. To provide an end-to-end exactly-once guarantee, you must additionally verify that your output operation provides exactly-once guarantees. More info [here][4].

- Receivers

    **at least once**

    Every Spark Streaming consumer, with the exception of the Direct Kafka Consumer described below, uses receivers. Receivers buffer blocks of data in memory, then write them out according to the storage level of the job. After writing out the data, the receiver sends an ack to the source so the source knows not to resend. However, if this ack fails, or if the node fails between writing out the data and sending the ack, an inconsistency arises: Spark believes that the data has been received, but the source does not. This results in the source resending the data, and it being processed twice.

- Direct Kafka Consumer

    **exactly once**

    The Direct Kafka Consumer avoids the problem described above by reading directly from Kafka and storing the offsets itself in the checkpoint directory.

    More information [here][5].
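
A hedged sketch of the Direct Kafka Consumer with checkpointing, using the Kafka 0.8 direct API described in the blog post linked above; the broker address, topic, and checkpoint path are illustrative:

```
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val checkpointDir = "hdfs://hdfs/checkpoint"

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("DirectKafkaJob")
  val ssc = new StreamingContext(conf, Seconds(10))
  // Kafka offsets (and the DStream lineage) are checkpointed to the reliable store
  ssc.checkpoint(checkpointDir)
  val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
    ssc,
    Map("metadata.broker.list" -> "broker.example.com:9092"),
    Set("mytopic"))
  stream.map(_._2).count().print()
  ssc
}

// On a driver restart, getOrCreate recovers the context -- and the stored
// offsets -- from the checkpoint instead of building a fresh one
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()
```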

[1]: https://spark.apache.org/docs/1.6.0/streaming-programming-guide.html#requirements
[2]: https://spark.apache.org/docs/1.6.0/streaming-programming-guide.html#with-receiver-based-sources
[3]: http://spark.apache.org/docs/latest/programming-guide.html#which-storage-level-to-choose
[4]: http://spark.apache.org/docs/latest/streaming-programming-guide.html#semantics-of-output-operations
[5]: https://databricks.com/blog/2015/03/30/improvements-to-kafka-integration-of-spark-streaming.html
@@ -0,0 +1,110 @@
---
post_title: Configure Spark for HDFS
nav_title: HDFS
menu_order: 20
enterprise: 'no'
---

<!-- This source repo for this topic is https://github.com/mesosphere/dcos-commons -->

You can configure Spark for a specific HDFS cluster.

Set `hdfs.config-url` to a URL that serves your `hdfs-site.xml` and `core-site.xml`. For example, if `http://mydomain.com/hdfs-config/hdfs-site.xml` and `http://mydomain.com/hdfs-config/core-site.xml` are valid URLs:

```json
{
  "hdfs": {
    "config-url": "http://mydomain.com/hdfs-config"
  }
}
```

For more information, see [Inheriting Hadoop Cluster Configuration][8].

For DC/OS HDFS, these configuration files are served at `http://<hdfs.framework-name>.marathon.mesos:<port>/v1/endpoints`, where `<hdfs.framework-name>` is a configuration variable set in the HDFS package, and `<port>` is the port of its Marathon app.

### Spark Checkpointing

To use Spark with checkpointing, follow the instructions [here](https://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing) and use an HDFS directory as the checkpointing directory. For example:

```
val checkpointDirectory = "hdfs://hdfs/checkpoint"
val ssc = ...  // your StreamingContext
ssc.checkpoint(checkpointDirectory)
```

The HDFS directory is created automatically, and the Spark streaming app will recover from checkpointed data even in the presence of application restarts and failures.
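
For a restarted driver to actually pick up that checkpointed data, the context is typically built through `StreamingContext.getOrCreate`. A hedged, more complete sketch of the snippet above; the app name, batch interval, and transformations are illustrative:

```
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDirectory = "hdfs://hdfs/checkpoint"

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("MyCheckpointedApp")
  val ssc = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint(checkpointDirectory)
  // ... define your input DStreams and transformations here ...
  ssc
}

// Builds a fresh context on first run; recovers from the checkpoint on restart
val ssc = StreamingContext.getOrCreate(checkpointDirectory, createContext _)
ssc.start()
ssc.awaitTermination()
```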

# HDFS Kerberos

You can access external (i.e., non-DC/OS) Kerberos-secured HDFS clusters from Spark on Mesos.

## HDFS Configuration

After you've set up a Kerberos-enabled HDFS cluster, configure Spark to connect to it. See instructions [here](#hdfs).

## Installation

1. A `krb5.conf` file tells Spark how to connect to your KDC. Base64 encode this file:

        cat krb5.conf | base64

1. Add the following to your JSON configuration file to enable Kerberos in Spark:

        {
          "security": {
            "kerberos": {
              "krb5conf": "<base64 encoding>"
            }
          }
        }

1. If you've enabled the history server via `history-server.enabled`, you must also configure the principal and keytab for the history server. **WARNING**: The keytab contains secrets, so you should ensure you have SSL enabled while installing DC/OS Apache Spark.

    Base64 encode your keytab:

        cat spark.keytab | base64

    And add the following to your configuration file:

        {
          "history-server": {
            "kerberos": {
              "principal": "spark@REALM",
              "keytab": "<base64 encoding>"
            }
          }
        }

1. Install Spark with your custom configuration, here called `options.json`:

        dcos package install --options=options.json spark

## Job Submission

To authenticate to a Kerberos KDC, DC/OS Apache Spark supports keytab files as well as ticket-granting tickets (TGTs).

Keytabs are valid indefinitely, while tickets can expire. Keytabs are therefore recommended, especially for long-running streaming jobs.

### Keytab Authentication

Submit the job with the keytab:

    dcos spark run --submit-args="\
    --kerberos-principal user@REALM \
    --keytab-secret-path /__dcos_base64__hdfs-keytab \
    --conf ... --class MySparkJob <url> <args>"

### TGT Authentication

Submit the job with the ticket:

    dcos spark run --submit-args="\
    --kerberos-principal hdfs/name-0-node.hdfs.autoip.dcos.thisdcos.directory@LOCAL \
    --tgt-secret-path /__dcos_base64__tgt \
    --conf ... --class MySparkJob <url> <args>"

**Note:** These credentials are security-critical. The DC/OS Secret Store requires you to base64 encode binary secrets (such as the Kerberos keytab) before adding them. If they are uploaded with the `__dcos_base64__` prefix, they are automatically decoded when the secret is made available to your Spark job. If the secret name **doesn't** have this prefix, the keytab will be decoded and written to a file in the sandbox. This leaves the secret exposed and is not recommended. We also highly recommend configuring SSL encryption between the Spark components when accessing Kerberos-secured HDFS clusters. See the Security section for information on how to do this.

[8]: http://spark.apache.org/docs/latest/configuration.html#inheriting-hadoop-cluster-configuration
@@ -0,0 +1,55 @@
---
post_title: History Server
menu_order: 30
enterprise: 'no'
---

<!-- This source repo for this topic is https://github.com/mesosphere/dcos-commons -->

DC/OS Apache Spark includes the [Spark History Server][3]. Because the history server requires HDFS, you must explicitly enable it.

1. Install HDFS:

        dcos package install hdfs

    **Note:** HDFS requires 5 private nodes.

1. Create a history HDFS directory (default is `/history`). [SSH into your cluster][10] and run:

        docker run -it mesosphere/hdfs-client:1.0.0-2.6.0 bash
        ./bin/hdfs dfs -mkdir /history

1. Create `spark-history-options.json`:

        {
          "hdfs-config-url": "http://api.hdfs.marathon.l4lb.thisdcos.directory/v1/endpoints"
        }

1. Install the Spark History Server:

        dcos package install spark-history --options=spark-history-options.json

1. Create `spark-dispatcher-options.json`:

        {
          "service": {
            "spark-history-server-url": "http://<dcos_url>/service/spark-history"
          },
          "hdfs": {
            "config-url": "http://api.hdfs.marathon.l4lb.thisdcos.directory/v1/endpoints"
          }
        }

1. Install the Spark dispatcher:

        dcos package install spark --options=spark-dispatcher-options.json

1. Run jobs with the event log enabled:

        dcos spark run --submit-args="--conf spark.eventLog.enabled=true --conf spark.eventLog.dir=hdfs://hdfs/history ... --class MySampleClass http://external.website/mysparkapp.jar"

1. Visit your job in the dispatcher at `http://<dcos_url>/service/spark/`. It will include a link to the history server entry for that job.

[3]: http://spark.apache.org/docs/latest/monitoring.html#viewing-after-the-fact
[10]: https://dcos.io/docs/1.9/administering-clusters/sshcluster/
