Spark Design Architecture

  • Spark is a distributed computing platform used mostly for big data processing.

Spark Streaming

From Kafka to the Spark engine

  • Spark Streaming provides a high-level abstraction called a discretized stream, or DStream, which represents a continuous stream of data.
  • DStreams can be created either from input data streams from sources such as Kafka, or by applying high-level operations on other DStreams.
  • Internally, a DStream is represented as a sequence of RDDs.
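
The idea that a DStream is a sequence of RDDs can be sketched without Spark at all. The following is a minimal plain-Python illustration, assuming a fixed batch interval. The names discretize and batch_interval and the sample events are hypothetical and are not part of the Spark API; each inner list only stands in for one RDD.

```python
from collections import defaultdict


def discretize(events, batch_interval):
    """Group (timestamp, value) events into fixed-width time batches.

    Hypothetical helper, not a Spark API: each returned list plays the
    role of one RDD in a DStream.
    """
    buckets = defaultdict(list)
    for timestamp, value in events:
        # Integer division assigns each event to the batch covering its timestamp.
        buckets[int(timestamp // batch_interval)].append(value)
    if not buckets:
        return []
    # Emit every interval in order, including empty ones, mirroring how a
    # batch is still produced for an interval in which no data arrived.
    first, last = min(buckets), max(buckets)
    return [buckets.get(i, []) for i in range(first, last + 1)]


# Made-up sample data: (seconds since start, message)
events = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (3.5, "d")]
batches = discretize(events, batch_interval=1.0)

# An operation on the stream becomes the same operation applied per batch.
batch_sizes = [len(batch) for batch in batches]
```

In real Spark Streaming the batching is done by the engine, and operations such as map or count on a DStream are translated into the same operations on each underlying RDD.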

RDD

How Spark works

Quickstart guide

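Download an Apache Kafka distribution and extract it. The commands below assume a ZooKeeper-based release; the --zookeeper option of kafka-topics.sh was removed in Kafka 3.0, where --bootstrap-server localhost:9092 is used instead.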

Start ZooKeeper server:

./bin/zookeeper-server-start.sh config/zookeeper.properties

Start Kafka server:

./bin/kafka-server-start.sh config/server.properties

Create input topic:

./bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 3 --topic input

Create output topic:

./bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 3 --topic output

Start Kafka producer:

./bin/kafka-console-producer.sh --broker-list localhost:9092 --topic input

Start Kafka consumer:

./bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic output

TODO: Who will control the Spark cluster?

  • Spark ships with its own standalone cluster manager, but it can also run on external cluster managers such as Hadoop YARN or Kubernetes.
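
Which cluster manager controls the application is chosen through the --master option of spark-submit. As a rough sketch, the helper below builds that command line for a few common managers. The function names, the app.py file name, and the example host are hypothetical; only the master URL formats are taken from Spark, and 7077 is assumed as the usual standalone master port.

```python
def master_url(manager, host=None, port=None):
    """Return the --master value for a cluster manager (hypothetical helper)."""
    if manager == "local":
        return "local[*]"  # run in-process, using all available cores
    if manager == "yarn":
        return "yarn"  # cluster location comes from the Hadoop configuration
    if host is None or port is None:
        raise ValueError(f"{manager} needs a host and a port")
    if manager == "standalone":
        return f"spark://{host}:{port}"
    if manager == "kubernetes":
        return f"k8s://https://{host}:{port}"
    raise ValueError(f"unknown cluster manager: {manager}")


def submit_command(manager, app, host=None, port=None):
    """Build the spark-submit argument list for the given application file."""
    return ["spark-submit", "--master", master_url(manager, host, port), app]


# Assumed setup: a standalone master on localhost and a hypothetical app.py.
cmd = submit_command("standalone", "app.py", host="localhost", port=7077)
```

Switching from standalone to another manager then only changes the --master value, not the application code.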

REF

  • https://www.tutorialspoint.com/apache_kafka/apache_kafka_integration_spark.htm
  • http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/