Pull from nxbdi #2

Open: wants to merge 7 commits into master.
HandsOnWalkThrough (new file, 111 additions)
// Create an RDD of strings, explicitly spread across 4 partitions.
val originalRDD = sc.parallelize(List("Data1", "Data2", "Data4", "Data5", "Data6", "Bacon", "Hortonworks", "Hadoop",
  "Spark", "Tungsten",
  "SQL"), 4)

// flatMap splits each element on spaces and flattens the results into a single RDD.
val flatRDD = originalRDD.flatMap(_.split(" "))
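
// Illustrative aside (not part of the original walkthrough): the elements above
// are single words, so the split is a no-op. With multi-word elements the
// difference between flatMap and map becomes visible:
val phrases = sc.parallelize(List("Apache Spark", "Hadoop YARN"))
phrases.flatMap(_.split(" ")).collect() // Array(Apache, Spark, Hadoop, YARN)
phrases.map(_.split(" ")).collect()     // Array(Array(Apache, Spark), Array(Hadoop, YARN))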

// Basic actions: print every element, then count, first, and take.
originalRDD.collect().foreach(println)

originalRDD.count()

originalRDD.first()

originalRDD.take(2)

// takeSample(withReplacement, num, seed): 5 random elements, sampled with replacement.
originalRDD.takeSample(true, 5, 7634184)
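
// Related sketch (not in the original): sample is a lazy transformation that
// returns an RDD, while takeSample above is an action returning an Array to the driver.
val sampledRDD = originalRDD.sample(false, 0.5) // transformation, evaluated lazily
sampledRDD.collect()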

// mapPartitionsWithIndex exposes the partition index alongside the data.
// Note: the println output goes to the executor/interpreter log, not the notebook.
val mapped = originalRDD.mapPartitionsWithIndex {
  (index, iterator) => {
    println("Index -> " + index)
    val myList = iterator.toList
    myList.map(x => x + " -> " + index).iterator
  }
}

mapped.take(5)

val rddSpark = sc.parallelize(List("SQL","Streaming","GraphX", "MLLib", "Bagel",
"SparkR","Python","Scala","Java", "Alluxio", "Tungsten", "Zeppelin"))

val rddHadoop = sc.parallelize(List("HDFS", "YARN", "TEZ", "Hive", "HBase", "Pig", "Atlas",
"Storm", "Accumulo", "Ranger", "Phoenix", "MapReduce", "Slider", "Flume", "Kafka", "Oozie", "Sqoop", "Falcon",
"Knox", "Ambari", "Zookeeper", "Cloudbreak", "SQL", "Java", "Scala", "Python"))

// union concatenates both RDDs, keeping duplicates (e.g. "SQL" appears in both).
val bigDataRDD = rddHadoop.union(rddSpark)
bigDataRDD.collect()

// intersection returns only the elements present in both RDDs.
rddHadoop.intersection(rddSpark).collect()

// distinct removes the duplicates introduced by the union.
bigDataRDD.distinct().collect()

// sample(withReplacement, fraction): roughly a quarter of the elements.
bigDataRDD.sample(true, 0.25).collect()

bigDataRDD.count()

val keyValueRDD = sc.parallelize(Seq(
("Bacon", "Awesome"),
("PorkRoll", "Great"),
("Tofu", "Bogus")))

// groupByKey gathers all values for each key into an Iterable.
val groupByRDD = keyValueRDD.groupByKey()
groupByRDD.collect()
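
// Hedged aside: for aggregations, reduceByKey is usually preferred over
// groupByKey because it combines values map-side before the shuffle.
// A minimal word-count sketch using the flatRDD defined earlier:
val wordCounts = flatRDD.map(word => (word, 1)).reduceByKey(_ + _)
wordCounts.collect()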

val kvRDD = sc.parallelize(Seq((1, "Bacon"), (1, "Hamburger"), (1, "Cheeseburger")))
// reduceByKey merges all values sharing a key; here it concatenates the strings.
val reducedByRDD = kvRDD.reduceByKey((a, b) => a.concat(b))
reducedByRDD.take(5)

// countByKey is an action: it returns a Map of key -> count to the driver.
keyValueRDD.countByKey().foreach(println)

// Each save call writes a directory of part files in the given format.
keyValueRDD.saveAsTextFile("here")

keyValueRDD.saveAsObjectFile("here3")

keyValueRDD.saveAsSequenceFile("here2")
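
// Reading the saved output back (a sketch; assumes the three saves above succeeded):
sc.textFile("here").count()
sc.objectFile[(String, String)]("here3").count()
sc.sequenceFile[String, String]("here2").count()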

%sh ls

%sh ls here

%sh cat here/part-00003

// reduce is an action: it folds every element into a single concatenated string.
bigDataRDD.reduce((a, b) => a.concat(b))

val namesRDD = sc.parallelize(List((1, 25), (1, 27), (3, 25), (3, 27)))
// aggregateByKey(zeroValue)(seqOp, combOp): here both functions simply sum the values per key.
val aggregatedRDD = namesRDD.aggregateByKey(0)((acc, v) => acc + v, (a, b) => a + b).collect()

// sortByKey(true) sorts ascending by key; sortByKey(false) sorts descending.
val sortAscRDD = namesRDD.sortByKey(true).collect()

val sortDescRDD = namesRDD.sortByKey(false).collect()
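
// Illustrative sketch (not in the original): a (sum, count) accumulator with
// aggregateByKey computes a per-key average in a single pass.
val sumCount = namesRDD.aggregateByKey((0, 0))(
  (acc, v) => (acc._1 + v, acc._2 + 1),  // seqOp: fold one value into the accumulator
  (a, b) => (a._1 + b._1, a._2 + b._2))  // combOp: merge two partition accumulators
sumCount.mapValues { case (sum, count) => sum.toDouble / count }.collect()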

val otherKeyValueRDD = sc.parallelize(Seq(
("Bacon", "Amazing"),
("Steak", "Fine"),
("Lettuce", "Sad")))

keyValueRDD.join(otherKeyValueRDD).collect()
keyValueRDD.leftOuterJoin(otherKeyValueRDD).collect()
keyValueRDD.rightOuterJoin(otherKeyValueRDD).collect()
keyValueRDD.fullOuterJoin(otherKeyValueRDD).collect()
keyValueRDD.groupWith(otherKeyValueRDD).collect()
keyValueRDD.cartesian(otherKeyValueRDD).collect()
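// Expected shape (for orientation): "Bacon" is the only key in both RDDs, so
// join yields one row; the outer joins pad the missing side with None, and
// cartesian pairs every element of one RDD with every element of the other.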
// pipe streams each element's string form through an external command;
// cut -c2-4 keeps characters 2-4 of lines like "(Bacon,Awesome)".
keyValueRDD.pipe("cut -c2-4").collect()

// coalesce shrinks the number of partitions without a full shuffle;
// repartition redistributes the data across the requested number of partitions.
keyValueRDD.coalesce(1).collect()

keyValueRDD.repartition(2).collect()
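
// Quick check (illustrative): inspect the resulting partition counts.
keyValueRDD.coalesce(1).partitions.length    // 1
keyValueRDD.repartition(2).partitions.length // 2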

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// Required for the rdd.toDF() conversion below.
import sqlContext.implicits._

val hereRDD = sc.textFile("here")
hereRDD.count
val df = hereRDD.toDF()
df.registerTempTable("here")
df.printSchema()
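
// Hedged sketch: toDF() on an RDD[String] yields a single default-named column;
// mapping to a case class (LogLine is an illustrative name) names the column.
case class LogLine(raw: String)
val typedDF = hereRDD.map(LogLine(_)).toDF()
typedDF.printSchema() // root |-- raw: string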

%sql

select * from here
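
// The same query issued from Scala instead of the %sql interpreter:
sqlContext.sql("select * from here").show()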



// Note: unzip the access log archives (access.log.zip, access2.log.zip, access3.log.zip) before using them.
Logs.json (1 addition; large diff not rendered)

Logs2.json (1 addition; large diff not rendered)

README.md (50 additions)
# SparkTransformations
Meetup

Tim Spann

ex-Pivotal Senior Field Engineer
DZONE MVB and Zone Leader
ex-Startup Senior Engineer / Team Lead

http://www.slideshare.net/bunkertor
http://sparkdeveloper.com/
http://www.twitter.com/PaasDev


Setup

Java JDK 8, Scala 2.10, SBT 0.13, Maven 3.3.9, Spark 1.6.0
http://www.oracle.com/technetwork/java/javase/downloads/index.html 
http://www.scala-lang.org/download/2.10.6.html
http://www.scala-lang.org/download/install.html
http://www.scala-sbt.org/download.html 
http://apache.claz.org/maven/maven-3/3.3.9/binaries/apache-maven-3.3.9-bin.zip
http://spark.apache.org/downloads.html 
http://d3kbcqa49mib13.cloudfront.net/spark-1.6.0-bin-hadoop2.6.tgz
http://www.apache.org/dyn/closer.cgi/incubator/zeppelin/0.5.6-incubating/zeppelin-0.5.6-incubating-bin-all.tgz
On a Mac: brew install sbt




Running Zeppelin

https://zeppelin.incubator.apache.org/docs/0.5.6-incubating/install/install.html
https://github.com/hortonworks-gallery/zeppelin-notebooks

Download the Apache Zeppelin binary (Mac and Linux): zeppelin-0.5.6-incubating-bin-all
Unzip it, then run:
cd zeppelin-0.5.6-incubating-bin-all
./bin/zeppelin-daemon.sh start
Then open:
http://localhost:8080/

export SPARK_MASTER_IP=127.0.0.1
export SPARK_LOCAL_IP=127.0.0.1
export SCALA_HOME={YOURDIR}/scala-2.10.6
export PATH=$PATH:$SCALA_HOME/bin

On Windows, use SET instead of export, and use ; rather than : as the path separator.
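
A quick sanity check once Zeppelin is up (illustrative; run in a new note with the default Spark interpreter):

sc.version
sc.parallelize(1 to 10).sum()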



access.log.zip (binary file added; not shown)

access2.log (20,309 additions; large diff not rendered)

access2.log.zip (binary file added; not shown)

access3.log.zip (binary file added; not shown)