Port GoldiLocksFirstTry to Java #54

Closed
Changes from all commits (238 commits)
b5d84c2
first tries.....
Oct 8, 2015
3e02f19
Would be good if we had a build system
holdenk Oct 8, 2015
991e8ed
Move GoldiLocks example
holdenk Oct 8, 2015
810e7c7
fix sbt build and remove some blank lines
holdenk Oct 8, 2015
3a63a68
Add a travis config file
holdenk Oct 8, 2015
7d77086
Yay it compiles
holdenk Oct 8, 2015
6b07f48
Update packages and mllib
holdenk Oct 8, 2015
326390b
Add a bit of a note about what the params are since the coffee wasn't…
holdenk Oct 8, 2015
2d3b951
Add an Artisanal (non spark-testing-base) test
holdenk Oct 8, 2015
1f7f78a
re-indent
holdenk Oct 8, 2015
93865a6
update tag format
holdenk Oct 8, 2015
e678b1e
use _ for tag
holdenk Oct 8, 2015
1971052
update tag to ::
holdenk Oct 10, 2015
d33c271
add two kind of simple examples
holdenk Oct 11, 2015
ab0f11b
stop the spark context
holdenk Oct 11, 2015
fe14bc8
fix stuff
holdenk Oct 11, 2015
ae8fdce
Add a quick function for generating some test like data
holdenk Oct 11, 2015
b150cc3
Add tags
holdenk Oct 11, 2015
1b1874e
Add some []s to the tags
holdenk Oct 13, 2015
ccf4c9d
refactoring Quantile to be a static class
Oct 15, 2015
1056ae9
"fixing the build with refactor"
Oct 16, 2015
923fb33
Add createHiveContext and createSqlContext
holdenk Oct 18, 2015
504ba6f
add a tag around the components
holdenk Oct 18, 2015
8fe754f
correct name is examples
holdenk Oct 18, 2015
4fa1643
Add a few more spark sql examples
holdenk Oct 21, 2015
32653b7
More SQL examples
holdenk Oct 25, 2015
9eb0e4f
Add panda encoding example, fix second min happy panda filter to make…
holdenk Oct 26, 2015
3732712
build fix
holdenk Oct 26, 2015
e81c0db
Show a quick example of specifying the schema
holdenk Oct 26, 2015
e2231de
enable fork
holdenk Oct 27, 2015
f8910d1
Add verify approx by hand example
holdenk Oct 27, 2015
927d81e
We don't use sudo anyways, disable it and turn on the cache for ci
holdenk Oct 27, 2015
a093fc2
Add a SampleData tool
holdenk Oct 30, 2015
f1e6a5e
Start adding more on happy pandas magic
holdenk Nov 2, 2015
53e385b
Joins start
holdenk Nov 2, 2015
3a3b064
refactoring goldilocks and adding comments
Oct 29, 2015
c6bd667
first implementation of secondary sort with abstract types
Oct 29, 2015
be9e11d
secondary sort tests
Oct 29, 2015
341c97a
creating real unit test for secondary sort
Oct 29, 2015
424d019
adding tags to secondary sort
Nov 9, 2015
9fe50e4
fixing dantaFrame
Nov 9, 2015
575cd92
Add ze tags
holdenk Nov 9, 2015
a4eaf78
Switch our include for spark testing base to new version, stop depend…
holdenk Nov 9, 2015
7107096
Add a SQL dataframe query example
holdenk Nov 9, 2015
4bb76d5
Add []'s to tags in secondary sort
holdenk Nov 9, 2015
a62b84e
fixing tags
Nov 9, 2015
984bbdc
Fix build error
holdenk Nov 10, 2015
f6ff65f
Add a simple wordcount example
holdenk Nov 10, 2015
d714333
adding final set of examples and fixing a few problems w SecondarySor…
Nov 22, 2015
8c52af1
Add UDF & UDAF example
holdenk Nov 29, 2015
3ecc265
More json loading examples
holdenk Dec 9, 2015
fe97ea8
Switch to non deprecated API and add package for test
holdenk Dec 9, 2015
9ffec08
Add some loadsave and querying sql examples
holdenk Dec 13, 2015
5b96227
Load/save parquet
holdenk Dec 19, 2015
0b0fc3a
sometime sbt builds make me sad
holdenk Dec 19, 2015
444d8e8
Switch PandaMagic to PandaInfo/PandaPlace
holdenk Dec 21, 2015
fb2eeb4
whoops some old references
holdenk Dec 21, 2015
2bc36e0
Add pandaInfo test cases
Dec 24, 2015
e6cd103
Add pandas test cases
Dec 28, 2015
8c5071f
Tag imports
holdenk Dec 29, 2015
64b5f02
Create configurable test case for computeRelativePandaSizes
Dec 31, 2015
356d784
Reduce computeRelativePandaSizes test case size
Dec 31, 2015
e4969d8
Merge pull request #5 from mahmoudhanafy/master
holdenk Dec 31, 2015
f4363e9
Add documentation and examples to GoldiLocksFirstTry
mahmoudhanafy Jan 4, 2016
68de0e9
Merge pull request #7 from mahmoudhanafy/add-documentation-tests
rachelwarren Jan 4, 2016
b9a2413
adding one test
Jan 4, 2016
f59fd59
Merge pull request #8 from high-performance-spark/add-test-case
rachelwarren Jan 6, 2016
7c54c22
Add documentation and examples to GoldiLocksWithHashMap
mahmoudhanafy Jan 6, 2016
48fd980
Merge pull request #9 from mahmoudhanafy/add-documentation-tests
holdenk Jan 7, 2016
633f68f
update the tags in the join example so we can split them up in the sq…
holdenk Jan 7, 2016
bd08e45
Add a shell script to launch a class with mysql jdbc connector (TODO:…
holdenk Jan 11, 2016
ca171e1
Note that mysql connector is GPLd
holdenk Jan 11, 2016
69f9eee
Update exclusions
holdenk Jan 11, 2016
24dc5f4
Upgrade to 1.6.0
holdenk Jan 12, 2016
04dec46
Change generate scaling data a bit
holdenk Jan 12, 2016
ca843d1
Start adding a simple perf
holdenk Jan 12, 2016
97b709f
Add some more bits for a simple perf test
holdenk Jan 12, 2016
e9b6d7f
Support making assembly
holdenk Jan 14, 2016
9984ce9
Fix simple perf test and fix generating data to have same length of p…
holdenk Jan 14, 2016
832ec57
Fix up random number generator and add a simple perf test example
holdenk Jan 15, 2016
e272d36
minor refactor
holdenk Jan 23, 2016
fd9571a
Merge pull request #16 from holdenk/upgrade-to-1.6
holdenk Jan 23, 2016
486d492
Merge pull request #17 from holdenk/benchmark
holdenk Jan 23, 2016
106fed2
Start adding mixed dataset example
holdenk Jan 15, 2016
93152b1
Fill in some stuff
holdenk Jan 23, 2016
f32b8b6
Merge pull request #18 from holdenk/initial-dataset
holdenk Jan 23, 2016
dbc2507
Add tags
holdenk Jan 23, 2016
d087624
Add an example of purely relational queries and one of a functional q…
holdenk Jan 25, 2016
a02608e
Adding test of evaluation for chapter 4
Jan 25, 2016
200e27d
change evaluation tests to use shared spark context
Jan 25, 2016
f62d8f7
Add a basic union example
holdenk Jan 26, 2016
1602708
Merge in master
holdenk Jan 26, 2016
cedc0cb
Include spark-csv in the old fashion maven coordinates way
holdenk Jan 26, 2016
3f7c92c
s/fuzzy/squishy/ (attribute 0 is squishy in the text)
holdenk Jan 26, 2016
8ca4a75
break up the SQLHiveComponent
holdenk Jan 26, 2016
2928c93
Use RawPanda accross places a bit more
holdenk Jan 26, 2016
9e68349
add a raw panda json
holdenk Jan 26, 2016
80ebcbb
adding caching example
Jan 26, 2016
78618ec
Update the schema a bit more (include the panda type in the schema so…
holdenk Jan 27, 2016
75489d1
make nicer simple examples
holdenk Jan 27, 2016
2ec56be
Update test now that the the encoder works on raw pandas
holdenk Jan 29, 2016
e939fb3
drop duplicate example
holdenk Jan 29, 2016
e40c890
Merge pull request #20 from holdenk/sql-chapter-updates
holdenk Jan 29, 2016
32dd5be
formatting
Jan 31, 2016
7b59eb1
fix biuld
Jan 31, 2016
fc1f08b
Merge pull request #19 from high-performance-spark/chapter4
rachelwarren Jan 31, 2016
11b172b
change simple perf test to compute the average
holdenk Feb 1, 2016
6c155fb
Add a sample for chapter 4 to show saving setup overhead (Bad sample …
holdenk Feb 1, 2016
7b10eb0
Merge pull request #21 from holdenk/custom-sample-itr
holdenk Feb 1, 2016
bee09df
Merge pull request #22 from holdenk/change-benchmark-to-average
holdenk Feb 2, 2016
110567e
some more tests
Jan 31, 2016
c6083c5
with tags this time
Feb 2, 2016
ed6ff81
Merge pull request #23 from high-performance-spark/chapter4
holdenk Feb 2, 2016
b5325b4
benchmarking works better when you compare the results :p
holdenk Feb 2, 2016
e931747
Add grouped timings
holdenk Feb 2, 2016
ee789a2
do pair of rdds
holdenk Feb 2, 2016
8afc12c
switch to count
holdenk Feb 2, 2016
a2f6a23
Resolve code inspection issues
mahmoudhanafy Feb 1, 2016
2ef9bea
Add accumulator example
holdenk Feb 3, 2016
784e482
move to the correct package
holdenk Feb 3, 2016
efa0c9b
Add a test
holdenk Feb 3, 2016
0bd06da
Merge pull request #26 from holdenk/add-accumulators-example
holdenk Feb 5, 2016
d94cffe
Revert pandas schema to original one
mahmoudhanafy Feb 5, 2016
5281e5a
match description in text
holdenk Feb 11, 2016
0bfb4c8
Add a select explode example
holdenk Feb 11, 2016
856fb61
Keep full list of imports
mahmoudhanafy Feb 11, 2016
db51dac
Merge pull request #24 from mahmoudhanafy/code-inspection
holdenk Feb 11, 2016
0c5d361
Merge branch 'master' of github.com:high-performance-spark/high-perfo…
holdenk Feb 11, 2016
0d71330
Class was renamed in merge
holdenk Feb 11, 2016
da471b4
Simplify mapPartitions example and add tag
holdenk Feb 11, 2016
2bc9e39
Merge pull request #32 from holdenk/add-custom-sample-map-partitions
holdenk Feb 11, 2016
79dcb3f
Merge branch 'master' into add-custom-sample-map-partitions-r3
holdenk Feb 11, 2016
bd48249
Add a simple broadcast example
holdenk Feb 11, 2016
ae6267f
Add the tag
holdenk Feb 11, 2016
83fbe8e
Merge pull request #33 from holdenk/add-custom-sample-map-partitions-r3
holdenk Feb 11, 2016
9a5ba38
Merge pull request #34 from holdenk/broadcast-example
holdenk Feb 11, 2016
25a92d6
[WIP] placeholder for adding iter to iter
Feb 6, 2016
dd96701
adding the iterator to iterator panda
Feb 11, 2016
3ee3dec
fix indent
Feb 11, 2016
0d3b63e
fix last comment
Feb 11, 2016
c6ec145
refactoring first try to use iter to iter
Feb 11, 2016
0317de1
Merge pull request #30 from high-performance-spark/chapter4
rachelwarren Feb 11, 2016
bbf51e1
code examples to go with diagrams
Feb 16, 2016
4233556
spacing
Feb 16, 2016
a1e2b94
Merge pull request #35 from high-performance-spark/chapter4
holdenk Feb 16, 2016
066ef72
Capitalize the string in comments
holdenk Feb 22, 2016
53d6a73
Show a coffee shop join
holdenk Feb 28, 2016
d59e730
Add max panda sizes on datasets in two ways
holdenk Mar 1, 2016
bfcd99d
kill extra blank line
holdenk Mar 1, 2016
1236126
Merge pull request #36 from holdenk/add-dataset-max-by-zip
holdenk Mar 1, 2016
84b12ad
Merge branch 'master' of github.com:holdenk/high-performance-spark-ex…
holdenk Mar 4, 2016
cd681c2
Merge branch 'master' of github.com:high-performance-spark/high-perfo…
holdenk Mar 4, 2016
bd2d985
Add a sample of describe
holdenk Mar 4, 2016
bcf2a1c
Merge pull request #37 from holdenk/master
holdenk Mar 4, 2016
4142543
Add a self join example
holdenk Mar 4, 2016
dc35316
fix tag
holdenk Mar 4, 2016
d4183a7
Merge pull request #38 from holdenk/dataset-self-join
holdenk Mar 4, 2016
8cb222f
oops swap char in tag
holdenk Mar 4, 2016
92cd502
Merge pull request #39 from holdenk/dataset-self-join
holdenk Mar 4, 2016
affa2d2
add word count thing with stop words
Mar 4, 2016
ea7c994
Add self join example
holdenk Mar 5, 2016
a70d33d
Merge pull request #41 from holdenk/df-self-join
holdenk Mar 5, 2016
6cbb262
add tag to original
Mar 5, 2016
fc8988d
Merge pull request #40 from high-performance-spark/wordCountStuff
rachelwarren Mar 5, 2016
81520ee
Show a load example
holdenk Mar 6, 2016
4e74d2f
Merge pull request #42 from holdenk/jsonrdd-load
holdenk Mar 6, 2016
e3ebd0e
Fix indentation add leftouterjoin example
holdenk Mar 7, 2016
f475af5
Merge pull request #43 from holdenk/outer-rdd-joins
holdenk Mar 7, 2016
ca7981e
Cast to double for fuzzyness
holdenk Mar 8, 2016
6f5e613
Add an append example
holdenk Mar 8, 2016
841fec5
change murh to squishyness
holdenk Mar 8, 2016
177348b
specify squishyness
holdenk Mar 8, 2016
e106926
Comment typo
holdenk Mar 9, 2016
9e01fc7
Long line break
holdenk Mar 17, 2016
63e69ab
long line break
holdenk Mar 17, 2016
52a7176
typo
holdenk Mar 19, 2016
ba15f69
Add a really basic note to the README
holdenk Apr 28, 2016
f2b8897
Bump scala version to 2.11.6 andd gfortran, binutils, and R toolcahin…
holdenk Apr 28, 2016
cbab776
Add Imap package
holdenk Apr 28, 2016
a5a8091
Add the two sum files
holdenk Apr 28, 2016
5666b45
Add JniNative plugin from Jakob Odersky
holdenk Apr 28, 2016
60baf28
Go back to 2.10.4 for hive thriftserver JAR :(
holdenk Apr 29, 2016
710299c
Start adding sum wrapper example
holdenk May 1, 2016
79162cd
Update JNI build using Jakob's library (thnx\!)
holdenk May 2, 2016
92b7c3b
Plumb through the call
holdenk May 2, 2016
1a50144
Add fromRDD example for datasets
holdenk May 9, 2016
e1ebf78
More JNI stuff - but jodersky is updating his JNI wrapper so lets chi…
holdenk May 10, 2016
77b53a2
Add more test magic
holdenk May 17, 2016
fabc1ad
Start adding a PySpark example
holdenk May 17, 2016
cc6230a
Add tags
holdenk May 17, 2016
7081534
Add more packages and cache them
holdenk May 17, 2016
d7fd878
urgh pip install pandas is sad pandas
holdenk May 17, 2016
ed53801
Cache sbt launchers
holdenk May 17, 2016
4b065ac
Remove trailing slash
holdenk May 17, 2016
37527b5
fix fetch
holdenk May 17, 2016
97ac4e7
Wait crap I'm an idiot
holdenk May 17, 2016
a143793
pre 2.0 support
holdenk May 17, 2016
53dc82a
Add log4j.properties file
holdenk May 17, 2016
ba4e0d4
Somewhat hacky update to Scala 2.11 (will be better once 2.0 comes ou…
holdenk May 17, 2016
0c94c7d
Fix sample scala code
holdenk May 17, 2016
8938e28
Update to new base makefile using nativeInit CMake
holdenk May 17, 2016
229af61
Update to new Arbitraryt generator
holdenk May 17, 2016
1bf8353
Merge pull request #45 from holdenk/ch3-show-df-rt-magics
holdenk May 17, 2016
fc3332c
Use native loader magics
holdenk May 17, 2016
2abc06f
Upgrade to latest test library and enable property check
holdenk May 17, 2016
d7b66ca
Merge pull request #46 from holdenk/going-beyond-scala-branch
holdenk May 18, 2016
6a85e15
Add cut lineage Scala example
holdenk May 18, 2016
6da3ec7
Merge pull request #50 from holdenk/ch3-df-rt-scala
holdenk May 18, 2016
7ad749d
Add a simple version of the Python perf test
holdenk May 18, 2016
f7bc3e5
Fix a bunch of the timing code
holdenk May 18, 2016
c3195d9
Enable GC for timing
holdenk May 19, 2016
90c1b66
test functions were being picked up by nose
holdenk May 19, 2016
329bc3d
Merge pull request #51 from holdenk/ch3-benchmark-in-python
holdenk May 19, 2016
193a6f7
try and add coverage
holdenk May 19, 2016
09b3d85
Merge branch 'master' of github.com:high-performance-spark/high-perfo…
holdenk May 19, 2016
6fcfcf3
Upgrade java version to 8 to use lambda expressions
mahmoudhanafy May 16, 2016
da2acab
Port HappyPandas to Java
mahmoudhanafy May 16, 2016
d276733
Create separate package for java beans
mahmoudhanafy May 18, 2016
c40b69d
Remove squishPandaFromPace example
mahmoudhanafy May 18, 2016
1e58afd
Add Java8 to .travis.yml
mahmoudhanafy May 19, 2016
e792bb3
Remove startJDBCServer example
May 19, 2016
a252514
Convert java methods to static
May 20, 2016
b887283
Add simple test cases for JavaHappyPandas
May 20, 2016
6ce7baf
Implements Serializable to all Java Beans
May 21, 2016
d7de259
Port LoadSave to Java
May 21, 2016
8a395d4
Merge pull request #49 from mahmoudhanafy/port-HappyPandas-to-Java
holdenk May 23, 2016
f338714
Port UDFs to Java
mahmoudhanafy May 23, 2016
6e4ea73
Fix evaluate average UDF
May 23, 2016
98cdc48
Merge pull request #52 from mahmoudhanafy/port-UDF-to-java
holdenk May 23, 2016
0b0ea92
Add tag for basicUDF to JavaUDFs.java
holdenk May 23, 2016
5662b36
Start adding Java interop example. TODO test and make sure the non-ca…
holdenk May 24, 2016
6bd9e65
Add some tests for the JavaInterop component
holdenk May 25, 2016
2393c59
Add tags so we can include it
holdenk May 25, 2016
d9aebfd
Merge pull request #53 from holdenk/java-interop
holdenk May 25, 2016
9fd5d78
Port GoldiLocksGroupByKey to Java
May 21, 2016
abd4365
Port GoldiLocksFirstTry to Java
May 22, 2016
512c616
Use SharedSparkContext at QuantileOnlyArtisanalTest
mahmoudhanafy May 25, 2016
40b8f69
Add test case for JavaGoldiLocksFirstTry
mahmoudhanafy May 25, 2016
6 changes: 6 additions & 0 deletions .gitignore
@@ -15,3 +15,9 @@ project/plugins/project/
# Scala-IDE specific
.scala_dependencies
.worksheet

# emacs stuff
\#*\#
\.\#*
*~
sbt/*launch*.jar
38 changes: 38 additions & 0 deletions .travis.yml
@@ -0,0 +1,38 @@
language: scala
sudo: false
cache:
  directories:
    - $HOME/.ivy2
    - $HOME/spark
    - $HOME/.cache/pip
    - $HOME/.sbt/launchers
scala:
  - 2.11.6
jdk:
  - oraclejdk8
addons:
  apt:
    sources:
      - ubuntu-toolchain-r-test
    packages:
      - pandas
      - numpy
      - gfortran
      - gcc
      - binutils
      - python-pip
r_packages:
  - Imap
before_install:
  - pip install --user codecov unittest2 nose pep8 pylint --download-cache $HOME/.pip-cache
script:
  - "export SPARK_CONF_DIR=./log4j/"
  - sbt clean coverage compile test
  - "[ -d spark ] || (mkdir spark && cd spark && wget http://d3kbcqa49mib13.cloudfront.net/spark-1.6.1-bin-hadoop2.6.tgz)"
  - "tar -xf ./spark/spark-1.6.1-bin-hadoop2.6.tgz"
  - "export SPARK_HOME=`pwd`/spark-1.6.1-bin-hadoop2.6"
  - "export PYTHONPATH=$SPARK_HOME/python:`ls -1 $SPARK_HOME/python/lib/py4j-*-src.zip`:$PYTHONPATH"
  - "nosetests --with-doctest --doctest-options=+ELLIPSIS --logging-level=INFO --detailed-errors --verbosity=2 --with-coverage --cover-html-dir=./htmlcov"
after_success:
  # For now no coverage report
  - codecov
3 changes: 3 additions & 0 deletions LICENSE
@@ -1,3 +1,6 @@
Individual components under resources are available under their own licenses.
* The MySQL connector is GPL-licensed
The source code in this repo is available under the Apache License
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
8 changes: 8 additions & 0 deletions README.md
@@ -1,2 +1,10 @@
# high-performance-spark-examples
Examples for High Performance Spark

# Building

Most of the examples can be built with sbt; the C and Fortran components depend on gcc, g77, and cmake.

# Tests

The full test suite depends on having the C and Fortran components built, as well as a local R installation.
90 changes: 90 additions & 0 deletions build.sbt
@@ -0,0 +1,90 @@
organization := "com.highperformancespark"

name := "examples"

publishMavenStyle := true

version := "0.0.1"

scalaVersion := "2.11.6"

crossScalaVersions := Seq("2.11.6")

javacOptions ++= Seq("-source", "1.8", "-target", "1.8")

sparkVersion := "1.6.1"

//tag::sparkComponents[]
// TODO(Holden): re-add hive-thriftserver post Spark 2.0
sparkComponents ++= Seq("core", "streaming", "mllib")
//end::sparkComponents[]
//tag::addSQLHiveComponent[]
sparkComponents ++= Seq("sql", "hive")
//end::addSQLHiveComponent[]


parallelExecution in Test := false

fork := true

javaOptions ++= Seq("-Xms512M", "-Xmx2048M", "-XX:MaxPermSize=2048M", "-XX:+CMSClassUnloadingEnabled")

// additional libraries
libraryDependencies ++= Seq(
  "org.scalatest" %% "scalatest" % "2.2.1",
  "org.scalacheck" %% "scalacheck" % "1.12.4",
  "junit" % "junit" % "4.10",
  // Temporary hack until Spark 2.0
  "org.apache.spark" % "spark-hive-thriftserver_2.10" % "1.6.1" % "provided" intransitive(),
  //tag::sparkCSV[]
  "com.databricks" % "spark-csv_2.10" % "1.3.0",
  //end::sparkCSV[]
  "com.holdenkarau" % "spark-testing-base_2.11" % "1.6.1_0.3.3",
  "org.eclipse.jetty" % "jetty-util" % "9.3.2.v20150730",
  "org.codehaus.jackson" % "jackson-mapper-asl" % "1.8.8",
  "com.novocode" % "junit-interface" % "0.10" % "test->default")


scalacOptions ++= Seq("-deprecation", "-unchecked")

pomIncludeRepository := { x => false }

resolvers ++= Seq(
  "JBoss Repository" at "http://repository.jboss.org/nexus/content/repositories/releases/",
  "Spray Repository" at "http://repo.spray.cc/",
  "Cloudera Repository" at "https://repository.cloudera.com/artifactory/cloudera-repos/",
  "Akka Repository" at "http://repo.akka.io/releases/",
  "Twitter4J Repository" at "http://twitter4j.org/maven2/",
  "Apache HBase" at "https://repository.apache.org/content/repositories/releases",
  "Twitter Maven Repo" at "http://maven.twttr.com/",
  "scala-tools" at "https://oss.sonatype.org/content/groups/scala-tools",
  "sonatype-releases" at "https://oss.sonatype.org/content/repositories/releases/",
  "Typesafe repository" at "http://repo.typesafe.com/typesafe/releases/",
  "Second Typesafe repo" at "http://repo.typesafe.com/typesafe/maven-releases/",
  "Mesosphere Public Repository" at "http://downloads.mesosphere.io/maven",
  Resolver.sonatypeRepo("public"),
  Resolver.bintrayRepo("jodersky", "sbt-jni-macros"),
  "jodersky" at "https://dl.bintray.com/jodersky/maven/"
)

licenses := Seq("Apache License 2.0" -> url("http://www.apache.org/licenses/LICENSE-2.0.html"))

mergeStrategy in assembly <<= (mergeStrategy in assembly) { (old) =>
  {
    case m if m.toLowerCase.endsWith("manifest.mf") => MergeStrategy.discard
    case m if m.startsWith("META-INF") => MergeStrategy.discard
    case PathList("javax", "servlet", xs @ _*) => MergeStrategy.first
    case PathList("org", "apache", xs @ _*) => MergeStrategy.first
    case PathList("org", "jboss", xs @ _*) => MergeStrategy.first
    case "log4j.properties" => MergeStrategy.discard
    case "about.html" => MergeStrategy.rename
    case "reference.conf" => MergeStrategy.concat
    case _ => MergeStrategy.first
  }
}

// JNI

enablePlugins(JniNative)

sourceDirectory in nativeCompile := sourceDirectory.value
40 changes: 40 additions & 0 deletions conf/log4j.properties
@@ -0,0 +1,40 @@
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

# Set everything to be logged to the console
log4j.rootCategory=ERROR, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

# Set the default spark-shell log level to ERROR. When running the spark-shell, the
# log level for this class is used to overwrite the root logger's log level, so that
# the user can have different defaults for the shell and regular Spark apps.
log4j.logger.org.apache.spark.repl.Main=ERROR

# Settings to quiet third party logs that are too verbose
log4j.logger.org.spark-project.jetty=ERROR
log4j.logger.org.spark-project.jetty.util.component.AbstractLifeCycle=ERROR
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO
log4j.logger.org.apache.parquet=ERROR
log4j.logger.parquet=ERROR

# SPARK-9183: Settings to avoid annoying messages when looking up nonexistent UDFs in SparkSQL with Hive support
log4j.logger.org.apache.hadoop.hive.metastore.RetryingHMSHandler=FATAL
log4j.logger.org.apache.hadoop.hive.ql.exec.FunctionRegistry=ERROR
38 changes: 38 additions & 0 deletions high_performance_pyspark/SQLLineage.py
@@ -0,0 +1,38 @@
"""
>>> from pyspark.context import SparkContext
>>> from pyspark.sql import SQLContext, Row, DataFrame
>>> sc = SparkContext('local', 'test')
...
>>> sc.setLogLevel("ERROR")
>>> sqlCtx = SQLContext(sc)
...
>>> rdd = sc.parallelize(range(1, 100)).map(lambda x: Row(i = x))
>>> df = rdd.toDF()
>>> df2 = cutLineage(df)
>>> df.head() == df2.head()
True
>>> df.schema == df2.schema
True
"""

from pyspark.sql import DataFrame

#tag::cutLineage[]
def cutLineage(df):
    """
    Cut the lineage of a DataFrame - used for iterative algorithms.

    .. Note: This uses internal members and may break between versions.
    """
    jRDD = df._jdf.toJavaRDD()
    jSchema = df._jdf.schema()
    jRDD.cache()
    sqlCtx = df.sql_ctx
    try:
        # Spark 2.0 name for the Java SQLContext
        javaSqlCtx = sqlCtx._jsqlContext
    except AttributeError:
        # Pre-2.0 name
        javaSqlCtx = sqlCtx._ssql_ctx
    newJavaDF = javaSqlCtx.createDataFrame(jRDD, jSchema)
    newDF = DataFrame(newJavaDF, sqlCtx)
    return newDF
#end::cutLineage[]
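
A usage sketch (not part of this diff): cutLineage is meant for iterative algorithms where the query plan grows on every pass. The loop below is illustrative only, assuming an existing SQLContext named sqlCtx; the column names and cut interval are arbitrary:

# Illustrative only: truncate the growing plan every few iterations
df = sqlCtx.range(0, 100)
for i in range(20):
    df = df.withColumn("i", df["id"] + i)  # each pass extends the lineage
    if i % 5 == 4:
        df = cutLineage(df)  # swap in a cached RDD scan with the same schema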
25 changes: 25 additions & 0 deletions high_performance_pyspark/__init__.py
@@ -0,0 +1,25 @@
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#


"""
Python version of selected examples from High Performance Spark
"""

import os
import sys

91 changes: 91 additions & 0 deletions high_performance_pyspark/simple_perf_test.py
@@ -0,0 +1,91 @@
# When running this example, make sure to include the built Scala jar:
# $SPARK_HOME/bin/pyspark --jars ./target/examples-0.0.1.jar --driver-class-path ./target/examples-0.0.1.jar
# This example illustrates how to interface Scala and Python code, but caution
# should be taken as it depends on many private members that may change in
# future releases of Spark.

from pyspark.sql.types import *
from pyspark.sql import DataFrame
import timeit
import time

def generate_scale_data(sqlCtx, rows, numCols):
    """
    Generate scale data for the performance test.

    This also illustrates calling custom Scala code from the driver.

    .. Note: This depends on many internal methods and may break between versions.
    """
    sc = sqlCtx._sc
    # Get the Java SQL Context, 2.0 and pre-2.0 syntax
    try:
        javaSqlCtx = sqlCtx._jsqlContext
    except AttributeError:
        javaSqlCtx = sqlCtx._ssql_ctx
    jsc = sc._jsc
    scalasc = jsc.sc()
    gateway = sc._gateway
    # Call a Java method that gives us back an RDD of JVM Rows (Int, Double).
    # While Python RDDs are wrapped Java RDDs (even of Rows), the contents are
    # different, so we can't directly wrap this.
    # This returns a Java RDD of Rows - normally it would be better to return
    # a DataFrame directly, but for illustration we will work with an RDD of
    # Rows.
    java_rdd = gateway.jvm.com.highperformancespark.examples.tools.GenerateScalingData. \
        generateMiniScaleRows(scalasc, rows, numCols)
    # Schemas are serialized to JSON and sent back and forth.
    # Construct a Python schema and turn it into a Java schema.
    schema = StructType([StructField("zip", IntegerType()),
                         StructField("fuzzyness", DoubleType())])
    jschema = javaSqlCtx.parseDataType(schema.json())
    # Convert the Java RDD to a Java DataFrame
    java_dataframe = javaSqlCtx.createDataFrame(java_rdd, jschema)
    # Wrap the Java DataFrame into a Python DataFrame
    python_dataframe = DataFrame(java_dataframe, sqlCtx)
    # Convert the Python DataFrame into an RDD of (zip, fuzzyness) pairs
    pairRDD = python_dataframe.rdd.map(lambda row: (row[0], row[1]))
    return (python_dataframe, pairRDD)

def runOnDF(df):
    result = df.groupBy("zip").avg("fuzzyness").count()
    return result

def runOnRDD(rdd):
    result = rdd.map(lambda (x, y): (x, (y, 1))). \
        reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1])). \
        count()
    return result

def groupOnRDD(rdd):
    return rdd.groupByKey().mapValues(lambda v: sum(v) / float(len(v))).count()

def run(sc, sqlCtx, scalingFactor, size):
    (input_df, input_rdd) = generate_scale_data(sqlCtx, scalingFactor, size)
    input_rdd.cache().count()
    rddTimings = timeit.repeat(stmt=lambda: runOnRDD(input_rdd), repeat=10,
                               number=1, timer=time.time, setup='gc.enable()')
    groupTimings = timeit.repeat(stmt=lambda: groupOnRDD(input_rdd), repeat=10,
                                 number=1, timer=time.time, setup='gc.enable()')
    input_df.cache().count()
    dfTimings = timeit.repeat(stmt=lambda: runOnDF(input_df), repeat=10,
                              number=1, timer=time.time, setup='gc.enable()')
    print "RDD:"
    print rddTimings
    print "group:"
    print groupTimings
    print "df:"
    print dfTimings

if __name__ == "__main__":
    """
    Usage: simple_perf_test scalingFactor size
    """
    import sys
    from pyspark import SparkContext
    from pyspark.sql import SQLContext
    scalingFactor = int(sys.argv[1])
    size = int(sys.argv[2])
    sc = SparkContext(appName="SimplePythonPerf")
    sqlCtx = SQLContext(sc)
    run(sc, sqlCtx, scalingFactor, size)

    sc.stop()
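
A small follow-up sketch (not part of this PR): timeit.repeat returns one wall-clock time per run, so the raw lists printed above can be reduced to a best/mean summary. The summarize helper below is hypothetical, written in the file's Python 2 style:

# Hypothetical helper: summarize a list of per-run wall-clock times
def summarize(name, timings):
    print "%s: min=%.3fs mean=%.3fs" % (
        name, min(timings), sum(timings) / len(timings))

# e.g. summarize("RDD", rddTimings) once run() has collected the timings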
21 changes: 21 additions & 0 deletions project/plugins.sbt
@@ -0,0 +1,21 @@
addSbtPlugin("org.scalastyle" %% "scalastyle-sbt-plugin" % "0.6.0")

resolvers += "sonatype-releases" at "https://oss.sonatype.org/content/repositories/releases/"

resolvers += "sonatype-snapshots" at "https://oss.sonatype.org/content/repositories/snapshots/"


resolvers += "Spark Package Main Repo" at "https://dl.bintray.com/spark-packages/maven"

// Temporary hack for bintray being sad

resolvers += Resolver.bintrayRepo("jodersky", "sbt-jni-macros")
resolvers += "jodersky" at "https://dl.bintray.com/jodersky/maven/"

addSbtPlugin("org.spark-packages" % "sbt-spark-package" % "0.2.2")

//addSbtPlugin("com.jsuereth" % "sbt-pgp" % "1.0.0")

addSbtPlugin("org.scoverage" % "sbt-scoverage" % "1.3.3")

addSbtPlugin("ch.jodersky" % "sbt-jni" % "1.0.0-RC3")
Binary file added resources/mysql-connector-java-5.1.38.jar
2 changes: 2 additions & 0 deletions resources/rawpanda.json
@@ -0,0 +1,2 @@
{"name":"mission","pandas":[{"id":1,"zip":"94110","pt":"giant", "happy":true,
"attributes":[0.4,0.5]}]}