# spark-utilities

Utilities and applications built on top of Apache Spark to simplify common end-user tasks.

## Introduction

Currently this project contains two practical utilities:

  1. sql.query
  2. common.toParquet

### 1. sql.query

This utility is similar to Spark's built-in spark-sql CLI, but offers more convenience and flexibility. In our tests, queries run through this utility also outperformed the default spark-sql CLI in most use cases.

Advantages of the sql.query utility over the spark-sql CLI:

##### 1) No dependency on Hive

You don't have to install Hive in the cluster where Spark is deployed, and there is no need to copy hive-site.xml into Spark's conf directory for Spark to recognize Hive-registered tables. You also don't have to recompile Spark with Hive support, which the spark-sql CLI otherwise requires for SQL queries. A rough sketch of how this works is shown below.
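As an illustration of why no Hive installation is needed, the sketch below registers a file-backed dataset as a temporary view and queries it with plain Spark SQL. The path, view name, and query are hypothetical, and the actual implementation inside sql.query may differ:

```scala
import org.apache.spark.sql.SparkSession

object NoHiveQuerySketch {
  def main(args: Array[String]): Unit = {
    // A plain SparkSession: no enableHiveSupport(), no hive-site.xml required.
    val spark = SparkSession.builder().appName("sql.query sketch").getOrCreate()

    // Register a file-backed dataset as a temporary view instead of a Hive table.
    // The path and view name are hypothetical.
    spark.read.parquet("hdfs:///data/my_table").createOrReplaceTempView("my_table")

    // Run SQL against the temporary view, just as you would in spark-sql.
    spark.sql("SELECT COUNT(*) AS cnt FROM my_table").show()

    spark.stop()
  }
}
```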

##### 2) More choices for handling query results

With the spark-sql CLI you can only print query results to the current session's terminal; when the result set is large, part of it may be lost and you end up with only a partial result. With the sql.query utility you can either print results to the current terminal or store them directly on HDFS, as sketched below.
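A minimal sketch of the two output modes, assuming the utility decides based on whether the optional HDFS path argument was supplied (the helper name and the Parquet output format are assumptions, not the utility's confirmed behavior):

```scala
import org.apache.spark.sql.DataFrame

object ResultDelivery {
  // Deliver a query result either to the terminal or to HDFS.
  // `resultPath` mirrors the optional [hdfs_path_to_store_your_query_results]
  // argument in the Usage section below.
  def deliverResult(result: DataFrame, resultPath: Option[String]): Unit =
    resultPath match {
      case Some(path) => result.write.parquet(path)    // full result set, nothing truncated
      case None       => result.show(truncate = false) // prints to the current terminal
    }
}
```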

##### 3) Promising performance

In most of our test cases, queries ran faster through sql.query than through the default spark-sql CLI. Moreover, some complex queries that the spark-sql CLI could not execute at all completed successfully with this utility.

### 2. common.toParquet

This utility converts a text file into a Parquet file, using a table-schema.xml configuration file to describe the table schema. A sketch of the conversion is shown below.
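A minimal sketch of such a conversion in plain Spark, assuming the schema from table-schema.xml has already been parsed into a StructType (the field names, delimiter, and paths below are invented for illustration):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object ToParquetSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("common.toParquet sketch").getOrCreate()

    // In the real utility this schema would come from conf/table-schema.xml;
    // the fields below are made up for illustration.
    val schema = StructType(Seq(
      StructField("id", IntegerType, nullable = false),
      StructField("name", StringType, nullable = true)
    ))

    spark.read
      .option("delimiter", "\t")             // assumed field delimiter
      .schema(schema)
      .csv("hdfs:///raw/my_table.txt")       // source text file (hypothetical path)
      .write
      .parquet("hdfs:///warehouse/my_table") // target Parquet location (hypothetical path)

    spark.stop()
  }
}
```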

## Usage of the utilities

##### Step 1.

Compile the source code and build a jar named spark-utilities.jar with your favorite build tool.
##### Step 2.

Create a table-schema.xml file and place it in Spark's conf/ directory. Make sure it conforms to the example shipped in this repository's conf directory; a purely hypothetical sketch follows.
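Purely for orientation, a hypothetical table-schema.xml could look like the sketch below; the element and attribute names here are invented, so always defer to the actual example in conf/:

```xml
<!-- Hypothetical sketch only: the authoritative format is the example in conf/ -->
<table name="my_table">
  <field name="id"   type="int"    nullable="false"/>
  <field name="name" type="string" nullable="true"/>
</table>
```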
##### Step 3.

  1. Run a query:
     spark-submit --master your_spark_master --class sql.query spark-utilities.jar your_query_statement [hdfs_path_to_store_your_query_results]
  2. Convert a text file to Parquet format:
     spark-submit --master your_spark_master --class common.toParquet spark-utilities.jar table_name target_hdfs_path_to_store_the_parquet_file
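For example, to run a hypothetical aggregation query on YARN and store the full result set on HDFS (the master URL, query, and output path are all illustrative):

    spark-submit --master yarn --class sql.query spark-utilities.jar "SELECT dept, COUNT(*) FROM employees GROUP BY dept" hdfs:///results/dept_counts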
