GitHub - MOBIN-F/iceberg-spark-tpcds-benchmark: iceberg-spark-tpcds-benchmark

This is a TPCDS data generator and queries benchmark for Apache Spark,It also support iceberg data, which is fork from spark-tpcds-datagen

Note that the current master branch intends to support 3.2.1 on Scala 2.12.x.

How to generate TPCDS data

You can generate TPCDS data in /tmp/spark-tpcds-data:

# You need to set `SPARK_HOME` to your Spark v3.2.1 path before running a command below
$ ./bin/dsdgen --output-location /tmp/spark-tpcds-data [--iceberg]

Options for the generator

$ ./bin/dsdgen --help
Usage: spark-submit --class <this class> --conf key=value <spark tpcds datagen jar> [Options]
Options:
  --output-location [STR]                Path to an output location
  --scale-factor [NUM]                   Scale factor (default: 1)
  --format [STR]                         Output format (default: parquet)
  --overwrite                            Whether it overwrites existing data (default: false)
  --iceberg                              Whether it generate iceberg data (default: false)
  --partition-tables                     Whether it partitions output data (default: false)
  --use-double-for-decimal               Whether it prefers double types instead of decimal types (default: false)
  --use-string-for-char                  Whether it prefers string types instead of char/varchar types (default: false)
  --cluster-by-partition-columns         Whether it cluster output data by partition columns (default: false)
  --filter-out-null-partition-values     Whether it filters out NULL partitions (default: false)
  --table-filter [STR]                   Queries to filter, e.g., catalog_sales,store_sales
  --num-partitions [NUM]                 # of partitions (default: 100)

Options for the queries

Options:
  --data-location      Path to TPCDS data
  --query-filter       Queries to filter, e.g., q3,q5,q13
  --cbo                Whether to enable cost-based optimization
  --iceberg            Whether queries iceberg tables

Run specific TPCDS quries only

To run a part of TPCDS queries, you type:

$ ./bin/run-tpcds-benchmark --data-location [TPCDS data/Iceberg TPCDS data] --query-filter "q2,q5" [--iceberg]

Name		Name	Last commit message	Last commit date
Latest commit History 25 Commits
bin		bin
build		build
src		src
thirdparty		thirdparty
.gitignore		.gitignore
.travis.yml		.travis.yml
LICENSE		LICENSE
README.md		README.md
pom.xml		pom.xml
scalastyle-config.xml		scalastyle-config.xml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

How to generate TPCDS data

Options for the generator

Options for the queries

Run specific TPCDS quries only

About

Releases

Packages

Languages

License

MOBIN-F/iceberg-spark-tpcds-benchmark

Folders and files

Latest commit

History

Repository files navigation

How to generate TPCDS data

Options for the generator

Options for the queries

Run specific TPCDS quries only

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages