Merge pull request 'feature/date-and-readme' (pola-rs#88) from feature/date-and-readme into fireducks-dev

Reviewed-on: http://fire.svp.cl.nec.co.jp:3002/dpp/polars-tpch/pulls/88
k-ishizaka committed Oct 25, 2024
2 parents b956f9f + 31b8cdf commit e791c5f
Showing 14 changed files with 93 additions and 59 deletions.
50 changes: 25 additions & 25 deletions README.md
@@ -1,34 +1,34 @@
polars-tpch
===========
polars-tpch with FireDucks
==========================

This repo contains the code used for performance evaluation of polars. The benchmarks are TPC-standardised queries and data designed to test the performance of "real" workflows.
This repo contains the code used for performance evaluation of FireDucks. The benchmarks are based on https://github.com/pola-rs/tpch, and queries for FireDucks are added.

From the [TPC website](https://www.tpc.org/tpch/):
> TPC-H is a decision support benchmark. It consists of a suite of business-oriented ad hoc queries and concurrent data modifications. The queries and the data populating the database have been chosen to have broad industry-wide relevance. This benchmark illustrates decision support systems that examine large volumes of data, execute queries with a high degree of complexity, and give answers to critical business questions.
See the original README [here](README_original.md).

## Generating TPC-H Data
## Instructions

### Project setup

```shell
# clone this repository
git clone https://github.com/pola-rs/tpch.git
cd tpch/tpch-dbgen

# build tpch-dbgen
make
```
# install required packages
$ sudo apt update
$ sudo apt install python3.10-venv make gcc
### Execute
# clone benchmark
$ git clone https://github.com/fireducks-dev/polars-tpch
$ cd polars-tpch
```shell
# change directory to the root of the repository
cd ../
./run.sh
```
# prepare venv for fireducks
$ python -mvenv fireducks-venv
$ fireducks-venv/bin/pip install fireducks linetimer pydantic pydantic_settings
This will do the following,
# prepare dataset by pyarrow
$ make -C tpch-dbgen dbgen
$ (cd tpch-dbgen && ./dbgen -vf -s 10)
$ (mkdir -p data/tables_pyarrow/scale-10.0 && mv tpch-dbgen/*.tbl data/tables_pyarrow/scale-10.0/)
$ PATH_TABLES=data/tables_pyarrow SCALE_FACTOR=10 ./fireducks-venv/bin/python -m scripts.prepare_data_pyarrow
$ rm data/tables_pyarrow/scale-10.0/*.tbl # to save disk space
- Create a new virtual environment with all required dependencies.
- Generate data for benchmarks.
- Run the benchmark suite.
# run with fireducks
$ PATH_TABLES=data/tables_pyarrow SCALE_FACTOR=10 RUN_IO_TYPE=skip RUN_LOG_TIMINGS=True fireducks-venv/bin/python -m queries.fireducks
# you will see all timings in `output/run/timings.csv`
```
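
Not part of this commit: once the run above finishes, the timings file it mentions can be inspected with a few lines of Python. This is a minimal sketch that assumes only that `output/run/timings.csv` is a regular CSV file with a header row; the column names are not documented here, so rows are printed as-is. `load_timings` is a hypothetical helper name, not something the repository provides.

```python
import csv
from pathlib import Path


def load_timings(path="output/run/timings.csv"):
    """Return every row of the timings CSV as a dict keyed by the header row.

    Assumption: the file has a header line; no specific column names are assumed.
    """
    with Path(path).open(newline="") as f:
        return list(csv.DictReader(f))


if __name__ == "__main__":
    timings_path = Path("output/run/timings.csv")
    if timings_path.exists():
        for row in load_timings(timings_path):
            print(row)
    else:
        print(f"{timings_path} not found; run the benchmark first")
```

All values come back as strings, so any numeric column would need an explicit conversion before aggregating.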
34 changes: 34 additions & 0 deletions README_original.md
@@ -0,0 +1,34 @@
polars-tpch
===========

This repo contains the code used for performance evaluation of polars. The benchmarks are TPC-standardised queries and data designed to test the performance of "real" workflows.

From the [TPC website](https://www.tpc.org/tpch/):
> TPC-H is a decision support benchmark. It consists of a suite of business-oriented ad hoc queries and concurrent data modifications. The queries and the data populating the database have been chosen to have broad industry-wide relevance. This benchmark illustrates decision support systems that examine large volumes of data, execute queries with a high degree of complexity, and give answers to critical business questions.
## Generating TPC-H Data

### Project setup

```shell
# clone this repository
git clone https://github.com/pola-rs/tpch.git
cd tpch/tpch-dbgen

# build tpch-dbgen
make
```

### Execute

```shell
# change directory to the root of the repository
cd ../
./run.sh
```

This will do the following,

- Create a new virtual environment with all required dependencies.
- Generate data for benchmarks.
- Run the benchmark suite.
4 changes: 2 additions & 2 deletions queries/fireducks/q1.py
@@ -1,4 +1,4 @@
-from datetime import datetime
+from datetime import date
 from queries.fireducks import utils

 Q_NUM = 1
@@ -12,7 +12,7 @@ def query():
 lineitem = utils.get_line_item_ds()

 q_final = (
-lineitem[lineitem["l_shipdate"] <= datetime(1998, 9, 2)]
+lineitem[lineitem["l_shipdate"] <= date(1998, 9, 2)]
 .assign(
 disc_price=lambda df: df["l_extendedprice"] * (1 - df["l_discount"])
 )
6 changes: 3 additions & 3 deletions queries/fireducks/q10.py
@@ -1,4 +1,4 @@
-from datetime import datetime
+from datetime import date

 from queries.fireducks import utils

@@ -17,8 +17,8 @@ def query():
 lineitem = utils.get_line_item_ds()
 nation = utils.get_nation_ds()

-var1 = datetime(1993, 10, 1)
-var2 = datetime(1994, 1, 1)
+var1 = date(1993, 10, 1)
+var2 = date(1994, 1, 1)

 result = (
 customer.merge(orders, left_on="c_custkey", right_on="o_custkey")
6 changes: 3 additions & 3 deletions queries/fireducks/q12.py
@@ -1,4 +1,4 @@
-from datetime import datetime
+from datetime import date

 from queries.fireducks import utils

@@ -15,8 +15,8 @@ def query():

 var1 = "MAIL"
 var2 = "SHIP"
-var3 = datetime(1994, 1, 1)
-var4 = datetime(1995, 1, 1)
+var3 = date(1994, 1, 1)
+var4 = date(1995, 1, 1)
 high_priorities = ["1-URGENT", "2-HIGH"]

 q_final = (
6 changes: 3 additions & 3 deletions queries/fireducks/q14.py
@@ -1,4 +1,4 @@
-from datetime import datetime
+from datetime import date

 import pandas as pd

@@ -15,8 +15,8 @@ def query():
 lineitem = utils.get_line_item_ds()
 part = utils.get_part_ds()

-var1 = datetime(1995, 9, 1)
-var2 = datetime(1995, 10, 1)
+var1 = date(1995, 9, 1)
+var2 = date(1995, 10, 1)

 q_final = (
 lineitem.merge(part, left_on="l_partkey", right_on="p_partkey")
6 changes: 3 additions & 3 deletions queries/fireducks/q15.py
@@ -1,4 +1,4 @@
-from datetime import datetime
+from datetime import date

 from queries.fireducks import utils

@@ -13,8 +13,8 @@ def query():
 supplier = utils.get_supplier_ds()
 lineitem = utils.get_line_item_ds()

-var1 = datetime(1996, 1, 1)
-var2 = datetime(1996, 4, 1)
+var1 = date(1996, 1, 1)
+var2 = date(1996, 4, 1)

 revenue = (
 lineitem[(lineitem["l_shipdate"] >= var1) & (lineitem["l_shipdate"] < var2)]
6 changes: 3 additions & 3 deletions queries/fireducks/q20.py
@@ -1,5 +1,5 @@
 from queries.fireducks import utils
-from datetime import datetime
+from datetime import date

 Q_NUM = 20

@@ -18,8 +18,8 @@ def query():
 partsupp = utils.get_part_supp_ds()
 part = utils.get_part_ds()

-var1 = datetime(1994, 1, 1)
-var2 = datetime(1995, 1, 1)
+var1 = date(1994, 1, 1)
+var2 = date(1995, 1, 1)
 var3 = "CANADA"
 var4 = "forest"

4 changes: 2 additions & 2 deletions queries/fireducks/q3.py
@@ -1,4 +1,4 @@
-from datetime import datetime
+from datetime import date

 from queries.fireducks import utils

@@ -17,7 +17,7 @@ def query():
 lineitem = utils.get_line_item_ds()

 var1 = "BUILDING"
-var2 = datetime(1995, 3, 15)
+var2 = date(1995, 3, 15)

 q_final = (
 customer[customer["c_mktsegment"] == var1]
6 changes: 3 additions & 3 deletions queries/fireducks/q4.py
@@ -1,4 +1,4 @@
-from datetime import datetime
+from datetime import date
 from queries.fireducks import utils

 Q_NUM = 4
@@ -13,8 +13,8 @@ def query():
 lineitem = utils.get_line_item_ds()
 orders = utils.get_orders_ds()

-var1 = datetime(1993, 7, 1)
-var2 = datetime(1993, 10, 1)
+var1 = date(1993, 7, 1)
+var2 = date(1993, 10, 1)

 q_final = (
 orders.merge(lineitem, left_on="o_orderkey", right_on="l_orderkey")
6 changes: 3 additions & 3 deletions queries/fireducks/q5.py
@@ -1,4 +1,4 @@
-from datetime import datetime
+from datetime import date
 from queries.fireducks import utils

 Q_NUM = 5
@@ -21,8 +21,8 @@ def query():
 region = utils.get_region_ds()

 var1 = "ASIA"
-var2 = datetime(1994, 1, 1)
-var3 = datetime(1995, 1, 1)
+var2 = date(1994, 1, 1)
+var3 = date(1995, 1, 1)

 q_final = (
 region.merge(nation, left_on="r_regionkey", right_on="n_regionkey")
6 changes: 3 additions & 3 deletions queries/fireducks/q6.py
@@ -1,4 +1,4 @@
-from datetime import datetime
+from datetime import date

 import pandas as pd

@@ -14,8 +14,8 @@ def q():
 def query():
 lineitem = utils.get_line_item_ds()

-var1 = datetime(1994, 1, 1)
-var2 = datetime(1995, 1, 1)
+var1 = date(1994, 1, 1)
+var2 = date(1995, 1, 1)
 var3 = 0.05
 var4 = 0.07
 var5 = 24
6 changes: 3 additions & 3 deletions queries/fireducks/q7.py
@@ -1,4 +1,4 @@
-from datetime import datetime
+from datetime import date

 import pandas as pd

@@ -23,8 +23,8 @@ def query():

 var1 = "FRANCE"
 var2 = "GERMANY"
-var3 = datetime(1995, 1, 1)
-var4 = datetime(1996, 12, 31)
+var3 = date(1995, 1, 1)
+var4 = date(1996, 12, 31)

 n1 = nation[nation["n_name"] == var1]
 n2 = nation[nation["n_name"] == var2]
6 changes: 3 additions & 3 deletions queries/fireducks/q8.py
@@ -1,4 +1,4 @@
-from datetime import datetime
+from datetime import date

 from queries.fireducks import utils

@@ -26,8 +26,8 @@ def query():
 var1 = "BRAZIL"
 var2 = "AMERICA"
 var3 = "ECONOMY ANODIZED STEEL"
-var4 = datetime(1995, 1, 1)
-var5 = datetime(1996, 12, 31)
+var4 = date(1995, 1, 1)
+var5 = date(1996, 12, 31)

 n1 = nation[["n_nationkey", "n_regionkey"]]
 n2 = nation[["n_nationkey", "n_name"]]
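
Every query hunk in this commit swaps a `datetime(...)` literal for a `date(...)` literal. The commit message does not say why; one plausible reading (an assumption, not confirmed by the source) is that the columns produced by the pyarrow preparation step are date-typed rather than timestamp-typed. The sketch below uses only the standard library to show that the two types are not interchangeable in plain Python; how FireDucks or pandas columns handle such comparisons may differ and is not covered here. The helper name `can_order` is hypothetical.

```python
from datetime import date, datetime

# The same calendar day expressed with both types, as in q1.py before and after.
cutoff_as_datetime = datetime(1998, 9, 2)
cutoff_as_date = date(1998, 9, 2)


def can_order(a, b):
    """Return True if `a <= b` is a supported comparison, False if it raises TypeError."""
    try:
        a <= b
        return True
    except TypeError:
        return False


# Ordering a date against a datetime is rejected by the standard library.
mixed_ok = can_order(cutoff_as_date, cutoff_as_datetime)

# Ordering two date objects works as expected.
same_ok = can_order(cutoff_as_date, date(1998, 12, 1))

# A datetime can be reduced to its calendar day explicitly.
normalized = cutoff_as_datetime.date()

print(mixed_ok, same_ok, normalized == cutoff_as_date)
```

Using `date` literals throughout keeps the query parameters in a single type, which avoids relying on any implicit conversion between the two.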
