Add CHYT #224

sesho96 · 2024-09-16T09:30:44Z

This PR adds CHYT (ClickHouse over YTsaurus) results and benchmark scripts for ClickBench.

CLAassistant · 2024-09-16T09:30:51Z

All committers have signed the CLA.

rschu1ze

Thanks for the PR.

It would be great if the benchmark script could be more "automatic", i.e. download the database, configure it, start it, import the data, and run the scripts without user intervention. It will make it much easier for "outsiders" like me to reproduce the results.

chyt/README.md

rschu1ze · 2024-09-17T11:38:44Z

chyt/README.md

@@ -0,0 +1,14 @@
+#### CHYT powered by ClickHouse
+
+1.  Install YTsaurus cluster. Visit [YTsaurus Getting started webpage](https://ytsaurus.tech/docs/en/overview/try-yt)


As far as I understand, the code is open-source, right? Ideally, benchmark.sh does as much of the setup as possible automatically, i.e. without user intervention. For examples of how to do that, please see clickhouse/benchmark.sh, postgresql/benchmark.sh and duckdb/benchmark.sh.

sesho96 · 2024-09-17

Yes, YTsaurus is an open-source system, but CHYT is only a small part of it. benchmark.sh runs the benchmark against a pre-installed cluster with the default clique.
A YTsaurus cluster can be installed, for example, using the k8s operator.
All installation options are described in the documentation.

rschu1ze · 2024-09-17

As I mentioned, we need to reduce the variability here ... As someone who wants to verify the benchmark results, I would like to run benchmark.sh and have it install everything by itself. The only thing that I should have to choose is the hardware the system runs on.

sesho96 · 2024-09-18

We can create a demo cluster for anyone who wants to try YTsaurus. Would it be OK to use it to verify the results?

rschu1ze · 2024-09-23

One installation option is Docker. It seems to be the alternative with the least complexity and the best reproducibility (compared to k8s and the demo cluster).

My preference would be for benchmark.sh to set up the Docker container, do any other preparations, and then run the measurements.

rschu1ze · 2024-09-23

I guess I still don't understand what is really being measured here.

If the results json files in this PR refer to measurements for a locally set-up cluster and different "clique" sizes: In that case, please add deterministic setup instructions (ideally using Docker) to benchmark.sh. Also, the term "serverless" is confusing, as ClickBench uses it for (commercial) database-as-a-service offerings - please remove this term. Please specify the exact machine specs instead (CPU, RAM). Instead of five different measurement sets that were seemingly created using five different "clique" sizes, it would be good to keep it simpler, e.g. two sets of measurements.

If the results json files in this PR refer to measurements for a commercial DBaaS offering with different t-shirt sizes, then please describe the steps needed to set up such a cluster in README.md.

Thanks.

sesho96 · 2024-09-23

Deploying a cluster with 360 vCPUs and 720 GB of RAM using a single VM can be quite challenging. In our case, we use a Kubernetes cluster with c6a.8xlarge nodes and network SSDs that perform similarly to gp2 volumes.
We also aim to demonstrate various cluster sizes, not just the smallest one. If necessary, we can remove the "serverless" tag and instead specify the number of CHYT instances in the configuration.
Given the large size of our cluster, we did not use a Docker deployment for benchmarking, as it may yield different results. The easiest way to reproduce our results is to book a demo cluster through our website.
Additionally, I can include a step-by-step guide in the README.md file to assist with the setup.

rschu1ze · 2024-09-23

Okay, let's add a step-by-step guide to the README. Afterwards I'll try my best to reproduce the results, and then I will merge.

rschu1ze · 2024-09-17T11:41:55Z

chyt/results/yt.192GB_YC.json

+{
+    "system": "CHYT",
+    "date": "2024-09-16",
+    "machine": "192GB",


I am confused. L. 6 says "serverless" which typically means the results were measured in a database-as-a-service offering (such as ClickHouse Cloud). Was that the case?

If not, it would be good to specify the exact machine specs for reproducibility, see e.g. duckdb/results/c5.4xlarge.json.

sesho96 · 2024-09-17

For CHYT with 48, 96 and 192 GB, we use 1, 2 and 4 instances with 12 vCPU and 48 GB RAM each.
For CHYT with 360 and 720 GB, we use 9 and 18 instances with 10 vCPU and 40 GB RAM each.

You only configure the number and size of the instances; YTsaurus then schedules them across the compute nodes of the cluster.
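
The arithmetic behind these configurations can be checked with a short sketch. The Python below is illustrative only: the `CliqueConfig` class and its field names are hypothetical and are not part of CHYT or ClickBench. The instance counts and per-instance sizes are the ones quoted in the comment above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CliqueConfig:
    """Hypothetical description of one benchmarked CHYT configuration."""

    instances: int            # number of CHYT instances in the clique
    vcpu_per_instance: int    # vCPUs assigned to each instance
    ram_gb_per_instance: int  # RAM (GB) assigned to each instance

    @property
    def total_vcpu(self) -> int:
        return self.instances * self.vcpu_per_instance

    @property
    def total_ram_gb(self) -> int:
        return self.instances * self.ram_gb_per_instance


# Instance counts and sizes as quoted in this thread.
CONFIGS = [
    CliqueConfig(1, 12, 48),
    CliqueConfig(2, 12, 48),
    CliqueConfig(4, 12, 48),
    CliqueConfig(9, 10, 40),
    CliqueConfig(18, 10, 40),
]

for cfg in CONFIGS:
    print(
        f"{cfg.instances:>2} instances: "
        f"{cfg.total_vcpu} vCPU, {cfg.total_ram_gb} GB RAM in total"
    )
```

The RAM totals come out to 48, 96, 192, 360 and 720 GB, which matches the sizes used to name the result sets.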

sesho96 added 2 commits September 16, 2024 11:47

Add CHYT

934ae76

Fix readme

ab23002

rschu1ze self-assigned this Sep 16, 2024

rschu1ze reviewed Sep 17, 2024
