Add CHYT #224

Open: wants to merge 2 commits into base: main

sesho96 commented Sep 16, 2024

This PR adds CHYT (ClickHouse over YTsaurus) results and benchmark scripts for ClickBench.

CLAassistant commented Sep 16, 2024

CLA assistant check: All committers have signed the CLA.

rschu1ze self-assigned this Sep 16, 2024

rschu1ze (Member) left a comment:

Thanks for the PR.

It would be great if the benchmark script could be more "automatic", i.e. download the database, configure it, start it, import the data, and run the scripts without user intervention. It will make it much easier for "outsiders" like me to reproduce the results.

chyt/README.md
@@ -0,0 +1,14 @@
#### CHYT powered by ClickHouse

1. Install YTsaurus cluster. Visit [YTsaurus Getting started webpage](https://ytsaurus.tech/docs/en/overview/try-yt)

Member:

As far as I understand, the code is open-source, right? Ideally, the benchmark.sh script does as much setup as possible automatically, i.e. without user intervention. For examples of how to do that, please see clickhouse/benchmark.sh, postgresql/benchmark.sh and duckdb/benchmark.sh.

Author:

Yes, YTsaurus is an open-source system, but CHYT is only a small part of it. benchmark.sh runs the benchmark against a pre-installed cluster with the default clique.
A YTsaurus cluster can be installed, for example, with the k8s operator.
All installation options are described in the documentation.

Member:

As I mentioned, we need to reduce the variability here. As someone who wants to verify the benchmark results, I would like to run benchmark.sh and have it install everything by itself. The only thing that I should be able to choose is the hardware the system runs on.

Author:

We can create a demo cluster for anyone who wants to try YTsaurus. Would it be OK to verify the results on such a cluster?

Member:

One installation option is Docker. That seems to be the alternative with the least complexity and the best reproducibility (compared to k8s and the demo cluster).

My preference would be for benchmark.sh to set up the Docker container, do any other preparations, and then run the measurements.

Member:

I guess I still don't understand what is really being measured here.

If the results JSON files in this PR refer to measurements for a locally set-up cluster with different "clique" sizes: in that case, please add deterministic setup instructions (ideally using Docker) to benchmark.sh. Also, the term "serverless" is confusing because ClickBench uses it for (commercial) database-as-a-service offerings, so please remove this term and specify the exact machine specs instead (CPU, RAM). Instead of five different measurement sets that were seemingly created using five different "clique" sizes, it would be good to keep it simpler, e.g. two sets of measurements.

If the results JSON files in this PR refer to measurements for a commercial DBaaS offering with different t-shirt sizes, then please describe the steps needed to set up such a cluster in README.md.

Thanks.

Author:

Deploying a cluster with 360 vCPUs and 720 GB of RAM using a single VM can be quite challenging. In our case, we use a Kubernetes cluster with nodes of the type c6a.8xlarge and network SSDs that perform similarly to gp2 volumes.
We also aim to demonstrate various cluster sizes, not just the smallest one. If necessary, we can remove the "serverless" tag and instead specify the number of CHYT instances in the configuration.
Given the large size of our cluster, Docker deployment was not utilized for benchmarking, as it may yield different results. The easiest way to reproduce our results is by booking a demo cluster through our website.
Additionally, I can include a step-by-step guide in the README.md file to assist with the setup.

Member:

Okay, let's add a step-by-step guide to the README. Afterwards I'll try my best to reproduce the results, and then I will merge.

{
"system": "CHYT",
"date": "2024-09-16",
"machine": "192GB",

Member:

I am confused. L. 6 says "serverless" which typically means the results were measured in a database-as-a-service offering (such as ClickHouse Cloud). Was that the case?

If not, it would be good to specify the exact machine specs for reproducibility, see e.g. duckdb/results/c5.4xlarge.json.

Author:

For CHYT with 48, 96 and 192 GB, we use 1, 2 and 4 instances with 12 vCPU and 48 GB RAM each.
For CHYT with 360 and 720 GB, we use 9 and 18 instances with 10 vCPU and 40 GB RAM each.

You only configure the number and size of instances; YTsaurus then schedules them across the compute nodes of the cluster.
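
The totals behind each results file follow directly from the instance counts and sizes listed in this comment. A minimal sketch of that arithmetic, assuming all instances within one configuration are identical; the `CliqueConfig` class and the label format are illustrative only and are not part of CHYT or ClickBench:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CliqueConfig:
    """One benchmarked configuration: identical instances of a given size."""
    instances: int
    vcpu_per_instance: int
    ram_gb_per_instance: int

    @property
    def total_vcpu(self) -> int:
        # Total vCPUs across all instances of this configuration.
        return self.instances * self.vcpu_per_instance

    @property
    def total_ram_gb(self) -> int:
        # Total RAM in GB across all instances of this configuration.
        return self.instances * self.ram_gb_per_instance

    def machine_label(self) -> str:
        # Hypothetical label format, chosen here for readability only.
        return (
            f"{self.instances} x "
            f"({self.vcpu_per_instance} vCPU, {self.ram_gb_per_instance} GB RAM)"
        )


# Instance counts and sizes as stated in the comment above.
CONFIGS = [
    CliqueConfig(1, 12, 48),
    CliqueConfig(2, 12, 48),
    CliqueConfig(4, 12, 48),
    CliqueConfig(9, 10, 40),
    CliqueConfig(18, 10, 40),
]

if __name__ == "__main__":
    for cfg in CONFIGS:
        print(
            f"{cfg.machine_label()}: "
            f"{cfg.total_vcpu} vCPU, {cfg.total_ram_gb} GB RAM in total"
        )
```

With these inputs, the RAM totals come out to 48, 96, 192, 360 and 720 GB, matching the names of the five result sets, and the vCPU totals come out to 12, 24, 48, 90 and 180.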
