
Querying ClickHouse database for fun and profit

Huy Do edited this page Dec 20, 2024 · 10 revisions

What is ClickHouse?

ClickHouse is an open-source, column-oriented relational database. The PyTorch Dev Infra team uses it to store open source data from the PyTorch org, including GitHub events, test stats, benchmark results, and more. The database is hosted on ClickHouse Cloud, which is also where you log in and start querying the data.
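To make "column-oriented" a bit more concrete, here is a toy Python sketch. It is only an illustration of the general idea, not how ClickHouse lays out data on disk, and the table, column names, and values are all made up for this example.

```python
# Toy illustration of row-oriented vs column-oriented layouts.
# The table below is hypothetical and is not a real ClickHouse table.

rows = [
    {"job": "linux-build", "duration_s": 120, "status": "success"},
    {"job": "linux-test", "duration_s": 900, "status": "failure"},
    {"job": "win-build", "duration_s": 300, "status": "success"},
]


def to_columns(rows):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Row layout: every row is visited even though only one field is needed.
total_from_rows = sum(row["duration_s"] for row in rows)

# Column layout: only the single column of interest is scanned.
total_from_columns = sum(columns["duration_s"])

print(total_from_rows, total_from_columns)
```

Both totals are the same. The difference is that the column layout only touches the values the aggregate needs, which is roughly why column stores tend to suit analytical queries over a few columns of a large table.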

Prerequisites

First time login

Skip this part if you already have access to the PyTorch Dev Infra ClickHouse cluster at https://console.clickhouse.cloud/

For metamates, go to https://console.clickhouse.cloud/ and log in with your Meta email. The portal uses SSO, so you just need to follow the steps in your browser to request access. We grant read-only access by default.

Note that the permission takes between half an hour and an hour to propagate, so you can go grab a coffee if you like.

Skim through the data we have

The list of all databases and tables on ClickHouse is at https://github.com/pytorch/test-infra/wiki/Available-databases-on-ClickHouse. If the data you need is not there, please take a look at https://github.com/pytorch/test-infra/wiki/How-to-add-a-new-custom-table-on-ClickHouse and reach out to us (POCs: @clee2000, @huydhn) to chat about your new use case.