Index and LA #130

Open
evdevk opened this issue May 10, 2023 · 8 comments

Comments

@evdevk

evdevk commented May 10, 2023

I have a replicated cluster (3 nodes: 54 CPU, 256 GB each). All nodes are replicas. Inserts mostly go to one node; selects go to the other two.
Every night at ~00:00 something generates a huge load (LA) on one of the nodes used for selects.
I think this is the indexing from carbon-clickhouse, but I am not sure.
Can you provide some info on how cache-ttl and the index work? There is no info about this in the readme. How does it work, and what is it for?

Also, can you give me some tips to tune my config:

[data]
path = "/var/spool/carbon-tagged/"
chunk-interval = "10s"
chunk-auto-interval = ""
compression = "lz4"
compression-level = 0

[upload.graphite]
type = "points"
table = "data.data"
threads = 5
url = "http://localhost:8124/"
timeout = "1m0s"
zero-timestamp = true
compress-data = true

[upload.tags]
type = "tagged"
table = "data.tags"
threads = 6
url = "http://localhost:8124/"
timeout = "2m0s"
cache-ttl = "48h0m0s"
compress-data = true
disable-daily-index = true

[upload.graphite_index]
type = "index"
table = "data.graph_index"
threads = 3
url = "http://localhost:8124/"
timeout = "1m0s"
cache-ttl = "48h0m0s"
compress-data = true
disable-daily-index = true

Related to #91.

@Felixoid
Collaborator

Felixoid commented May 10, 2023

I suggest enabling https://clickhouse.com/docs/en/operations/system-tables/query_log and then investigating what causes the mentioned issue. It will be quite easy after that.
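
Something along these lines should surface it (a sketch only; it assumes query_log is already enabled, and the exact column set depends on your ClickHouse version):

-- top 20 heaviest finished queries in the first hour of today (query_log must be enabled)
SELECT
    query_start_time,
    query_duration_ms,
    written_bytes,
    read_bytes,
    http_user_agent,
    substring(query, 1, 120) AS query_head
FROM system.query_log
WHERE type = 'QueryFinish'
  AND event_time >= today()
  AND event_time <  today() + INTERVAL 1 HOUR
ORDER BY query_duration_ms DESC
LIMIT 20;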

cache-ttl, AFAIR, is an internal thing: while a metric is cached, it is not re-inserted into the tags and index tables.

I rather suspect that some of your users are requesting a huge number of metrics from graphite than that it's somehow related to carbon.

@evdevk
Author

evdevk commented May 12, 2023

I tried stopping graphite-clickhouse at 23:50 (keeping carbon-clickhouse running) and still got the same problem at ~00:01.
I have no cron jobs on ClickHouse. The Graphite rollup covers a longer period than the data lives (rollup is 3 days, data TTL is 2 days), so no retentions should be active.
So the next experiment will be stopping carbon-clickhouse.

What does the index table actually exist for in this software? Can I disable it if I am not using Grafana in my setup?

@msaf1980
Collaborator

msaf1980 commented May 12, 2023

At the start of each new day carbon-clickhouse generates an insert. It's not really needed and might be refactored in the future. An expired cache TTL also produces a new insert.
So, the no-daily index only saves disk space, not insert rate. It also produces high CPU usage until the parts are re-merged (which is costly on huge index/tags tables).

What is the size/record count of your index/tags tables, and how many unique metrics go into the daily index/tags?
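
If it helps, a sketch to pull those numbers (the database/table names are taken from your config above, and the Date/Path columns are the usual carbon-clickhouse index schema, so adjust if yours differs):

-- on-disk size, row count and active part count of the index/tags tables
SELECT
    table,
    sum(rows)                              AS rows,
    formatReadableSize(sum(bytes_on_disk)) AS size_on_disk,
    count()                                AS active_parts
FROM system.parts
WHERE active AND database = 'data' AND table IN ('graph_index', 'tags')
GROUP BY table;

-- unique metric paths written into today's daily index
SELECT uniqExact(Path) FROM data.graph_index WHERE Date = today();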

@msaf1980
Collaborator

Also, ClickHouse tries to use all available cores for background processes (like merges), so restricting them to a smaller value (than 54) may be a solution.
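
The relevant knob is, as far as I know, the background_pool_size server setting. To see what the merge threads are busy with at that moment, a quick check (sketch):

-- merges currently running in the background
SELECT database, table, elapsed, progress, num_parts,
       formatReadableSize(total_size_bytes_compressed) AS size
FROM system.merges;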

@evdevk
Author

evdevk commented May 15, 2023

I stopped carbon-clickhouse 5 minutes before 00:00 and there were zero load spikes. So this is definitely not graphite-clickhouse.
I will try looking into the ClickHouse background processes and the carbon-clickhouse daily index. Thanks @msaf1980

@mikezsin

At 00:00 it generates heavy inserts, around 4-5x larger than normal:

┌─query_duration_ms─┬─written_bytes─┬─http_user_agent────┬────query_start_time─┬─query─────────────────────────────────────────────────────────────────────────────┐
│ 58927 │ 1412361385 │ Go-http-client/1.1 │ 2024-04-30 00:04:59 │ INSERT INTO graphite.graphite_index (Date, Level, Path, Version) FORMAT RowBinary │
│ 59455 │ 1328363077 │ Go-http-client/1.1 │ 2024-04-30 00:05:01 │ INSERT INTO graphite.graphite_index (Date, Level, Path, Version) FORMAT RowBinary │
│ 59805 │ 1497485418 │ Go-http-client/1.1 │ 2024-04-30 00:05:03 │ INSERT INTO graphite.graphite_index (Date, Level, Path, Version) FORMAT RowBinary │
│ 63204 │ 1412220635 │ Go-http-client/1.1 │ 2024-04-30 00:05:00 │ INSERT INTO graphite.graphite_index (Date, Level, Path, Version) FORMAT RowBinary │
│ 60733 │ 1242829805 │ Go-http-client/1.1 │ 2024-04-30 00:05:06 │ INSERT INTO graphite.graphite_index (Date, Level, Path, Version) FORMAT RowBinary │
│ 59462 │ 1496872108 │ Go-http-client/1.1 │ 2024-04-30 00:05:59 │ INSERT INTO graphite.graphite_index (Date, Level, Path, Version) FORMAT RowBinary │
│ 58941 │ 1412365747 │ Go-http-client/1.1 │ 2024-04-30 00:06:01 │ INSERT INTO graphite.graphite_index (Date, Level, Path, Version) FORMAT RowBinary │
│ 60223 │ 1497073175 │ Go-http-client/1.1 │ 2024-04-30 00:06:02 │ INSERT INTO graphite.graphite_index (Date, Level, Path, Version) FORMAT RowBinary │
│ 60893 │ 1328369735 │ Go-http-client/1.1 │ 2024-04-30 00:06:03 │ INSERT INTO graphite.graphite_index (Date, Level, Path, Version) FORMAT RowBinary │

@spoofedpacket

I've also been observing this recently. We see a large spike in inserts which can last around an hour:
[screenshot: insert rate spike lasting ~1 hour]
We hit the "too many parts" cap in ClickHouse, and carbon-clickhouse queues metrics until it recovers:
[2024-06-17T14:47:16.153Z] ERROR [upload] handle failed {"name": "graphite", "filename": "/var/lib/carbon-clickhouse/graphite/default.1718634714840113874.lz4", "metrics": 1027490, "error": "clickhouse response status 500: Code: 252. DB::Exception: Too many parts (305). Merges are processing significantly slower than inserts. (TOO_MANY_PARTS) (version 22.8.15.25.altinitystable (altinity build))\n", "time": 1.125482143
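
For reference, part counts can be watched per partition (a sketch; if I understand it right, the parts_to_throw_insert limit behind TOO_MANY_PARTS applies per partition):

-- active part count per partition, highest first
SELECT database, table, partition, count() AS parts
FROM system.parts
WHERE active
GROUP BY database, table, partition
ORDER BY parts DESC
LIMIT 10;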

@mikezsin

I found out why the LA spikes: they are caused by rollup aggregation rules being applied to too many metrics.
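
In case it helps anyone else, the configured rollup patterns can be listed from ClickHouse itself via the system.graphite_retentions table (a sketch):

-- rollup/retention patterns from the graphite_rollup config and their priorities
SELECT config_name, regexp, function, age, precision, priority
FROM system.graphite_retentions
ORDER BY config_name, priority, age;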
