Excessive resource usage from clickhouse #301

Closed
RealOrangeOne opened this issue Aug 25, 2020 · 44 comments
Labels
bug Something isn't working

Comments

@RealOrangeOne
Contributor

Bug report

Describe the bug

Clickhouse is using a crazy large amount of resources, even when Plausible is getting little to no traffic.

Looking at /etc/clickhouse-server/clickhouse-server.log (crazy large file BTW), there are lots of logs stating it's querying, when nothing is happening:

<Debug> executeQuery: (from xxx.xxx.xxx.xxx:xxx) SELECT 1 FORMAT JSONCompact

These are run once every 2 seconds or so, for each connection in the pool (10 by default).

If I stop plausible, the usage goes down to nothing, so it's definitely something coming from plausible itself causing the usage, and I suspect it's these strange queries.

Expected behavior

Clickhouse usage only jumps up when it's actually doing work, and likely not to this level.

Screenshots (If applicable)

N/A

Environment (If applicable):

  • OS: Docker on Ubuntu
  • Browser: N/A
  • Version: N/A
@RealOrangeOne RealOrangeOne added the bug Something isn't working label Aug 25, 2020
@tckb
Contributor

tckb commented Aug 25, 2020

@RealOrangeOne this is expected. I suggest you rotate the log file and reduce the log level to warn or error (I wouldn't recommend anything more verbose than warn in a prod environment) to prevent the disk from filling up. The query that you see is from the health check, see https://github.com/plausible/analytics/blob/master/lib/plausible_web/controllers/api/external_controller.ex#L38

@RealOrangeOne
Contributor Author

The log file is capped at 1000M anyway, so it's not the end of the world. But yes, I'm changing the log level and mounting the log location on a tmpfs to help with that.

Changing that doesn't change the resource usage though, as I'd kinda expect. Even if I make the application completely unroutable, so the healthcheck endpoint isn't being called, usage is unchanged.

On a side note, wouldn't running that query in the healthcheck only result in it being run once, rather than for each connection in the pool?

@ukutaht
Contributor

ukutaht commented Aug 26, 2020

We don't control Clickhouse's resource consumption. I'll keep this issue open for a while for discussion but unless we identify an issue related to our codebase, I will close it.

On a side note, wouldn't running that query in the healthcheck only result in it being run once, rather than for each connection in the pool?

Yeah, the healthcheck should only run once and only when you actually hit the /api/health endpoint.

I'm not sure what's going on with your setup but my 2 cents is that when I run Clickhouse with Docker on my dev machine, I tend to see excessive resource usage as well. However, in our deployment I've installed it on Ubuntu directly with no Docker and it runs much better.

So my (very uninformed) guess is that CH just doesn't run very efficiently on Docker. Maybe the authors over at https://github.com/ClickHouse/ClickHouse/ can shed light on this

@RealOrangeOne
Contributor Author

Yeah, the healthcheck should only run once and only when you actually hit the /api/health endpoint.

My thought too, which I think means there is a bug here in plausible, even if it's not the entire story of this issue?

when I run Clickhouse with Docker on my dev machine, I tend to see excessive resource usage as well

Well it's comforting to know it's not just me!

I'll try looking at either an alternative clickhouse container, or running it on the host OS and see if that makes a noticeable difference. Usage looks fine when plausible is turned off, so it's definitely query overhead rather than simply idle usage.

@RealOrangeOne
Contributor Author

After doing a load of research, I found some issues in the default container configuration which make Clickhouse super inefficient. They're also there in the Debian package, so not sure why it's so much better on Plausible's production environment, but might just be hidden by regular user traffic 🤷.

Either way I'm pretty confident I've fixed the glaring issues with Clickhouse, although I do think it'd be worth looking into the multiple queries issue, even if just to check. But definitely not super urgent!

@ukutaht BTW if your local setup is using excessive resources, I wrote up my findings, which may be of interest to you! https://theorangeone.net/posts/calming-down-clickhouse/

@ukutaht
Contributor

ukutaht commented Sep 9, 2020

@RealOrangeOne Thanks! I only got around to reading it now. I will definitely refer back to your post when I look at our production setup next time.

I think I've also figured out what's causing these SELECT 1 calls you're seeing. The database library we're using (https://github.com/clickhouse-elixir/clickhousex) uses DBConnection which itself has a mechanism to ping the database every 1000ms if the connection is idle. Check the idle_interval connection parameter here: https://hexdocs.pm/db_connection/DBConnection.html#start_link/2

I'm not sure how much sense it makes to ping every second. Could also do every 5 seconds, 10?
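
To put rough numbers on the interval question, here is a small back-of-the-envelope sketch. It assumes the pool of 10 connections mentioned earlier in the thread and that every connection stays idle all day, so it is an upper bound rather than a measurement. The function name `idle_pings_per_day` is made up for illustration and is not part of DBConnection.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def idle_pings_per_day(pool_size: int, idle_interval_ms: int) -> int:
    """Upper bound on idle pings per day, assuming every pooled connection stays idle."""
    pings_per_connection = SECONDS_PER_DAY * 1000 // idle_interval_ms
    return pool_size * pings_per_connection


# Compare the intervals discussed above: every 1, 5 and 10 seconds.
for interval_ms in (1000, 5000, 10000):
    print(interval_ms, idle_pings_per_day(pool_size=10, idle_interval_ms=interval_ms))
```

Under those assumptions, going from 1 second to 10 seconds cuts the idle pings by a factor of ten. Whether that has a noticeable effect on Clickhouse's resource usage is a separate question that this arithmetic can't answer.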

I intend to try out a different database library (https://github.com/CatTheMagician/pillar). The main reason is that Pillar has support for migrations. I don't know the internals of that library but it might not ping automatically.

If I end up sticking with the current library, we can default to a longer ping interval for database connections to reduce idle resource usage.

@RealOrangeOne
Contributor Author

If the ping happens once per connection in the pool, that'd definitely do it! I guess if it's in the library there's not much we can do about it, besides make it a huge amount of time and rely on the healthcheck endpoint. Or making it user configurable depending on the scale of the deployed instance (or how much the user cares about the ping)

Tabling this until you've migrated to the alternative library is probably a good idea, as you say that might ping in a different way, or not ping at all (background pinging is a weird thing to see having worked with other DB libraries).

Migrations are definitely a thing worth having!

@ukutaht
Contributor

ukutaht commented Sep 28, 2020

Couple of updates:

  1. I decided to stick with the current CH connection library. I'm thinking we should probably increase the default ping interval to reduce load generated from excessive pinging.

  2. Saturday night, the Clickhouse instance backing our cloud instance crashed with an out-of-memory error. I found the underlying issue: Suspected memory leak ClickHouse/ClickHouse#7207

Basically it seems that Clickhouse needs special configuration when running with <32GB memory. I suspect most self-hosters are running on something like 1-4GB memory. So it seems that we should add a Clickhouse configuration file to our self-hosting documentation. It would configure CH for low-memory environments and also reduce logging as per your blog post @RealOrangeOne

@RealOrangeOne
Contributor Author

RealOrangeOne commented Sep 28, 2020

  1. Decreasing the ping frequency sounds great!

  2. wow that's unfortunate!

I know some self-hosters may be running on rigs with more than 32GB, but even then they probably don't want Clickhouse using all of it!

Shipping a basic config is probably a good idea, although we'd obviously want it to be small and simple enough that it doesn't need constant maintenance. Happy to test any config changes you have!

@ukutaht
Contributor

ukutaht commented Oct 2, 2020

@RealOrangeOne did you edit the configuration files directly in the running container? Or did you mount some extra configuration files to be merged with the default config?

I've been trying to figure out how to keep using the default CH configuration and apply your overrides. I want to add the overrides to our hosting template.

After 1.5h I give up, I can't get the config to load properly. Couldn't manage to figure out the interactions between how clickhouse loads configuration and the details of how docker mounts single files (not directories).

@RealOrangeOne
Contributor Author

RealOrangeOne commented Oct 2, 2020

@ukutaht I mounted some additional config files into the container, so they'd persist between restarts.

Sorry that's not clear from the article.

You can take a look at the compose file I'm using for my Plausible instance here: https://github.com/RealOrangeOne/infrastructure/tree/master/ansible/roles/plausible/files. Note that the files are copied into the right places by some external ansible scripts, but specifically take a look at the volumes: for the clickhouse container, and you'll see how the mounting works. I forget where I found the documentation on this, but it does work.

Let me know if that works 😄

@ukutaht
Contributor

ukutaht commented Oct 2, 2020

Thanks! This is what I get:

WARNING: Found orphan containers (hosting_geoip_1) for this project. If you removed or renamed this service in your compose file, you can run this command with the --remove-orphans flag to clean it up.
hosting_plausible_db_1 is up-to-date
Starting hosting_plausible_events_db_1 ...
Starting hosting_plausible_events_db_1 ... error

ERROR: for hosting_plausible_events_db_1  Cannot start service plausible_events_db: error while creating mount source path '/host_mnt/Users/ukutaht/dev/plausible/hosting/clickhouse/config.xml': mkdir /host_mnt/Users/ukutaht/dev/plausible/hosting/clickhouse/config.xml: file exists

ERROR: for plausible_events_db  Cannot start service plausible_events_db: error while creating mount source path '/host_mnt/Users/ukutaht/dev/plausible/hosting/clickhouse/config.xml': mkdir /host_mnt/Users/ukutaht/dev/plausible/hosting/clickhouse/config.xml: file exists
ERROR: Encountered errors while bringing up the project.

@RealOrangeOne
Contributor Author

That's very strange. If you can share your compose file I'm happy to give it a try locally and debug, but yeah, it looks like the mounts are the issue. Mine are just bare files mounted directly into the container, read-only just in case:

volumes:
      - /mnt/tank/dbs/clickhouse/plausible:/var/lib/clickhouse
      - /mnt/tank/dbs/clickhouse/docker_related_config.xml:/etc/clickhouse-server/config.d/docker_related_config.xml:ro
      - /mnt/tank/dbs/clickhouse/docker_related_user_config.xml:/etc/clickhouse-server/users.d/docker_related_user_config.xml:ro

@bokub

bokub commented Oct 16, 2020

Basically it seems that Clickhouse needs special configuration when running with <32GB memory

Woah, that's huge!

That would explain why my micro server (550MB of RAM, 1 vCPU) has been struggling since I switched to Plausible... My server is regularly down, I often get 500 errors when trying to access my dashboards, and the average response time for the tracking script is above 1000ms...

From what I understand, it's not possible to use Plausible without Clickhouse, right?

@ukutaht
Contributor

ukutaht commented Oct 16, 2020

Yeah, although I've been able to run Plausible OK on the smallest digital ocean droplet (1GB mem). Make sure you're not having problems because of this issue: plausible/community-edition#4

From what I understand, it's not possible to use Plausible without Clickhouse, right?

At the moment Clickhouse is required

@malinowskip

Hi, I installed Plausible today and have been running into this issue on a 1GB Digital Ocean droplet (CentOS) running behind Traefik.

Things work fine for a while (up to several hours), but errors like this one are eventually thrown when I try to refresh the dashboard:

Server: analytics.mysite.com:80 (http)
plausible_1            | Request: GET /api/stats/mysite.com/sources?period=30d&date=2020-10-19&from=undefined&to=undefined&filters=%7B%22goal%22%3Anull%2C%22source%22%3Anull%2C%22utm_medium%22%3Anull%2C%22utm_source%22%3Anull%2C%22utm_campaign%22%3Anull%2C%22referrer%22%3Anull%2C%22screen%22%3Anull%2C%22browser%22%3Anull%2C%22os%22%3Anull%2C%22country%22%3Anull%2C%22page%22%3Anull%7D&show_noref=false
plausible_1            | ** (exit) an exception was raised:
plausible_1            |     ** (DBConnection.ConnectionError) connection not available and request was dropped from queue after 2330ms. This means requests are coming in and your connection pool cannot serve them fast enough. You can address this by:
plausible_1            |
plausible_1            |   1. By tracking down slow queries and making sure they are running fast enough
plausible_1            |   2. Increasing the pool_size (albeit it increases resource consumption)
plausible_1            |   3. Allow requests to wait longer by increasing :queue_target and :queue_interval
plausible_1            |
plausible_1            | See DBConnection.start_link/2 for more information

Eventually, this is logged: hosting_plausible_events_db_1 exited with code 137 and the analytics dashboard throws a 500 error.

@malinowskip

Here are some more logs leading up to the crash:

plausible_1            | 23:04:49.914 [error] Clickhousex.Protocol (#PID<0.4251.0>) failed to connect: ** (ErlangError) Erlang error: :econnrefused
plausible_1            | 23:04:49.914 [error] Clickhousex.Protocol (#PID<0.4254.0>) failed to connect: ** (ErlangError) Erlang error: :econnrefused
plausible_1            | 23:04:49.914 [error] Clickhousex.Protocol (#PID<0.4255.0>) failed to connect: ** (ErlangError) Erlang error: :econnrefused
plausible_1            | 23:04:49.916 [error] Clickhousex.Protocol (#PID<0.4245.0>) failed to connect: ** (ErlangError) Erlang error: :econnrefused
hosting_plausible_events_db_1 exited with code 137

I logged memory usage at 1-second intervals. When memory is low, the analytics dashboard struggles to refresh. The final crash takes place around line 930.

@RealOrangeOne
Contributor Author

Unless you're running the additional settings, this is likely related to the fact Clickhouse requires a lot of memory. See linked thread in #301 (comment).

The machine I run Plausible on does have far more RAM than 1GB, but I've not seen the excessive memory usage. It's still a bit of a pain to run, though.

@ukutaht
Contributor

ukutaht commented Oct 19, 2020

@malinowskip What kind of traffic are you getting on that instance? Just curious because I'm also running a test instance on DO smallest droplet and it works fine for our landing page: https://testing.plausible.io/plausible.io/.

It's probably a matter of time until it starts running into memory issues though. At some point I will take another crack at providing a sample config for Clickhouse in limited memory environments.

@malinowskip

malinowskip commented Oct 19, 2020

What kind of traffic are you getting on that instance?

I'm getting zero traffic, unless I visit the site myself. It looks like Clickhouse increases memory usage and frees it at regular intervals. My memory logs show that while I'm running Plausible, available memory regularly goes from around 140MB to around 30MB – back and forth – and it doesn't seem to be traffic-related.

@stevelacey

stevelacey commented Nov 8, 2020

I didn't look into this more than attempting to set memory limits, failing, and shutting the thing down, but resource usage quickly killed my server too. The memory limits for the containers appear to default to 2gb each, and clickhouse made a quick meal of my low traffic 2gb droplet with 2gb of swap.

@RealOrangeOne
Contributor Author

RealOrangeOne commented Nov 12, 2020

I've done some more playing with clickhouse, after noticing it was still writing log tables to disk, and finally removed them all! 🎉 I've added the details to my article linked above, and I'll try to turn it into a PR soon!

Since adding all this, I've not had any issues with memory, and I've never had a leak or crash due to it. It's possible it's all related, which would be good!

@ukutaht
Contributor

ukutaht commented Nov 13, 2020

awesome @RealOrangeOne. Would love a PR for this :)

@RealOrangeOne
Contributor Author

I've created a pull request to add my configuration to the hosting repo: plausible/community-edition#13

I think that fix mostly alleviates the resource issues for clickhouse. It'd be great if people in this thread could try it out and see if it helps them (comment on PR rather than here). I'd also be interested in knowing how it impacts resources on the production plausible.io instance, as it could make quite a difference there!

There were some other issues mentioned here, mostly around pooling and doing aliveness checks too often. I think those should still be fixed, but they'll have less impact.

@jonathan-s

Tangentially related: it may be that Plausible has chosen to continue with clickhouse, in which case my comment is moot. But there are postgres extensions that implement columnar stores. One such example is https://pgxn.org/dist/cstore_fdw/

@ptman

ptman commented Nov 16, 2020

@jonathan-s #377

@ukutaht
Contributor

ukutaht commented Dec 9, 2020

Fixed in plausible/community-edition#13

Sorry people, no plans to switch from Clickhouse. It is purpose-built for web analytics and it's been proven at ridiculous scale at Yandex.Metrica. Column-based postgresql would have acceptable query performance but it wouldn't replicate some features we rely on with Clickhouse, e.g. the CollapsingMergeTree table engine for real-time session analytics.

@ukutaht ukutaht closed this as completed Dec 9, 2020
@bokub

bokub commented Dec 9, 2020

If anyone is interested, I have switched to umami, which is (IMHO) a better option if you want self-hosted analytics on a small server.

It's quite similar to Plausible, only lighter and easier to set up, while Plausible is a better fit if you're looking for performance on heavy-traffic websites.

I hope my comment is not offending anyone, I'm just trying to help people having the same problem as I encountered 😬

@mustafamizrak

mustafamizrak commented Aug 17, 2021

I am also getting the same error with docker-compose (with external postgresql configuration)

Request: GET /sites,
21:27:12.845 [error] #PID<0.2767.0> running PlausibleWeb.Endpoint (connection #PID<0.2766.0>, stream id 1) terminated,21:27:12.845 request_id=Fpw1LHM0WhK1l1UAAAAB [warn] Failed to send Sentry event. Cannot send Sentry event because of invalid DSN,
Server: 192.***.***.***:port (http),
** (exit) an exception was raised:,
    ** (DBConnection.ConnectionError) connection not available and request was dropped from queue after 2583ms. This means requests are coming in and your connection pool cannot serve them fast enough. You can address this by:,
  1. By tracking down slow queries and making sure they are running fast enough,
,
  2. Increasing the pool_size (albeit it increases resource consumption),
  3. Allow requests to wait longer by increasing :queue_target and :queue_interval,
See DBConnection.start_link/2 for more information,
        (clickhouse_ecto 0.2.8) lib/clickhouse_ecto/connection.ex:45: ClickhouseEcto.Connection.prepare_execute/5,
        (ecto_sql 3.5.3) lib/ecto/adapters/sql.ex:692: Ecto.Adapters.SQL.execute!/4,
        (ecto_sql 3.5.3) lib/ecto/adapters/sql.ex:684: Ecto.Adapters.SQL.execute/5,
        (ecto 3.5.5) lib/ecto/repo/queryable.ex:229: Ecto.Repo.Queryable.execute/4,
        (ecto 3.5.5) lib/ecto/repo/queryable.ex:17: Ecto.Repo.Queryable.all/3,
        (plausible 0.0.1) lib/plausible_web/controllers/site_controller.ex:20: PlausibleWeb.SiteController.index/2,
        (plausible 0.0.1) lib/plausible_web/controllers/site_controller.ex:1: PlausibleWeb.SiteController.action/2,
        (plausible 0.0.1) lib/plausible/stats/clickhouse.ex:879: Plausible.Stats.Clickhouse.last_24h_visitors/1,
21:27:20.859 [error] Clickhousex.Protocol (#PID<0.2562.0>) failed to connect: ** (ErlangError) Erlang error: :nxdomain,
21:27:20.859 [error] Clickhousex.Protocol (#PID<0.2575.0>) failed to connect: ** (ErlangError) Erlang error: :nxdomain,
21:27:20.859 [error] Clickhousex.Protocol (#PID<0.2563.0>) failed to connect: ** (ErlangError) Erlang error: :nxdomain,
21:27:20.859 [error] Clickhousex.Protocol (#PID<0.2561.0>) failed to connect: ** (ErlangError) Erlang error: :nxdomain,
21:27:20.859 [error] Clickhousex.Protocol (#PID<0.2560.0>) failed to connect: ** (ErlangError) Erlang error: :nxdomain,
21:27:20.859 [error] Clickhousex.Protocol (#PID<0.2567.0>) failed to connect: ** (ErlangError) Erlang error: :nxdomain,
21:27:20.860 [error] Clickhousex.Protocol (#PID<0.2572.0>) failed to connect: ** (ErlangError) Erlang error: :nxdomain,
21:27:20.860 [error] Clickhousex.Protocol (#PID<0.2564.0>) failed to connect: ** (ErlangError) Erlang error: :nxdomain,
21:27:20.860 [error] Clickhousex.Protocol (#PID<0.2569.0>) failed to connect: ** (ErlangError) Erlang error: :nxdomain
...

Any ideas?

@RealOrangeOne
Contributor Author

Given it's performance related, I'd say definitely make sure you're running the latest version of Plausible (and perhaps update Clickhouse, too), and then apply the config changes to Clickhouse I mentioned above.

@MurrayGroves

I'm running the latest Plausible-CE (2.1.4) and seeing high CPU usage (~5% on a quad core i5-7600k, 3h13 CPU time on 50h server uptime). Memory usage is acceptable but still kind of high (500MB). I have approximately zero traffic going, so I'm not sure why this would be the case. I reconfigured health checks to be every 5 minutes so it's not that.
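
As a rough cross-check of those figures, cumulative CPU time can be turned into an average share of the machine. This sketch assumes the 3h13 is cumulative CPU time for a single process over the 50h of uptime; the comment doesn't say which process it refers to, and the helper name `average_cpu_share` is made up for illustration.

```python
def average_cpu_share(cpu_seconds: float, wall_seconds: float, cores: int = 1) -> float:
    """Average fraction of the available CPU capacity used over the wall-clock period."""
    return cpu_seconds / (wall_seconds * cores)


cpu_seconds = 3 * 3600 + 13 * 60   # 3h13 of CPU time
wall_seconds = 50 * 3600           # 50h of server uptime

one_core = average_cpu_share(cpu_seconds, wall_seconds)
all_cores = average_cpu_share(cpu_seconds, wall_seconds, cores=4)
print(f"{one_core:.1%} of one core, {all_cores:.1%} of four cores")
```

With these inputs that comes to about 6.4% of one core, or about 1.6% of the quad-core machine as a whole, so whether it matches the "~5%" reading depends on how the monitoring tool normalises across cores.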

@ruslandoga
Contributor

👋 @MurrayGroves

Would you be able to share docker stats --no-stream?

@MurrayGroves

👋 @MurrayGroves

Would you be able to share docker stats --no-stream?

@ruslandoga here you go, with other non-Plausible services removed of course:

CONTAINER ID   NAME                                              CPU %     MEM USAGE / LIMIT     MEM %     NET I/O           BLOCK I/O         PIDS
e772b8e64264   plausible-ce-plausible-1                          1.22%     23.36MiB / 31.29GiB   0.07%     1.53GB / 1.34GB   181MB / 361MB     30
615e1743ee57   plausible-ce-plausible_db-1                       0.07%     35.7MiB / 31.29GiB    0.11%     269MB / 159MB     213MB / 335MB     17
c84e922c37dc   plausible-ce-plausible_events_db-1                7.74%     541.2MiB / 31.29GiB   1.69%     1.08GB / 1.37GB   1.61GB / 58.4GB   724

@ruslandoga
Contributor

ruslandoga commented Oct 25, 2024

plausible container memory usage seems low, is it responding to queries and rendering dashboards fine? The Erlang VM and all the code by itself usually take up to 80MB, and then there are also geolocation databases and some caches.

But other than that, it looks normal and ClickHouse is ... doing its thing :)

Some of these discussions might be of use if you want to make ClickHouse leaner:

@MurrayGroves

@ruslandoga seems to be working with that memory usage. Just checked now and it was at 30MB, loaded up the dashboard and it went up to 60MB. I've tried the suggestions in those threads but it doesn't seem to make much of a difference unfortunately. Out of interest, what is Plausible doing in the background that takes up 1.5-2% CPU? Maybe that includes DB queries which are causing the Clickhouse usage.

@ruslandoga
Contributor

ruslandoga commented Oct 28, 2024

I doubt there is much correlation between Plausible and ClickHouse background activities.

Out of interest, what is Plausible doing in the background that takes up 1.5-2% CPU?

Probably some Erlang stuff. I don't know if it still does it, but it used to spin CPU even when idle to get lower latencies when the actual work comes in. And also Plausible has quite a few dependencies now, some of which have background processes running: HTTP clients maintain TCP connection pools, geolocation library maintains and updates MMDB databases, telemetry collects telemetry, etc.

Maybe that includes DB queries which are causing the Clickhouse usage.

The only thing I can think of (other than emailing reports in background jobs) are connection health checks, and those use GET /ping and not real queries. So I don't think they create any real work for ClickHouse.

@MurrayGroves

Oh well, that's a shame. Guess I'll have to find something with lower idle resource usage.

@Mubelotix

Clickhouse wrecks my CPU, claiming 15% usage permanently even at night when all my users are asleep

@ruslandoga
Contributor

👋 @Mubelotix

Thank you for the information! I think we can try configuring ClickHouse for low resource usage in CE.

@ikus060

ikus060 commented Nov 26, 2024

@ruslandoga do you mind sharing your configuration to make Plausible run with lower resources? That would be helpful. Thanks

@ruslandoga
Contributor

👋 @ikus060

I'm running the default configuration. There is an issue open right now on lowering resource requirements for CE: plausible/community-edition#185 -- but I haven't started on it yet. My first step would be to try everything in https://clickhouse.com/docs/en/operations/tips#using-less-than-16gb-of-ram :)

@ikus060

ikus060 commented Nov 27, 2024

@ruslandoga Thanks for pointing that out. I made those changes yesterday. clickhouse-server is still a hungry process, always consuming CPU and ~349MiB of memory. Knowing how little traffic I'm handling, it's a bit scary.

@ikus060

ikus060 commented Nov 27, 2024

Here is the configuration for reference:

<clickhouse>
    <mark_cache_size>524288000</mark_cache_size> <!-- 500MB in bytes -->
    <profiles>
        <default>
            <max_block_size>1024</max_block_size>
            <max_download_threads>1</max_download_threads>
            <input_format_parallel_parsing>0</input_format_parallel_parsing>
            <output_format_parallel_formatting>0</output_format_parallel_formatting>
        </default>
    </profiles>
</clickhouse>
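
For anyone adapting the `mark_cache_size` value, a quick unit check may help. The comment in the config says 500MB; the sketch below only converts the number given there, and suggests it corresponds to 500 MiB (binary megabytes). The helper names `to_mib` and `to_mb` are made up for illustration.

```python
MARK_CACHE_SIZE_BYTES = 524_288_000  # value taken from the config above


def to_mib(n_bytes: int) -> float:
    """Convert bytes to mebibytes (1 MiB = 1024 * 1024 bytes)."""
    return n_bytes / (1024 ** 2)


def to_mb(n_bytes: int) -> float:
    """Convert bytes to decimal megabytes (1 MB = 1000 * 1000 bytes)."""
    return n_bytes / (1000 ** 2)


print(to_mib(MARK_CACHE_SIZE_BYTES), "MiB")
print(to_mb(MARK_CACHE_SIZE_BYTES), "MB")
```

To target a different size, multiply the desired number of MiB by `1024 ** 2` and put the result in the config.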
