Excessive resource usage from clickhouse #301
Comments
@RealOrangeOne this is expected. I suggest you rotate the log file and reduce the log level to warn or error (I wouldn't recommend going higher than warn in a prod environment) to prevent the disk from filling up. The query that you see is from the health check, see https://github.com/plausible/analytics/blob/master/lib/plausible_web/controllers/api/external_controller.ex#L38
The log file is capped at 1000M anyway, so it's not the end of the world. But yes I'm just changing the log level and mounting the log location on a Changing that doesn't change the resource usage though, as i'd kinda expect. Even if I make the application completely unroutable, and so the healthcheck endpoint isn't being called, usage is unchanged. On a side note, wouldn't running that query in the healthcheck only result in it being run once, rather than for each connection in the pool? |
We don't control Clickhouse's resource consumption. I'll keep this issue open for a while for discussion but unless we identify an issue related to our codebase, I will close it.
Yeah, the healthcheck should only run once and only when you actually hit the I'm not sure what's going on with your setup but my 2 cents is that when I run Clickhouse with Docker on my dev machine, I tend to see excessive resource usage as well. However, in our deployment I've installed it on Ubuntu directly with no Docker and it runs much better. So my (very uninformed) guess is that CH just doesn't run very efficiently on Docker. Maybe the authors over at https://github.com/ClickHouse/ClickHouse/ can shed light on this |
My thought too, which I think means there is a bug here in plausible, even if it's not the entire story of this issue?
Well it's comforting to know it's not just me! I'll try looking at either an alternative clickhouse container, or running it on the host OS and see if that makes a noticeable difference. Usage looks fine when plausible is turned off, so it's definitely query overhead rather than simply idle usage. |
After doing a load of research, I found some issues in the default container configuration which make Clickhouse super inefficient. They're also there in the Debian package, so I'm not sure why it's so much better on Plausible's production environment, but it might just be hidden by regular user traffic 🤷. Either way, I'm pretty confident I've fixed the glaring issues with Clickhouse, although I do think it'd be worth checking into the multiple queries issue, even if just to check. But definitely not super urgent! @ukutaht BTW if your local setup is using excessive resources, I wrote up my findings, which may be of interest to you! https://theorangeone.net/posts/calming-down-clickhouse/
@RealOrangeOne Thanks! I only got around to reading it now. I will definitely refer back to your post when I look at our production setup next time. I think I've also figured out what's causing these SELECT 1 calls you're seeing. The database library we're using (https://github.com/clickhouse-elixir/clickhousex) uses DBConnection which itself has a mechanism to ping the database every 1000ms if the connection is idle. Check the I'm not sure how much sense it makes to ping every second. Could also do every 5 seconds, 10? I intend to try out a different database library (https://github.com/CatTheMagician/pillar). The main reason is that Pillar has support for migrations. I don't know the internals of that library but it might not ping automatically. If I end up sticking with the current library, we can default to a longer ping interval for database connections to reduce idle resource usage. |
If the ping happens once per connection in the pool, that'd definitely do it! I guess if it's in the library there's not much we can do about it, besides make it a huge amount of time and rely on the healthcheck endpoint. Or making it user configurable depending on the scale of the deployed instance (or how much the user cares about the ping) Tabling this until you've migrated to the alternative library is probably a good idea, as you say that might ping in a different way, or not ping at all (background pinging is a weird thing to see having worked with other DB libraries). Migrations are definitely a thing worth having! |
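To put the idle-ping discussion above into numbers, here is a rough back-of-the-envelope sketch. It assumes every connection in the pool stays idle all day and pings on a fixed interval. The function name and parameters are made up for illustration and are not part of Plausible or its database library; the pool size of 10 and the 1 to 2 second intervals are the figures quoted in this thread.

```python
def idle_pings_per_day(pool_size: int, ping_interval_s: float) -> int:
    """Estimate how many idle-ping queries a fully idle pool sends per day.

    Assumption: each connection pings independently every `ping_interval_s`
    seconds and never does any real work in between.
    """
    if pool_size < 0 or ping_interval_s <= 0:
        raise ValueError("pool_size must be >= 0 and ping_interval_s > 0")
    seconds_per_day = 24 * 60 * 60
    return int(pool_size * seconds_per_day / ping_interval_s)


if __name__ == "__main__":
    # Compare the intervals floated in the thread for a pool of 10.
    for interval in (1, 2, 5, 10):
        print(f"{interval:>2}s interval: {idle_pings_per_day(10, interval)} pings/day")
```

Under these assumptions, moving from a 1 second to a 10 second interval cuts the idle query count by a factor of ten, which is why a longer default (or a user-configurable one) was suggested.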
Couple of updates:
Basically it seems that Clickhouse needs special configuration when running with <32GB memory. I suspect most self-hosters are running on something like 1-4GB memory. So it seems to be that we should add a Clickhouse configuration file in our self-hosting documentation. It would configure CH for low-memory environment and also reduce logging as per your blog post @RealOrangeOne |
I know some self-hosters may be running on rigs with more than 32GB, but even then they probably don't want Clickhouse using all of it! Shipping a basic config is probably a good idea, although obviously we'd want it to be small and simple enough that it doesn't need constant maintenance. Happy to test any config changes you have!
@RealOrangeOne did you edit the configuration files directly in the running container? Or did you mount some extra configuration files to be merged with the default config? I've been trying to figure out how to keep using the default CH configuration and apply your override. I want to add the overrides to our hosting template. After 1.5h I give up, I can't get the config to load properly. I couldn't figure out the interactions between how Clickhouse loads configuration and the details of how Docker mounts single files (not directories).
@ukutaht I mounted some additional config files into the container, so they'd persist between restarts. Sorry that's not clear from the article. You can take a look at the compose file I'm using for my Plausible instance here: https://github.com/RealOrangeOne/infrastructure/tree/master/ansible/roles/plausible/files. Note that the files are copied into the right places by some external ansible scripts, but specifically take a look at the Let me know if that works 😄 |
Thanks! This is what I get:
That's very strange. If you can share your compose file I'm happy to give it a try locally and debug, but yeah, it looks like the mounts are the issue. Mine are just bare files mounted directly into the container, read-only just in case:
volumes:
  - /mnt/tank/dbs/clickhouse/plausible:/var/lib/clickhouse
  - /mnt/tank/dbs/clickhouse/docker_related_config.xml:/etc/clickhouse-server/config.d/docker_related_config.xml:ro
  - /mnt/tank/dbs/clickhouse/docker_related_user_config.xml:/etc/clickhouse-server/users.d/docker_related_user_config.xml:ro
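For anyone who wants to try the mounted-override approach, the sketch below generates a minimal file that could be dropped into `config.d`. It is only an illustration, not the configuration from the linked article: it changes nothing but the logger level, which is the one setting this thread spells out (warn or error). The function names and the output path are hypothetical, and older ClickHouse releases expect `<yandex>` as the root element rather than `<clickhouse>`.

```python
import xml.etree.ElementTree as ET


def build_logging_override(level: str = "warning") -> str:
    """Return a small ClickHouse override document that only sets logger/level.

    Assumption: the server merges files from config.d into its main config,
    so this fragment does not need to repeat any other settings.
    """
    root = ET.Element("clickhouse")  # older releases use <yandex> here
    logger = ET.SubElement(root, "logger")
    ET.SubElement(logger, "level").text = level
    return ET.tostring(root, encoding="unicode")


def write_override(path: str, level: str = "warning") -> None:
    """Write the override to `path` (a hypothetical location on the host)."""
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(build_logging_override(level))
```

Calling `write_override("docker_related_config.xml")` on the host and mounting the result read-only, as in the volumes list above, keeps the default configuration intact and applies only the override.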
Woah, that's huge! That would explain why my micro server (550MB of RAM, 1 vCPU) has been struggling since I switched to Plausible... My server is regularly down, I often get 500 errors when trying to access my dashboards, and the average response time for the tracking script is above 1000ms... From what I understand, it's not possible to use Plausible without Clickhouse, right?
Yeah, although I've been able to run Plausible OK on the smallest digital ocean droplet (1GB mem). Make sure you're not having problems because of this issue: plausible/community-edition#4
At the moment Clickhouse is required |
Hi, I installed Plausible today and have been running into this issue on a 1GB Digital Ocean droplet (CentOS) running behind Traefik. Things work fine for a while (up to several hours), but errors like this one are eventually thrown when I try to refresh the dashboard:
Eventually, this is logged: |
Here are some more logs leading up to the crash:
I logged memory usage at 1-second intervals. When memory is low, the analytics dashboard struggles to refresh. The final crash takes place around line 930. |
Unless you're running the additional settings, this is likely related to the fact Clickhouse requires a lot of memory. See linked thread in #301 (comment). The machine I run Plausible on does have far more RAM than 1GB, but i've not seen the excessive memory usage. It's still a bit of a pain to run, though. |
@malinowskip What kind of traffic are you getting on that instance? Just curious because I'm also running a test instance on DO smallest droplet and it works fine for our landing page: https://testing.plausible.io/plausible.io/. It's probably a matter of time until it starts running into memory issues though. At some point I will take another crack at providing a sample config for Clickhouse in limited memory environments. |
I'm getting zero traffic, unless I visit the site myself. It looks like Clickhouse increases memory usage and frees it at regular intervals. My memory logs show that while I'm running Plausible, available memory regularly goes from around 140MB to around 30MB – back and forth – and it doesn't seem to be traffic-related. |
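To reproduce this kind of measurement, a minimal sketch of a one-second memory logger is shown below. It assumes a Linux host where `/proc/meminfo` exposes a `MemAvailable` line. The function names are made up for illustration, and this is not the script used by the commenter above.

```python
import time


def parse_mem_available_kib(meminfo_text: str) -> int:
    """Extract the MemAvailable value (in KiB) from the text of /proc/meminfo."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            # Line format: "MemAvailable:     143360 kB"
            return int(line.split()[1])
    raise ValueError("MemAvailable not found in meminfo text")


def log_available_memory(samples: int, interval_s: float = 1.0) -> None:
    """Print one timestamped MemAvailable reading per interval (Linux only)."""
    for _ in range(samples):
        with open("/proc/meminfo", encoding="ascii") as fh:
            kib = parse_mem_available_kib(fh.read())
        print(f"{time.strftime('%H:%M:%S')} available={kib // 1024} MiB")
        time.sleep(interval_s)
```

Running `log_available_memory(3600)` alongside Plausible for an hour should make it easy to see whether available memory swings back and forth on an idle instance in the way described above.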
I didn't look into this more than attempting to set memory limits, failing, and shutting the thing down, but resource usage quickly killed my server too. The memory limits for the containers appear to default to 2gb each, and clickhouse made a quick meal of my low traffic 2gb droplet with 2gb of swap. |
I've done some more playing with clickhouse, after noticing it was still logging to tables to disk, and finally removed them all! 🎉 I've added the details to my article linked above, and i'll try and form it into a PR soon! Since adding all this, i've not had any issues with memory, and i've never had a leak or crash due to it. It's possible it's all related, which would be good! |
awesome @RealOrangeOne. Would love a PR for this :) |
I've created a pull request to add my configuration to the hosting repo: plausible/community-edition#13. I think that fix mostly alleviates the resource issues for Clickhouse. It'd be great if people in this thread could try it out and see if it helps them (comment on the PR rather than here). I'd also be interested in knowing how it impacts resources on the production plausible.io instance, as it could make quite a difference there! There were some other issues mentioned here, mostly around pooling and doing aliveness checks too often. I think those should still be fixed, but they'll have less impact.
Tangentially related: it may be that Plausible has chosen to continue with Clickhouse, in which case my comment is moot. But there are Postgres extensions that implement columnar stores. One such example is https://pgxn.org/dist/cstore_fdw/
Fixed in plausible/community-edition#13. Sorry people, no plans to switch from Clickhouse. It is purpose-built for web analytics and it's been proven at ridiculous scale at Yandex.Metrica. Column-based PostgreSQL would have acceptable query performance, but it wouldn't replicate some features we rely on with Clickhouse, e.g. the CollapsingMergeTree table engine for real-time session analytics.
If anyone is interested, I have switched to umami, which is (IMHO) a better option if you want self-hosted analytics on a small server. It's quite similar to Plausible, only lighter and easier to set up, while Plausible is a better fit if you're looking for performance on heavy-traffic websites. I hope my comment doesn't offend anyone, I'm just trying to help people who have the same problem I encountered 😬
I am also getting the same error with docker-compose (with external postgresql configuration)
Any ideas? |
Given it's performance-related, I'd say definitely make sure you're running the latest version of Plausible (and perhaps update Clickhouse, too), and then apply the config changes to Clickhouse I mentioned above.
I'm running the latest Plausible-CE (2.1.4) and seeing high CPU usage (~5% on a quad core i5-7600k, 3h13 CPU time on 50h server uptime). Memory usage is acceptable but still kind of high (500MB). I have approximately zero traffic going, so I'm not sure why this would be the case. I reconfigured health checks to be every 5 minutes so it's not that. |
Would you be able to share |
@ruslandoga here you go, with other non-Plausible services removed of course:
Here's the logs too: https://filetransfer.murraygrov.es/file/0mG4YZlDCgDeLvDa/99msx36IZVKrgXfz/logs.txt |
But other than that, it looks normal and ClickHouse is ... doing its thing :) Some of these discussions might be of use if you want to make ClickHouse leaner:
@ruslandoga seems to be working with that memory usage. Just checked now and it was at 30MB, loaded up the dashboard and it went up to 60MB. I've tried the suggestions in those threads but it doesn't seem to make much of a difference unfortunately. Out of interest, what is Plausible doing in the background that takes up 1.5-2% CPU? Maybe that includes DB queries which are causing the Clickhouse usage. |
I doubt there is much correlation between Plausible and ClickHouse background activities.
Probably some Erlang stuff. I don't know if it still does it, but it used to spin CPU even when idle to get lower latencies when the actual work comes in. And also Plausible has quite a few dependencies now, some of which have background processes running: HTTP clients maintain TCP connection pools, geolocation library maintains and updates MMDB databases, telemetry collects telemetry, etc.
The only thing I can think of (other than emailing reports in background jobs) are connection health checks, and those use |
Oh well, that's a shame. Guess I'll have to find something with lower idle resource usage. |
Clickhouse wrecks my CPU, claiming 15% usage permanently even at night when all my users are asleep |
Thank you for the information! I think we can try configuring ClickHouse for a low resource usage in CE. |
@ruslandoga do you mind sharing your configuration to make Plausible run with lower resources? That would be helpful. Thanks
👋 @ikus060 I'm running the default configuration. There is an issue open right now on lowering resource requirements for CE: plausible/community-edition#185 -- but I haven't started on it yet. My first step would be to try everything in https://clickhouse.com/docs/en/operations/tips#using-less-than-16gb-of-ram :) |
@ruslandoga Thanks for pointing that out. I made those changes yesterday.
Here's the configuration for reference:
Bug report
Describe the bug
Clickhouse is using a crazy large amount of resources when Plausible is getting little to no traffic.
Looking at /etc/clickhouse-server/clickhouse-server.log (crazy large file BTW), there are lots of logs stating it's querying when nothing is happening:
<Debug> executeQuery: (from xxx.xxx.xxx.xxx:xxx) SELECT 1 FORMAT JSONCompact
These are run once every 2 seconds or so, for each connection in the pool (10 by default).
If I stop plausible, the usage goes down to nothing, so it's definitely something coming from plausible itself causing the usage, and I suspect it's these strange queries.
Expected behavior
Clickhouse only jumps up when it's doing things, and likely not to this height.
Screenshots (If applicable)
N/A
Environment (If applicable):