By default, the server will only snapshot after it receives 900 WAL files and it only tries to snapshot 600 of them. Under high load this can lead to runaway memory use in the buffer.
We'll need a configuration option to set a memory threshold for when we should force a snapshot. Ideally, the threshold would be based on the size of the QueryableBuffer, but that might be inaccurate and expensive to measure. We could also use the process memory size. Let's try the QueryableBuffer option and see how it works.
Set the default value to 70% of whatever the detected system memory is.
We should have a background task that checks every 10s to see if the queryable buffer has hit this threshold. If it has, we should force a snapshot in which everything in the queryable buffer will be persisted to Parquet and all WAL files will be cleared out.
We can develop something a little more precise and less like a sledgehammer later.
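A minimal sketch of the check described above, assuming the buffer size is available through a closure and that forcing a snapshot is a plain callback. The names `default_threshold_bytes`, `should_force_snapshot`, and `spawn_memory_check` are hypothetical, and a std thread stands in for whatever async runtime the server actually uses:

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

/// Default threshold: 70% of the detected system memory.
/// How system memory is detected is out of scope here; the caller passes it in.
pub fn default_threshold_bytes(system_memory_bytes: u64) -> u64 {
    // Widen to u128 so the multiplication cannot overflow.
    (system_memory_bytes as u128 * 70 / 100) as u64
}

/// True once the buffer has reached the configured threshold.
pub fn should_force_snapshot(buffer_bytes: u64, threshold_bytes: u64) -> bool {
    buffer_bytes >= threshold_bytes
}

/// Hypothetical background task: every `interval` (10s in the proposal),
/// read the buffer size and force a snapshot if it is over the threshold.
/// `buffer_size` and `force_snapshot` are stand-ins for the real hooks.
pub fn spawn_memory_check<S, F>(
    interval: Duration,
    threshold_bytes: u64,
    buffer_size: S,
    force_snapshot: F,
    shutdown: Arc<AtomicBool>,
) -> thread::JoinHandle<()>
where
    S: Fn() -> u64 + Send + 'static,
    F: Fn() + Send + 'static,
{
    thread::spawn(move || {
        while !shutdown.load(Ordering::Relaxed) {
            if should_force_snapshot(buffer_size(), threshold_bytes) {
                force_snapshot();
            }
            thread::sleep(interval);
        }
    })
}
```

In the server this would presumably be called with `Duration::from_secs(10)` and a threshold taken from the new configuration option, falling back to `default_threshold_bytes`.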
> Ideally, the threshold would be based on the size of the QueryableBuffer, but that might be inaccurate and expensive to measure. We could also use the process memory size. Let's try the QueryableBuffer option and see how it works.
At a high level we have two options, as you suggested there:
1. Use QueryableBuffer memory usage
2. Use process memory usage
I'm wondering if we could use process memory usage, because it's less intrusive. It probably masks real buffer usage, since the caches could be using most of the memory, but triggering the snapshot based on whole-process memory usage is fairly straightforward.
If we'd still like to proceed with memory tracking for QueryableBuffer, I need to introduce a method to find the size of the fields embedded in QueryableBuffer, mainly BufferState, LastCache, MetaCache, Persister and PersistedFiles (I think). This is a more intrusive approach, but it will possibly give a more accurate reading. I also wonder if there's a way to track the number of bytes added to each of these structs (at a high level, within QueryableBuffer itself) when a buffer op is played, so that we don't need to recursively check the bytes consumed by the caches, for example. We would then deduct that usage when the buffer is flushed or a cache evicts entries.
Would you still like to proceed with tracking memory for QueryableBuffer? And do you think tracking bytes (and/or rows) would be better, for example to avoid taking locks whilst calculating the size of nested data structures?
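To make the byte-tracking idea concrete, here is a rough sketch of a shared counter that is incremented when a buffer op is applied and decremented on flush or cache eviction. `BufferSizeTracker` and its methods are hypothetical names, not existing types, and the atomic counter is just one way to avoid taking locks when reading the size:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical running total of bytes held by the buffer and its caches.
#[derive(Debug, Default)]
pub struct BufferSizeTracker {
    bytes: AtomicUsize,
}

impl BufferSizeTracker {
    pub fn new() -> Self {
        Self::default()
    }

    /// Call when a buffer op adds `n` bytes (writes, cache inserts).
    pub fn add(&self, n: usize) {
        self.bytes.fetch_add(n, Ordering::Relaxed);
    }

    /// Call when `n` bytes are released (flush to Parquet, cache eviction).
    /// Saturates at zero so an over-count on release cannot wrap around.
    pub fn sub(&self, n: usize) {
        let _ = self
            .bytes
            .fetch_update(Ordering::Relaxed, Ordering::Relaxed, |current| {
                Some(current.saturating_sub(n))
            });
    }

    /// Lock-free read of the current estimate.
    pub fn current(&self) -> usize {
        self.bytes.load(Ordering::Relaxed)
    }
}
```

The trade-off is that the counter is only as accurate as the sizes reported at each `add` and `sub` call site, so it is an estimate rather than a measurement.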
So you could get a size by walking the databases and tables and summing it all together. That might be a good place to start. The tricky bit is it won't be exact and ultimately it's the process memory that matters since that's what'll trigger an OOM kill.
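As a sketch of that walk, assuming a simplified layout where each database holds a map of tables and each table knows the byte size of its chunks. `DatabaseBuffer`, `TableBuffer`, and `estimated_buffer_bytes` are hypothetical stand-ins, not the real buffer types:

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a table's buffered data.
pub struct TableBuffer {
    /// Estimated size in bytes of each buffered chunk.
    pub chunk_sizes_bytes: Vec<usize>,
}

/// Hypothetical stand-in for a database's buffered tables.
pub struct DatabaseBuffer {
    pub tables: HashMap<String, TableBuffer>,
}

/// Walk every database and table and sum the estimated chunk sizes.
/// The result is an estimate: it ignores map overhead and allocator slack.
pub fn estimated_buffer_bytes(databases: &HashMap<String, DatabaseBuffer>) -> usize {
    databases
        .values()
        .flat_map(|db| db.tables.values())
        .flat_map(|table| table.chunk_sizes_bytes.iter())
        .sum()
}
```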
So I'm honestly not sure what would be the best approach here. The other thing about process memory is that it can spike if there's some expensive query and we wouldn't want that triggering a premature persistence. So you'd need to track some sort of moving average if using that.
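If process memory were used, the smoothing could look something like the sketch below: a simple moving average over the last few samples, so a single spike from an expensive query does not cross the threshold on its own. `MovingAverage` is a hypothetical helper, and the window size would need tuning:

```rust
use std::collections::VecDeque;

/// Hypothetical simple moving average over the last `window` samples.
pub struct MovingAverage {
    window: usize,
    samples: VecDeque<u64>,
    sum: u64,
}

impl MovingAverage {
    pub fn new(window: usize) -> Self {
        assert!(window > 0, "window must be at least 1");
        Self {
            window,
            samples: VecDeque::with_capacity(window),
            sum: 0,
        }
    }

    /// Record a new sample (e.g. process memory in bytes) and return the
    /// average of the samples currently in the window.
    pub fn record(&mut self, sample: u64) -> u64 {
        // Drop the oldest sample once the window is full.
        if self.samples.len() == self.window {
            if let Some(oldest) = self.samples.pop_front() {
                self.sum -= oldest;
            }
        }
        self.samples.push_back(sample);
        self.sum += sample;
        self.sum / self.samples.len() as u64
    }
}
```

With a window of 3 and samples of 100, 100, 1000, the average after the spike is 400, so a threshold of 700 would not fire on that one reading.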
I think we should use the QueryableBuffer size as the trigger for now, and we can adjust after some testing.