Regression: Coordinator crashes with OOM in 468 #24572
Comments
Executing same load on Memory consumption after 1.5 hours:
OS OOM happens again
@guyco33 can you do a heapdump when gc pressure goes up? |
It looks like we need to be looking for something off-heap (the Java heap seems fine). |
@losipiuk not really |
As for python: trino/plugin/trino-functions-python/src/main/java/io/trino/plugin/functions/python/PythonEngine.java Lines 122 to 131 in 6b806ae
is potentially leaky. We are not closing Instance explicitly, just assuming GC will do the job, and IDK whether it holds some native memory. Also, it looks like we do not always have allocate/deallocate pairs, e.g. for argTypeAddress and returnTypeAddress in PythonEngine.doSetup. IDK if that is a problem or not. @electrum PTAL. But this is probably not what @guyco33 is observing, as this should not be called if Python functions are not in use. |
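To illustrate the concern about relying on GC versus closing explicitly, here is a minimal sketch. `NativeHandle` and its byte counter are hypothetical stand-ins invented for this example; they are not the `Instance` type or anything else in `PythonEngine`. The sketch only shows how try-with-resources guarantees that every allocation is paired with a release.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical stand-in for an object that owns native (off-heap) memory. */
final class NativeHandle implements AutoCloseable {
    // Bytes currently accounted as allocated, so a missing close() is visible.
    static final AtomicLong LIVE_BYTES = new AtomicLong();

    private final long size;
    private boolean closed;

    NativeHandle(long size) {
        this.size = size;
        LIVE_BYTES.addAndGet(size); // simulated native allocation
    }

    @Override
    public void close() {
        if (!closed) { // idempotent: a second close() is a no-op
            closed = true;
            LIVE_BYTES.addAndGet(-size); // simulated native free
        }
    }
}

final class NativeHandleDemo {
    /** Returns the live byte count observed while the handle is open. */
    static long runOnce(long size) {
        // close() runs when the block exits, even if the body throws.
        try (NativeHandle handle = new NativeHandle(size)) {
            return NativeHandle.LIVE_BYTES.get();
        }
    }
}
```

If the handle were simply dropped instead, the counter would stay elevated until some finalization or cleaner mechanism ran, which is the behaviour questioned above.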
I'm using |
@losipiuk There is no close for There is a potential leak of That said, that seems unrelated to this issue, as there is no evidence they are being used here. |
I ran the same tests after deleting |
@guyco33 we don't allocate off-heap memory directly. The only known instance that I'm aware of is the usage of Apache Arrow (BigQuery, Snowflake connectors), but that is not the case here. |
@guyco33 how much memory is assigned to the JVM process? Can you lower Xmx to 70% of the quota? |
We are also experiencing the same problem after upgrading to 468. We are using S3, Hive metastore and Iceberg through the JDBC connector. Because of this issue, we added the -XX:TrimNativeHeapInterval=5000 option to the JVM options. According to our metrics, native heap memory is trimmed periodically and fluctuates, but it keeps growing overall and an OOM eventually occurs. |
@teddy-hackle @guyco33 can you do a thread dump? |
@wendigo I'm using 16G for Xmx on |
@guyco33 can you provide a thread dump? |
I used to observe behaviour (in a different project) which suggested that the AWS SDK is allocating stuff off-heap when doing PUTs for S3. |
Can one of you run Trino with JVM flag |
I'm worried about this
Trino itself is not using ByteBuffer's as the Slice is wrapping around heap-allocated Is it possible for you to attach a debugger to a running process and check |
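One way to see how much memory direct `ByteBuffer`s hold is the standard `BufferPoolMXBean` from `java.lang.management`. The sketch below is an assumption about what is useful to inspect, not the check requested in the comment above; the class name `DirectBufferProbe` is made up. It only covers buffers created through `ByteBuffer.allocateDirect` and mapped files, so memory obtained by native libraries through `malloc` will not appear here.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.util.List;

final class DirectBufferProbe {
    /** Returns bytes used by the JVM's "direct" buffer pool, or -1 if no such pool is reported. */
    static long directMemoryUsed() {
        List<BufferPoolMXBean> pools =
                ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class);
        for (BufferPoolMXBean pool : pools) {
            if ("direct".equals(pool.getName())) {
                return pool.getMemoryUsed();
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Print every buffer pool the JVM exposes (typically "direct" and "mapped").
        for (BufferPoolMXBean pool :
                ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            System.out.printf("%s: count=%d used=%d capacity=%d%n",
                    pool.getName(), pool.getCount(),
                    pool.getMemoryUsed(), pool.getTotalCapacity());
        }
    }
}
```

This code reads the pools of the JVM it runs in. For an already running process, the same values are exposed over JMX under `java.nio:type=BufferPool,name=direct`.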
I think that these allocations are happening in Jetty:
I've reported this issue to Jetty folks |
Attached output files of The
How did you downgrade them? |
Build |
Did the native memory (other) change with the downgrade? |
No. Same high value |
Not sure if it was the same reason, but I believe I faced the same issue on |
I have a production cluster running on |
@robinsinghstudios base image doesn't affect runtime |
Understood. Thanks @wendigo . |
I've tested two snapshot versions of a7e72d4
4718011
The first 7 commits don't matter, which leaves reactor, metrics, aws sdk, airlift |
so it seems to be the update of
Airlift 204 is among other things Jetty 12.0.16: https://github.com/airlift/airbase/releases/tag/204 |
I've created https://github.com/airlift/airlift/pull/1349/files to bound and reuse memory/heap allocations in Jetty |
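As a rough illustration of what "bound and reuse" means for buffer allocations, here is a minimal sketch of a pool that keeps released buffers for reuse but never retains more than a fixed number of bytes. `BoundedBufferPool` is a hypothetical class written for this example; it is not Jetty's buffer pool API and not the code in the airlift pull request.

```java
import java.nio.ByteBuffer;
import java.util.ArrayDeque;

/** Hypothetical fixed-size-buffer pool with a cap on retained memory. */
final class BoundedBufferPool {
    private final ArrayDeque<ByteBuffer> free = new ArrayDeque<>();
    private final int bufferSize;
    private final long maxRetainedBytes;
    private long retainedBytes;

    BoundedBufferPool(int bufferSize, long maxRetainedBytes) {
        this.bufferSize = bufferSize;
        this.maxRetainedBytes = maxRetainedBytes;
    }

    /** Reuses a pooled buffer if one is available, otherwise allocates a new heap buffer. */
    synchronized ByteBuffer acquire() {
        ByteBuffer buffer = free.pollFirst();
        if (buffer == null) {
            return ByteBuffer.allocate(bufferSize);
        }
        retainedBytes -= bufferSize;
        buffer.clear(); // reset position and limit before handing it out again
        return buffer;
    }

    /** Keeps the buffer for reuse only while the retained total stays within the bound. */
    synchronized void release(ByteBuffer buffer) {
        if (buffer.capacity() != bufferSize
                || retainedBytes + bufferSize > maxRetainedBytes) {
            return; // over the bound or wrong size: drop it and let GC reclaim it
        }
        free.addFirst(buffer);
        retainedBytes += bufferSize;
    }

    synchronized long retainedBytes() {
        return retainedBytes;
    }
}
```

The design choice here is that the bound applies to what the pool retains, not to what callers may hold at once, so a burst of traffic cannot leave the pool permanently holding the peak amount of memory.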
I've just approved that running
@guyco33 can you test with the airlift fix in place? |
@guyco33 airbase is just a set of dependency declarations. Airbase defines versions for airlift, and airlift is used by Trino. We need Jetty 12.0.16 as it solves other issues that we were facing, so downgrading is not really an option. |
We've filed jetty/jetty.project#12670 to track the Jetty improvement |
@wendigo - I wonder whether the above issue is impacting HTTP 1.x alone. I have been running 468 code in a cluster for the past 14 days with HTTP 2.0 on and no OOMs so far. I didn't send a ton of traffic through that cluster though. @mosabua - How about adding a warning on the 468 release page about this known issue? |
@sajjoseph yep, it's affecting HTTP/1 only |
@sajjoseph ... we don't really add regressions to old release notes since it would be spreading many of them all over the place and would also require testing and knowing where any issues apply. We just add a release notes entry for the fix when it comes... this one will be in 469 from what I can see at the moment. |
Coordinator in 468 consumes much more memory than 467, which leads to OS OOM.
What could be the reason for such a change in memory consumption?
Listed below is the output from free memory stats and GC logs during execution on both coordinator versions.
Trino coordinator is running on EC2 machine type r6a.xlarge with OpenJDK Runtime Environment Temurin-23.0.1+11 (build 23.0.1+11).
jvm.conf is