-
Notifications
You must be signed in to change notification settings - Fork 205
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Notable increase in memory usage from 1.25.x to 1.30.x #2500
Comments
Thanks for reporting @ArminShoeibi. I'll need to schedule some time to hopefully reproduce this so I can analyse the memory usage. |
Happy to help speed up the process! Can you let me know which .NET tool to add to our Docker image first, and what kind of dump/output would be most helpful for you |
Ah, you're running on Linux? I'm less familiar with analysing memory on Linux, but I believe dotnet-dump tool will be sufficient to get a managed memory dump that could be analysed. |
Yeah, we are using K8s and my base images are
okay, I will get a dotnet-dump with version 1.29 and will send it here to check it with the available tools. |
Thanks @ArminShoeibi. That high volume of |
That was just the baseline, here is a pod after working about 8~hrs all the Unknowns are related to Elastic.Apm The Log level is error, we don't log anything in the gateway.
|
I need to downgrade to version 1.23 to stay under the company's radar. I'll provide the corresponding dumps for version 1.23 within the next 6-7 hours |
Thanks for the thorough investigation so far, @ArminShoeibi. I appreciate you taking the time to provide these screenshots. I spent some time reviewing our logging code and profiling locally this morning. I can see the causes for the allocations. There's no apparent leak, per se, but as some undue allocations are |
This PR dramatically reduces the overhead of scoped logging, even for scenarios where the configured log level verbosity was low (e.g., Warning). The most critical improvement is caching scoped loggers within Spans, Traces, and Errors. We lose convenience in no longer having the span ID in the scope, but it's generally available from the Span argument. Previously, a new `ScopedLogger` was created for every span and transaction, and a new `ConditionalWeakTable` was also allocated. Due to the use of Finalisers on the CWT, this had an even more significant GC impact. We now maintain a cache of scoped loggers (of which there are relatively few). The ID creation for the Spans has been switched to `ActivitySpanId.CreateRandom().ToString()` from the MS diagnostics library, as this has been more heavily optimised than our original version and achieves the same goal. In `Tracer.End()`, we used an `Assertion` with a large interpolated string. This was being interpolated for each transaction as the `Assertion` path was enabled. However, the condition we assert is generally not met. Therefore, we now check this manually to only create the message if we are in an invalid state. I've also switched two structs to record structs, as these were used as keys in dictionaries and ended up being boxed due to a lack of `IEquality` implementations. These are phase one of optimisations for the logging implementation and address the general production scenario of using a less verbose log level. ### Profiling Profiling was achieved using a .NET 9 ASP.NET Core minimal API project with 20 concurrent connections and 100,000 requests. **Before Memory Profile** ![image](https://github.com/user-attachments/assets/6ba52339-dc4e-44a0-811d-039c9807c0fb) **After Memory Profile** ![image](https://github.com/user-attachments/assets/22a8fdcc-3cab-4b83-bd27-b6107f849899) **Before Top Retained** ![image](https://github.com/user-attachments/assets/348c837e-a910-4220-8a60-e12b4fd3f961) **After Top Retained** ![image](https://github.com/user-attachments/assets/d7a31e7b-f5cf-419a-8eaa-ed4042a07470) Closes #2500
@ArminShoeibi 1.31.0 was released today and should hopefully have a significantly better footprint. |
Thanks @stevejgordon, I will test it and inform you. |
Hi there,
Initial Update and Issue Discovery On November 24, 2024, at 1:26:34 AM GMT+3:30, I updated our gateway to:
Approximately 120,000 TPM (Transactions Per Minute)
Observed Issue: Notable Memory Usage Increase Following the update, a significant rise in memory usage was detected. For reference, our service's memory usage pattern is illustrated below on the morning of discovery. Fortunately, our 4GB RAM per pod capacity temporarily mitigated the immediate impact.
Troubleshooting Steps:
Rollback and Partial Update: I rolled back the changes and only updated Yarp and dThe outcome is shown below.
![Image](https://private-user-images.githubusercontent.com/68565441/389718624-c1a7b00c-1808-44cd-ba69-14cee0549360.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3Mzg5MDM3NTQsIm5iZiI6MTczODkwMzQ1NCwicGF0aCI6Ii82ODU2NTQ0MS8zODk3MTg2MjQtYzFhN2IwMGMtMTgwOC00NGNkLWJhNjktMTRjZWUwNTQ5MzYwLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTAyMDclMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwMjA3VDA0NDQxNFomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPWIyNTZlZTIwZDgzNWExNmZkODRkN2M2YzBhMTQ2NjcyNTQ5ZDNlYmVkZjc0MmQ1NmNiMDMxNmJiNWJiMjAyYTQmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.xGzxchCe1Mm5uI74pshNEl-ytnw_3mCWAmeqFbgBWwo)
Testing Elastic.Apm.AspNetCore Versions:
Version 1.25: Although this version exhibited errors (detailed below), memory usage was more efficient compared to version 1.23.
![Image](https://private-user-images.githubusercontent.com/68565441/389587125-783fa7a3-fb16-4266-8cbb-2489409e9a4a.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3Mzg5MDM3NTQsIm5iZiI6MTczODkwMzQ1NCwicGF0aCI6Ii82ODU2NTQ0MS8zODk1ODcxMjUtNzgzZmE3YTMtZmIxNi00MjY2LThjYmItMjQ4OTQwOWU5YTRhLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTAyMDclMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwMjA3VDA0NDQxNFomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTBjYjgxMWUyNTI3Mjc0NmJjNDVjODIxMDQyOGFiOTgwNjZmOWEyNjg5NjY5ZjNiYjg4YjQxZmIxYzcxNmQwYjkmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.r9Sq5EhVdN2J6Er3QKk1Nm_aZJYmiIboT9Oh87BDssk)
But Kibana was empty in this duration
Version 1.29.0 (Current): Memory usage increases at a gentler rate. However, version 1.25 was preferable in terms of memory efficiency, despite its error profile and the absence of data in Kibana.
![Image](https://private-user-images.githubusercontent.com/68565441/389587458-b6843b9e-2c8b-408a-be43-d3de943feb26.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3Mzg5MDM3NTQsIm5iZiI6MTczODkwMzQ1NCwicGF0aCI6Ii82ODU2NTQ0MS8zODk1ODc0NTgtYjY4NDNiOWUtMmM4Yi00MDhhLWJlNDMtZDNkZTk0M2ZlYjI2LnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTAyMDclMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwMjA3VDA0NDQxNFomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPWQ4ZTM1ODFlMjY0M2Q5Mjk4MTUyZWNmZWIzY2M3MDc1OTFhM2FiNGMxZDdkNTc4ZDQ2YWY3NTBjZWI1ZWYxZTQmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.cihUFZZjdICs5wbVogDsTX0F4JMYjLo2ntFUcT6fMBg)
Empty APM with version 1.25
![Image](https://private-user-images.githubusercontent.com/68565441/389588528-43ac6b04-53a6-455e-b10c-b1c1a8a0e010.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3Mzg5MDM3NTQsIm5iZiI6MTczODkwMzQ1NCwicGF0aCI6Ii82ODU2NTQ0MS8zODk1ODg1MjgtNDNhYzZiMDQtNTNhNi00NTVlLWIxMGMtYjFjMWE4YTBlMDEwLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTAyMDclMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwMjA3VDA0NDQxNFomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTU1MjQ3OGYwNzcyNjdlZjA0N2Q5Mzc4NTE2N2IzMDdmYWRjYTQ1NDZiY2VjOTRlNmU3MzRmMWQ0NWZmNWEyYjImWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.J_fqKpzJeiVRUdX02jQiQDpSOpS_a5GyylP-F18_gJA)
Additional Metrics:
Ratio: 0.1
![Image](https://private-user-images.githubusercontent.com/68565441/389587964-fb20e426-4c64-4e49-ad60-4234c5588998.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3Mzg5MDM3NTQsIm5iZiI6MTczODkwMzQ1NCwicGF0aCI6Ii82ODU2NTQ0MS8zODk1ODc5NjQtZmIyMGU0MjYtNGM2NC00ZTQ5LWFkNjAtNDIzNGM1NTg4OTk4LnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTAyMDclMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwMjA3VDA0NDQxNFomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPWYxMGM3NTdmNmMxZjAwNWUwYjhlYWQzMWExNTAyMjU5MzdlMjM0YjA5N2Q3OGFjOTQ5YTM4MzY2ZTBiMDBlMzMmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.hK5DQLRM9J6S5kyaSpO_pJ6M6l922EUymoZDvindhvc)
Throughput: Illustrated below for context.
The text was updated successfully, but these errors were encountered: