
Notable increase in memory usage from 1.25.x to 1.30.x #2500

Closed
ArminShoeibi opened this issue Nov 25, 2024 · 14 comments · Fixed by #2503
Labels: community, performance (Items related to performance)

Comments

@ArminShoeibi commented Nov 25, 2024

Hi there,

Initial Update and Issue Discovery

On November 24, 2024, at 1:26:34 AM GMT+3:30, I updated our gateway to:

  • .NET 9 from 8
  • Yarp 2.2 from 2.1
  • Elastic.Apm.AspNetCore 1.30.1 from 1.23.0

The gateway handles approximately 120,000 TPM (transactions per minute).

Observed Issue: Notable Memory Usage Increase

Following the update, a significant rise in memory usage was detected. For reference, our service's memory usage pattern on the morning of discovery is shown below. Fortunately, the 4 GB of RAM per pod temporarily mitigated the immediate impact.

Image

Troubleshooting Steps:

Rollback and Partial Update: I rolled back the changes and only updated YARP and .NET. The outcome is shown below.
Image

Testing Elastic.Apm.AspNetCore Versions:

Version 1.25: Although this version exhibited errors (detailed below), memory usage was more efficient compared to version 1.23.
Image
However, Kibana showed no APM data during this period.

Version 1.29.0 (Current): Memory usage increases at a gentler rate. However, version 1.25 was preferable in terms of memory efficiency, despite its error profile and the absence of data in Kibana.
Image

Empty APM with version 1.25
Image

Additional Metrics:

Sampling ratio (TransactionSampleRate): 0.1
Throughput: Illustrated below for context.
Image

@ArminShoeibi (Author)

This is 1.29 right now

Image

Kibana and APM are okay; we don't have many errors (the header-related ones), but memory is rising.

@stevejgordon (Contributor)

Thanks for reporting @ArminShoeibi. I'll need to schedule some time to hopefully reproduce this so I can analyse the memory usage.

@stevejgordon added the performance (Items related to performance) and community labels and removed the triage label on Nov 25, 2024
@ArminShoeibi (Author)

> Thanks for reporting @ArminShoeibi. I'll need to schedule some time to hopefully reproduce this so I can analyse the memory usage.

Happy to help speed up the process! Can you let me know which .NET tool to add to our Docker image first, and what kind of dump/output would be most helpful for you, @stevejgordon?

@stevejgordon (Contributor) commented Nov 25, 2024

Ah, you're running on Linux? I'm less familiar with analysing memory on Linux, but I believe the dotnet-dump tool will be sufficient to get a managed memory dump that can be analysed.

@ArminShoeibi (Author)

> Ah, you're running on Linux? I'm less familiar with analysing memory on Linux, but I believe the dotnet-dump tool will be sufficient to get a managed memory dump that can be analysed.

Yeah, we are using K8s and my base images are

  1. mcr.microsoft.com/dotnet/sdk:9.0.100-bookworm-slim-amd64
  2. mcr.microsoft.com/dotnet/aspnet:9.0.0-bookworm-slim-amd64

Okay, I will capture a dotnet-dump with version 1.29 and post it here so it can be checked with the available tools.
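
For reference, a minimal sketch of the collection steps, assuming the tools are installed inside the running container and the app is PID 1 (an assumption; `dotnet-dump ps` lists candidate processes). This is illustrative, not the exact commands used here:

```bash
# Install the diagnostic tools (these could also be baked into the image at build time).
dotnet tool install --global dotnet-dump
dotnet tool install --global dotnet-gcdump
export PATH="$PATH:$HOME/.dotnet/tools"

# Capture a lightweight GC heap snapshot and a full managed dump from the app
# process (PID 1 is assumed for a typical single-process container).
dotnet-gcdump collect -p 1
dotnet-dump collect -p 1
```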

@ArminShoeibi (Author)

This picture is better

Image

By the way, we have a lot of endpoints.

Image
Image

@ArminShoeibi (Author) commented Nov 25, 2024

This is my first gcdump after a redeploy:
Image

I will upload a complete one, with the dump file itself, in about 12 hours, but right now the APM-related allocations sit on top of everything else.

Flamegraph of the gcdump:

Image

@stevejgordon (Contributor)

Thanks @ArminShoeibi. That high volume of LogValuesFormatter looks suspicious. What configuration are you using for the agent? Specifically, what log level have you enabled?

@ArminShoeibi (Author) commented Nov 26, 2024

> Thanks @ArminShoeibi. That high volume of LogValuesFormatter looks suspicious. What configuration are you using for the agent? Specifically, what log level have you enabled?

That was just the baseline; here is a pod after running for about 8 hours:

Image

All the "Unknown" entries are related to Elastic.Apm.

Image

The log level is Error; we don't log anything in the gateway, and our configuration is really simple:

  "ElasticApm": {
    "Environment": "Production",
    "Enabled": true,
    "ServerUrl": "http://apm-server-***.***:4000/",
    "TransactionSampleRate": 0.1
  }

Image
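
(For context, a minimal sketch of how the agent is typically registered in an ASP.NET Core app so that the ElasticApm section above gets picked up; this is an assumed setup, not the reporter's actual Program.cs.)

```csharp
using Elastic.Apm.AspNetCore;

var builder = WebApplication.CreateBuilder(args);
var app = builder.Build();

// Registers the APM middleware; ServerUrl, TransactionSampleRate, etc. are
// read from the "ElasticApm" section of appsettings.json via IConfiguration.
app.UseElasticApm(builder.Configuration);

app.MapGet("/", () => "ok");
app.Run();
```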

@ArminShoeibi (Author)

I need to downgrade to version 1.23 to stay under the company's radar. I'll provide the corresponding dumps for version 1.23 within the next 6-7 hours.

@ArminShoeibi (Author) commented Nov 26, 2024

1.23 Dump (Baseline)

Installed packages:
Image

Memory usage after a redeploy, done to produce a clean dump:

Image

Flamegraph:
Image

Details:
Image

Image

Image

@ArminShoeibi changed the title from "Notable increase in memory usage from 1.25.3 to 1.30.x" to "Notable increase in memory usage from 1.23.0 to 1.30.x" on Nov 26, 2024
@ArminShoeibi changed the title from "Notable increase in memory usage from 1.23.0 to 1.30.x" to "Notable increase in memory usage from 1.25.x to 1.30.x" on Nov 26, 2024
@stevejgordon (Contributor)

Thanks for the thorough investigation so far, @ArminShoeibi. I appreciate you taking the time to provide these screenshots. I spent some time reviewing our logging code and profiling locally this morning, and I can see the causes of the allocations. There's no apparent leak per se, but some of the undue allocations are ConditionalWeakTable instances, which carry a finaliser penalty that also affects how quickly they are recovered. I have several obvious quick wins I'm going to focus on as soon as possible.

stevejgordon added a commit that referenced this issue on Nov 28, 2024:
This PR dramatically reduces the overhead of scoped logging, even for
scenarios where the configured log level verbosity was low (e.g.,
Warning). The most critical improvement is caching scoped loggers within
Spans, Traces, and Errors. We lose convenience in no longer having the
span ID in the scope, but it's generally available from the Span
argument.

Previously, a new `ScopedLogger` was created for every span and
transaction, and a new `ConditionalWeakTable` was also allocated. Due to
the use of Finalisers on the CWT, this had an even more significant GC
impact. We now maintain a cache of scoped loggers (of which there are
relatively few).
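
(Illustrative sketch of the caching idea described above; the type and member names are invented, not the agent's actual code.)

```csharp
using System.Collections.Concurrent;

// Reuse one scoped logger per scope name instead of allocating a new logger
// (and a new ConditionalWeakTable) for every span/transaction.
sealed class ScopedLoggerCache
{
    private readonly ConcurrentDictionary<string, ScopedLogger> _cache = new();

    public ScopedLogger GetOrAdd(string scopeName) =>
        _cache.GetOrAdd(scopeName, static name => new ScopedLogger(name));
}

sealed class ScopedLogger
{
    public ScopedLogger(string scope) => Scope = scope;
    public string Scope { get; }
}
```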

The ID creation for the Spans has been switched to
`ActivitySpanId.CreateRandom().ToString()` from the MS diagnostics
library, as this has been more heavily optimised than our original
version and achieves the same goal.
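
(For reference, the BCL call mentioned above is used roughly like this:)

```csharp
using System.Diagnostics;

// Produces a random 8-byte span id rendered as a 16-character hex string.
string spanId = ActivitySpanId.CreateRandom().ToString();
```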

In `Tracer.End()`, we used an `Assertion` with a large interpolated
string. This was being interpolated for each transaction as the
`Assertion` path was enabled. However, the condition we assert is
generally not met. Therefore, we now check this manually to only create
the message if we are in an invalid state.
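
(A sketch of the pattern described, with invented names; this is not the agent's actual assertion code.)

```csharp
using System.Diagnostics;

// Example usage: assertions enabled, one span left open -> the message is built.
AssertNoOpenSpans(assertionsEnabled: true, openSpans: 1, transactionId: "abc123");

static void AssertNoOpenSpans(bool assertionsEnabled, int openSpans, string transactionId)
{
    if (!assertionsEnabled)
        return;

    // The large interpolated message is only built on the rare invalid path,
    // not for every transaction that ends.
    if (openSpans != 0)
        Debug.Fail($"Transaction {transactionId} ended with {openSpans} span(s) still open.");
}
```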

I've also switched two structs to record structs, as these were used as
keys in dictionaries and ended up being boxed due to a lack of
`IEquatable<T>` implementations.
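
(A small illustration of the boxing point, with invented types:)

```csharp
// A plain struct without IEquatable<T> falls back to ValueType.Equals(object),
// which boxes on dictionary lookups; a record struct gets compiler-generated
// Equals/GetHashCode and IEquatable<T>, so lookups stay allocation-free.
using System;
using System.Collections.Generic;

struct PlainKey { public string Name; public int Version; }  // boxes during Equals

readonly record struct CacheKey(string Name, int Version);   // no boxing

class Example
{
    static void Main()
    {
        var cache = new Dictionary<CacheKey, string>
        {
            [new CacheKey("scoped-logger", 1)] = "cached instance"
        };
        Console.WriteLine(cache[new CacheKey("scoped-logger", 1)]);
    }
}
```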

These are phase one of optimisations for the logging implementation and
address the general production scenario of using a less verbose log
level.

### Profiling

Profiling was achieved using a .NET 9 ASP.NET Core minimal API project
with 20 concurrent connections and 100,000 requests.

**Before Memory Profile**

![image](https://github.com/user-attachments/assets/6ba52339-dc4e-44a0-811d-039c9807c0fb)

**After Memory Profile**

![image](https://github.com/user-attachments/assets/22a8fdcc-3cab-4b83-bd27-b6107f849899)

**Before Top Retained**

![image](https://github.com/user-attachments/assets/348c837e-a910-4220-8a60-e12b4fd3f961)

**After Top Retained**

![image](https://github.com/user-attachments/assets/d7a31e7b-f5cf-419a-8eaa-ed4042a07470)

Closes #2500
@stevejgordon (Contributor)

@ArminShoeibi 1.31.0 was released today and should hopefully have a significantly better footprint.

@ArminShoeibi (Author)

> @ArminShoeibi 1.31.0 was released today and should hopefully have a significantly better footprint.

Thanks @stevejgordon, I will test it and let you know.

Here is 1.23.0 over the last 7 days:
Image
