Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Disabling NR APM causes trace concatenation in Datadog #692

Closed
22 tasks done
timmc-edx opened this issue Jun 18, 2024 · 5 comments
Closed
22 tasks done

Disabling NR APM causes trace concatenation in Datadog #692

timmc-edx opened this issue Jun 18, 2024 · 5 comments
Assignees

Comments

@timmc-edx
Copy link
Member

timmc-edx commented Jun 18, 2024

Ultimately, this ticket is for disabling New Relic APM across edxapp. We ran into trace related issues in DD when first attempting to disable NR APM. We later caused the same issue in Edge when simply disabling NR Tracing.

This bug has been observed in edxapp (LMS and CMS), enterprise-catalog, and registrar. It can be identified by searching for spans matching operation_name:django.request -@_top_level:*.

Acceptance criteria

Things we have already tried

These should be checked off once they have already been either reverted or made permanent:

  • [keep] Set DD_DJANGO_INSTRUMENT_MIDDLEWARE to reduce the noise when debugging huge traces.
    • No compelling reason to change this back any time soon. Added to Datadog docs as an option.
  • [reverted] Set DD_TRACE_HEADER_TAGS to debug tracing headers
  • [keep] We've changed a number of monitors and dashboards from using Service Entry Spans to operation_name:django.request on All Spans since service entry spans were unreliable.
    • No reason to change this back.
  • [reverted] We removed the NR license key for stage and edge LMS
  • [reverted] We set up a free-tier key as a test in stage and edge LMS
  • [reverted] Experimental patched newrelic agent and the enablement flag
  • [reverted] Set DD_TRACE_PROPAGATION_STYLE_EXTRACT=none
  • DatadogDiagnostics plugin app; addition of that package's middleware via remote-config; the settings and waffle flags that activate or control them. Removal strictly in this order unless otherwise specified:
    • Watch for activity with service:edx-edxapp-* dirname:"/edx/var/log/supervisor" "[edx_arch_experiments.datadog_diagnostics.middleware]"
    • Remove middleware references from remote-config (EXTRA_MIDDLEWARE_CLASSES Django setting) [stage and prod, edge]
    • Remove entire app and middleware from edx-arch-experiments [PR, release, stg/prod bump, edge bump]
    • Remove any Django settings that those used (all prefixed with DATADOG_DIAGNOSTICS_) -- if it's just DATADOG_DIAGNOSTICS_ENABLE it can be merged in any order, as it's just controlling noisy logs we don't have any more. [prod LMS was only instance]
    • Delete Waffle flags (all prefixed with datadog.diagnostics.) -- merge in any order, as these turn features on
  • [reverted] Disable Celery instrumentation via DD_TRACE_CELERY_ENABLED=false, because some of the request spans in anomalous traces have missing parent spans that were celery-related.
  • [reverted] Debug logging of celery signals via DATADOG_DIAGNOSTICS_CELERY_LOG_SIGNALS (using edx-arch-experiments 4.3.0)
  • [reverted] Log span ancestry on anomalous traces to find out what operations are involved.
  • Use a release-candidate of ddtrace (via EDXAPP_DDTRACE_PIP_SPEC) that closes celery spans using a fallback

Details

When we disabled NR APM in edxapp on June 6 we observed two anomalies with traces:

  • At about 17:34, the rate of service entry spans (and traces) in service:edx-edxapp-lms env:prod dropped precipitously by 2-3x.
  • At the same time, the rate of all spans increased by about 5x. (This anomaly appears to have disappeared, retroactively!)

However, we believe the actual traffic was unchanged. This is corroborated by the Django hit metrics remaining steady, as seen in the Service Catalog. We cannot find any relevant code or config changes that would have been deployed around that time.

Our current understanding is that the majority of Django web requests that are traced are not recorded as service entry spans, but are instead parented to a different trace. This causes several problems:

  • Monitors and dashboards that are based on traces or service entry spans are unreliable.
  • It's difficult to impossible to view most of our traces because they become concatenated into 50,000-span-long monsters that can't be loaded in the browser.

Image

We can also reproduce this issue by setting "Tracing type: None" in the application settings in NR (usually set to Distributed Tracing).

Links

@robrap
Copy link
Contributor

robrap commented Jun 18, 2024

Other notes:

@robrap
Copy link
Contributor

robrap commented Jun 21, 2024

Additional thoughts, questions, ideas:

  • We should double-check distributed tracing settings for tracing type (want updated Distributed Tracing).
  • If we decide to try to use the NR test account, we should ensure it is safe. We can remove other users (if they are still there).
  • What does the nginx request header set? Could this be interfering?
  • APM Profiling caused a different problem with additional spans, but is it possible it could heal this issue without NR?
  • If it comes to this, can we find a way to replicate locally?
  • Could there be important information buried in the logs that we are no longer seeing? See Determine if Datadog trace-debug fix needs replacement #610.
  • Could the invalid traceparent log messages be related? See Ensure "received invalid w3c traceparent" has resolved #665.

@robrap robrap mentioned this issue Jun 24, 2024
7 tasks
timmc-edx added a commit to edx/configuration that referenced this issue Jun 25, 2024
timmc-edx added a commit to edx/configuration that referenced this issue Jun 25, 2024
@robrap
Copy link
Contributor

robrap commented Jun 25, 2024

Currently, we're investigating if using a NR Free Tier account for edxapp is enough to get DD traces working.

Other possibilities may include trying to get tracing (or APM) disabled everywhere in Edge. This includes where Spans were found in the last day:

prod-edge-edxapp-cms 9.28 M
prod-edge-analytics-api 2.35 M
prod-edge-notes 2.3 M
prod-edge-edxapp-workers-lms 755 k
prod-edge-forum 675 k
prod-edge-edxapp-workers-cms 188 k

timmc-edx added a commit to edx/newrelic-python-agent that referenced this issue Jun 26, 2024
…tation) (#1)

If the Django setting `EDX_NEWRELIC_NO_REPORT` is present and enabled, the agent will not talk to New Relic's servers and will instead use a set of previously captured responses from our sandbox account.

Instrumentation (tracing, etc.) will still be in place, but the data will be discarded rather than being reported.

See edx/edx-arch-experiments#692
@robrap
Copy link
Contributor

robrap commented Jun 27, 2024

[idea] We might want 3 modes for our hacked NR agent:

  1. Send no data to NR, but add tracing info (fixes DD traces).
  2. Send no data to NR, and fake a bad account id so we don't even take a performance hit (breaks DD traces).
  3. Send data to NR (costs money, but if we want to verify anything temporarily, we can.

timmc-edx added a commit that referenced this issue Jul 10, 2024
timmc-edx added a commit that referenced this issue Jul 10, 2024
See #692

Testing setup:
https://2u-internal.atlassian.net/wiki/spaces/ENG/pages/1173618788/Running+Datadog+in+devstack

And then in lms-shell:

```
make requirements
pip install ddtrace
pip install -e /edx/src/archexp/
./wrap-datadog.sh ./server.sh
```

Expect to see this log message:
`Attached MissingSpanProccessor for Datadog diagnostics`
timmc-edx added a commit that referenced this issue Jul 10, 2024
See #692

Testing setup:
https://2u-internal.atlassian.net/wiki/spaces/ENG/pages/1173618788/Running+Datadog+in+devstack

And then in lms-shell:

```
make requirements
pip install ddtrace
pip install -e /edx/src/archexp/
./wrap-datadog.sh ./server.sh
```

Expect to see this log message:
`Attached MissingSpanProccessor for Datadog diagnostics`

NOTE: This prints "Spans created = 0; spans finished = 0" in devstack when
shut down with ctrl-c, but not when restarted due to autoreload (where it
prints correct info). Something is initializing Django twice and one span
processor is getting span info while the other is printing at shutdown.
There's more to debug here, but it seems stable enough to least try
deploying it.
timmc-edx added a commit that referenced this issue Jul 10, 2024
See #692

Testing setup:
https://2u-internal.atlassian.net/wiki/spaces/ENG/pages/1173618788/Running+Datadog+in+devstack

And then in lms-shell:

```
make requirements
pip install ddtrace
pip install -e /edx/src/archexp/
./wrap-datadog.sh ./server.sh
```

Expect to see this log message:
`Attached MissingSpanProccessor for Datadog diagnostics`

NOTE: This prints "Spans created = 0; spans finished = 0" in devstack when
shut down with ctrl-c, but not when restarted due to autoreload (where it
prints correct info). Something is initializing Django twice and one span
processor is getting span info while the other is printing at shutdown.
There's more to debug here, but it seems stable enough to least try
deploying it.
timmc-edx added a commit that referenced this issue Jul 24, 2024
Adds logging diagnostics for traces in Datadog.

See #692
timmc-edx added a commit that referenced this issue Jul 24, 2024
Adds logging diagnostics for traces in Datadog.

See #692
@robrap robrap removed their assignment Aug 8, 2024
timmc-edx added a commit to edx/configuration that referenced this issue Aug 16, 2024
timmc-edx added a commit to edx/configuration that referenced this issue Aug 20, 2024
timmc-edx added a commit to openedx/edx-platform that referenced this issue Sep 13, 2024
- Convert `/heartbeat` view into a celery test
- Send Celery tasks to a broker, rather than running in-process
- Hardcode a broker URL
- Log all celery signals

See edx/edx-arch-experiments#692
timmc-edx added a commit to openedx/edx-platform that referenced this issue Sep 13, 2024
- Add `/celery_repro` URL to run a sample task
- Send Celery tasks to a broker, rather than running in-process
- Hardcode a broker URL
- Log all celery signals

See edx/edx-arch-experiments#692
timmc-edx added a commit to edx/configuration that referenced this issue Oct 3, 2024
Introduces `EDXAPP_NEWRELIC_ENABLE` and sets it to false so edxapp is no
longer drawing from the common `COMMON_ENABLE_NEWRELIC_APP` variable.

This is now possible thanks to fixes in ddtrace 2.14.2.

See edx/edx-arch-experiments#692
timmc-edx added a commit to edx/configuration that referenced this issue Oct 3, 2024
…le (#74)

Introduces `EDXAPP_NEWRELIC_ENABLE` and sets it to false so edxapp is no
longer drawing from the common `COMMON_ENABLE_NEWRELIC_APP` variable.

This is now possible thanks to fixes in ddtrace 2.14.2.

See edx/edx-arch-experiments#692
@timmc-edx
Copy link
Member Author

We'll leave the datadog_diagnostics cleanup task for about a week, for cleanup 2024-10-14 or later. Moving to blocked for until then.

@timmc-edx timmc-edx moved this from In Progress to Blocked in Arch-BOM Oct 4, 2024
@jristau1984 jristau1984 moved this from Blocked to In Progress in Arch-BOM Oct 21, 2024
timmc-edx added a commit that referenced this issue Oct 22, 2024
These are no longer in use in edxapp as of:

- edx/edx-internal#11806 (prod, stage)
- edx/edge-internal#790 (edge)

Also remove testing dependencies that were only in use by this app.

See #692 for merge order.
timmc-edx added a commit that referenced this issue Oct 22, 2024
These are no longer in use in edxapp as of:

- edx/edx-internal#11806 (prod, stage)
- edx/edge-internal#790 (edge)

Also remove testing dependencies that were only in use by this app.

See #692 for merge order.
timmc-edx added a commit that referenced this issue Oct 22, 2024
These are no longer in use in edxapp as of:

- edx/edx-internal#11806 (prod, stage)
- edx/edge-internal#790 (edge)

Also remove testing dependencies that were only in use by this app.

See #692 for merge order.
@github-project-automation github-project-automation bot moved this from In Progress to Done in Arch-BOM Oct 22, 2024
@jristau1984 jristau1984 moved this from Done to Done - Long Term Storage in Arch-BOM Nov 1, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Done - Long Term Storage
Development

No branches or pull requests

2 participants