Ensure Kafka Event Bus APM monitoring #658

Closed
5 tasks done
robrap opened this issue May 24, 2024 · 9 comments

robrap commented May 24, 2024

This ticket is for determining what we want and need from Kafka APM monitoring, and for implementing that work or spinning off appropriate tickets.

Tasks:

  • Open a DD Support ticket for the missing functionality. Maybe point to the New Relic instrumentation code that does this?
  • Add the missing data to DD. This may require manual span instrumentation rather than waiting on the DD Support ticket.
  • Complete Kafka-related monitors (for the new data) and runbooks for platform-arch-bom-event-bus-safety-net. (Log-based monitors exist, but are imperfect.)
  • Implement dashboard: Event Bus Kafka overview
  • @robrap: Move naming convention questions that are not needed as part of this ticket to a new ticket. (See Kafka consumer service naming conventions in DD #740)

Notes:

  • Do we need to come up with an operation_name value to use in place of django.request (for example)? Something like consumer.consume (to go with kafka.consume)? (A sketch of manual instrumentation follows these notes.)
  • In New Relic, there are Transactions of type Message for Kafka.
    • See edx-prod-discovery example transactions.
    • Example transaction name: OtherTransaction/Message/Kafka/Topic/Named/prod-course-catalog-info-changed
    • Provides actual message processing time details, with Trace details.
    • So far, I've been unable to find this type of information in Datadog.
  • Here is a doc for ddtrace Kafka integration.
    • It claims the integration works automatically, but that may only create the service:kafka operation_name:kafka.consume spans, which carry limited information.
    • DD_KAFKA_SERVICE could be used to change service:kafka from its default.
    • The question of remapping this service has been moved to Remap shared services for edx-edxapp in DD #737.
  • Private ticket for Datadog Support regarding orphaned spans in Kafka consumers: https://help.datadoghq.com/hc/en-us/requests/1789792
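
As a rough illustration of the manual span instrumentation discussed above, here is a minimal sketch assuming a confluent_kafka-style consumer and the proposed consumer.consume operation name. The service name, tag name, and process_message handler are illustrative placeholders, not existing conventions:

from ddtrace import tracer

def handle(msg):
    # Manual span using the proposed operation name. If an existing
    # kafka.consume span is active, this span becomes its child.
    with tracer.trace(
        "consumer.consume",             # proposed operation_name (see note above)
        service="my-consumer-service",  # hypothetical service name
        resource=msg.topic(),           # msg is a confluent_kafka Message
    ) as span:
        span.set_tag("kafka.topic", msg.topic())  # illustrative tag name
        process_message(msg)  # hypothetical message handler
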
@robrap robrap added this to Arch-BOM May 24, 2024
@robrap robrap converted this from a draft issue May 24, 2024
@robrap robrap changed the title Ensure Kafka APM monitoring Ensure Kafka Event Bus APM monitoring May 24, 2024

robrap commented May 24, 2024

@robrap robrap moved this from Prioritized to Groomed in Arch-BOM Jun 10, 2024
@robrap robrap mentioned this issue Jun 21, 2024

robrap commented Jun 25, 2024

@dianakhuang dianakhuang self-assigned this Jul 3, 2024
@dianakhuang dianakhuang moved this from Groomed to In Progress in Arch-BOM Jul 8, 2024

robrap commented Jul 22, 2024

@dianakhuang: This comment should probably become a separate ticket, but I'm adding it here to start. I noticed that the log error "failed to send, dropping 1 traces to intake at unix:///var/run/datadog/apm.socket/v0.5/traces after 3 retries" seems to be hitting our Kafka consumers. It may be hitting some other workers as well, but I'm not sure whether we simply have inconsistent naming. I'm wondering whether this has anything to do with the long-running infinite loop, and whether we need to clean up the trace the way we clean up the db connection, etc. I'm adding this here while you are thinking about this, but as noted, it might need a separate ticket and a separate DD Support ticket.

UPDATE: This has been moved to a new ticket: #736


robrap commented Jul 26, 2024

@dianakhuang:

  1. I moved most of the service naming questions to other tickets.
  2. However, one question for this ticket is whether the new spans you will be creating should be root spans, or child spans of the operation_name:kafka.consume spans, which are probably already available as the current span. (See the sketch at the end of this comment.)
  3. I updated the proposed operation name to consumer.consume (to go with the existing kafka.consume) in the PR description.

UPDATE: Added point 3 as well.
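
To illustrate the root-versus-child question in point 2, here is a sketch of both options with ddtrace (span names and the process() body are placeholders): tracer.trace() parents the new span under whatever span is currently active, while tracer.start_span() without a child_of starts a new root span:

from ddtrace import tracer

def process():
    ...  # placeholder for the actual message processing

# Option A: a child of the currently active span
# (e.g., the existing kafka.consume span, if it is active).
with tracer.trace("consumer.consume"):
    process()

# Option B: a brand-new root span, which starts a separate trace.
with tracer.start_span("consumer.consume", activate=True):
    process()
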


robrap commented Jul 30, 2024

What we want:

  • Processing time of a message
  • Spans for mysql, cache, etc. that happen during the message processing
  • Span tags with topic, etc., on the root span. (See the sketch after these lists.)

Ideas:

  • Find an example kafka consume span (@kafka.received_message:True) that seems like it should be generating request or mysql spans, and ask DD Support why those spans don't appear in the trace.
  • Check DD trace code for what it does in Kafka.

Questions:

  • Does the consume span close before the full processing time has been captured?
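
One possible way to get those root-span tags, assuming the message processing runs while a trace is active; the tag names are illustrative, and msg stands in for a confluent_kafka Message:

from ddtrace import tracer

root = tracer.current_root_span()  # None if no trace is active
if root is not None:
    root.set_tag("kafka.topic", msg.topic())          # illustrative tag names;
    root.set_tag("kafka.partition", msg.partition())  # msg is a confluent_kafka Message
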


robrap commented Jul 30, 2024

Note: We may want to retain 100% of spans with the newly defined operation_name. We'll see.


timmc-edx commented Aug 5, 2024

Datadog Support confirms that there is no automatic support for connecting the producer's trace to the spans that come out of the consumer's work. However, we can implement this ourselves if we need it:

Confirming that the functionality difference you've described between NR and DD currently does not exist for us OOTB, and would require some custom code to implement. One of our engineering folks provided this example, using the ddtrace propagator class, and using a manual span to house any post-message processing:

from ddtrace import tracer
from ddtrace.propagation.http import HTTPPropagator as Propagator

msg = consumer.poll()

ctx = None
if msg is not None and msg.headers():
    # Extract the distributed context from message headers
    ctx = Propagator.extract(dict(msg.headers()))
with tracer.start_span(
    name="kafka-message-processing", # or whatever name they want from the manual span
    service="their service name", # match their main service name
    child_of=ctx if ctx is not None else tracer.context_provider.active(),
    activate=True
):
    # do any db or other operations that you want included in the distributed context
    db.execute()

One important note here: You'll want to ensure for both producer and consumer services, the following environment variable has been set: DD_KAFKA_PROPAGATION_ENABLED=true. Using this, the trace should include both producer and consumer spans as well as later operation spans.

(It would probably be more appropriate for us to use Span Links but those are only available via the OpenTelemetry integration.)
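
For completeness, the producer side of the example above might look like the following sketch, assuming the service runs under ddtrace-run (so confluent_kafka is patched) with DD_KAFKA_PROPAGATION_ENABLED=true; the broker address, topic, and payload are placeholders:

from confluent_kafka import Producer

# With the ddtrace confluent_kafka integration active and
# DD_KAFKA_PROPAGATION_ENABLED=true, produce() should inject the current
# trace context into the message headers automatically, which is what the
# consumer-side Propagator.extract() call above picks up.
producer = Producer({"bootstrap.servers": "localhost:9092"})  # placeholder config
producer.produce("prod-course-catalog-info-changed", value=b"...")  # placeholder topic/payload
producer.flush()
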

@timmc-edx

^ Converted that distributed tracing info to its own ticket: #758

@github-project-automation github-project-automation bot moved this from In Progress to Done in Arch-BOM Aug 21, 2024
@jristau1984 jristau1984 moved this from Done to Done - Long Term Storage in Arch-BOM Sep 30, 2024