Ensure Kafka Event Bus APM monitoring #658

Closed
5 tasks done
robrap opened this issue May 24, 2024 · 9 comments

robrap commented May 24, 2024

This ticket is for determining what we want and need from Kafka APM monitoring, and for implementing that work or spinning off appropriate tickets.

Tasks:

  • Open a DD Support ticket for the missing functionality. Maybe point to the New Relic instrumentation code that does this?
  • Add the missing data to DD. This may require manual span instrumentation rather than waiting on the DD Support ticket.
  • Complete Kafka-related monitors (for the new data) and runbooks for platform-arch-bom-event-bus-safety-net. (Log-based monitors exist, but are imperfect.)
  • Implement dashboard: Event Bus Kafka overview
  • @robrap: Move naming convention questions that are not needed as part of this ticket to a new ticket. (See Kafka consumer service naming conventions in DD #740)

Notes:

  • Do we need to come up with an operation_name value to use in place of django.request (for example)? Something like consumer.consume (to go with kafka.consume)? (A sketch of manual instrumentation follows these notes.)
  • In New Relic, there are Transactions of type Message for Kafka.
    • See edx-prod-discovery example transactions.
    • Example transaction name: OtherTransaction/Message/Kafka/Topic/Named/prod-course-catalog-info-changed
    • Provides actual message processing time details, with Trace details.
    • So far, I've been unable to find this type of information in Datadog.
  • Here is a doc for ddtrace Kafka integration.
    • It claims the integration works automatically, but that may only create the service:kafka operation_name:kafka.consume spans, which carry limited information.
    • DD_KAFKA_SERVICE could be used to change service:kafka from its default.
    • The question of remapping this service has been moved to Remap shared services for edx-edxapp in DD #737.
  • Private ticket for Datadog Support regarding orphaned spans in Kafka consumers: https://help.datadoghq.com/hc/en-us/requests/1789792
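
As a rough illustration of the manual span instrumentation discussed above, here is a minimal sketch assuming a confluent_kafka-style consumer and the proposed consumer.consume operation name. The service name, tag name, and process_message handler are illustrative placeholders, not existing conventions:

from ddtrace import tracer

def handle(msg):
    # Manual span using the proposed operation name. If an existing
    # kafka.consume span is active, this span becomes its child.
    with tracer.trace(
        "consumer.consume",             # proposed operation_name (see note above)
        service="my-consumer-service",  # hypothetical service name
        resource=msg.topic(),           # msg is a confluent_kafka Message
    ) as span:
        span.set_tag("kafka.topic", msg.topic())  # illustrative tag name
        process_message(msg)  # hypothetical message handler
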
@robrap robrap added this to Arch-BOM May 24, 2024
@robrap robrap converted this from a draft issue May 24, 2024
@robrap robrap changed the title Ensure Kafka APM monitoring Ensure Kafka Event Bus APM monitoring May 24, 2024

robrap commented May 24, 2024

@robrap robrap moved this from Prioritized to Groomed in Arch-BOM Jun 10, 2024
@robrap robrap mentioned this issue Jun 21, 2024

robrap commented Jun 25, 2024

@dianakhuang dianakhuang self-assigned this Jul 3, 2024
@dianakhuang dianakhuang moved this from Groomed to In Progress in Arch-BOM Jul 8, 2024

robrap commented Jul 22, 2024

@dianakhuang: This comment should probably become a separate ticket, but I'm adding it here to start. I noticed that the log error "failed to send, dropping 1 traces to intake at unix:///var/run/datadog/apm.socket/v0.5/traces after 3 retries" seems to be hitting our Kafka consumers. It may be hitting some other workers as well, but I'm not sure whether we simply have inconsistent naming. I'm wondering whether this has anything to do with the long-running infinite loop, and whether we need to clean up the trace the way we clean up the db connection, etc. I'm adding this here while you are thinking about this, but as noted, it might need a separate ticket and a separate DD Support ticket.

UPDATE: This has been moved to a new ticket: #736


robrap commented Jul 26, 2024

@dianakhuang:

  1. I moved most of the service naming questions to other tickets.
  2. However, one question for this ticket is whether the new spans you will be creating should be root spans, or child spans of the operation_name:kafka.consume spans, which are probably already available as the current span. (See the sketch at the end of this comment.)
  3. I updated the proposed operation name to consumer.consume (to go with the existing kafka.consume) in the PR description.

UPDATE: Added point 3 as well.
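
To illustrate the root-versus-child question in point 2, here is a sketch of both options with ddtrace (span names and the process() body are placeholders): tracer.trace() parents the new span under whatever span is currently active, while tracer.start_span() without a child_of starts a new root span:

from ddtrace import tracer

def process():
    ...  # placeholder for the actual message processing

# Option A: a child of the currently active span
# (e.g., the existing kafka.consume span, if it is active).
with tracer.trace("consumer.consume"):
    process()

# Option B: a brand-new root span, which starts a separate trace.
with tracer.start_span("consumer.consume", activate=True):
    process()
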


robrap commented Jul 30, 2024

What we want:

  • Processing time of a message
  • Spans for mysql, cache, etc. that happen during the message processing
  • Span tags with topic, etc., on the root span. (See the sketch after these lists.)

Ideas:

  • Find an example kafka consume span (@kafka.received_message:True) that seems like it should be generating request or mysql spans, and ask DD Support why those spans don't appear in the trace.
  • Check DD trace code for what it does in Kafka.

Questions:

  • Does the consume span close before the full processing time has been captured?
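
One possible way to get those root-span tags, assuming the message processing runs while a trace is active; the tag names are illustrative, and msg stands in for a confluent_kafka Message:

from ddtrace import tracer

root = tracer.current_root_span()  # None if no trace is active
if root is not None:
    root.set_tag("kafka.topic", msg.topic())          # illustrative tag names;
    root.set_tag("kafka.partition", msg.partition())  # msg is a confluent_kafka Message
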


robrap commented Jul 30, 2024

Note: We may want to retain 100% of spans with the newly defined operation_name. We'll see.


timmc-edx commented Aug 5, 2024

Datadog Support confirms that there is no automatic support for connecting the producer's trace to the spans that come out of the consumer's work. However, we can implement this ourselves if we need it:

Confirming that the functionality difference you've described between NR and DD currently does not exist for us OOTB, and would require some custom code to implement. One of our engineering folks provided this example, using the ddtrace propagator class, and using a manual span to house any post-message processing:

from ddtrace import tracer
from ddtrace.propagation.http import HTTPPropagator as Propagator

msg = consumer.poll()

ctx = None
if msg is not None and msg.headers():
    # Extract the distributed context from message headers
    ctx = Propagator.extract(dict(msg.headers()))
with tracer.start_span(
    name="kafka-message-processing", # or whatever name they want from the manual span
    service="their service name", # match their main service name
    child_of=ctx if ctx is not None else tracer.context_provider.active(),
    activate=True
):
    # do any db or other operations that you want included in the distributed context
    db.execute()

One important note here: You'll want to ensure for both producer and consumer services, the following environment variable has been set: DD_KAFKA_PROPAGATION_ENABLED=true. Using this, the trace should include both producer and consumer spans as well as later operation spans.

(It would probably be more appropriate for us to use Span Links but those are only available via the OpenTelemetry integration.)
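
For completeness, the producer side of the example above might look like the following sketch, assuming the service runs under ddtrace-run (so confluent_kafka is patched) with DD_KAFKA_PROPAGATION_ENABLED=true; the broker address, topic, and payload are placeholders:

from confluent_kafka import Producer

# With the ddtrace confluent_kafka integration active and
# DD_KAFKA_PROPAGATION_ENABLED=true, produce() should inject the current
# trace context into the message headers automatically, which is what the
# consumer-side Propagator.extract() call above picks up.
producer = Producer({"bootstrap.servers": "localhost:9092"})  # placeholder config
producer.produce("prod-course-catalog-info-changed", value=b"...")  # placeholder topic/payload
producer.flush()
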

@timmc-edx

^ Converted that distributed tracing info to its own ticket: #758

@github-project-automation github-project-automation bot moved this from In Progress to Done in Arch-BOM Aug 21, 2024
@jristau1984 jristau1984 moved this from Done to Done - Long Term Storage in Arch-BOM Sep 30, 2024