-
Notifications
You must be signed in to change notification settings - Fork 3.6k
Improve use of tracing spans in query path #25911
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
@hiltontj Here are the trace spans that are collected for IOx/InfluxDB 3 projects. These trace spans are customizable and they can be added/removed/updated/renamed per need. The IOx team at times for their works/projects create new or sub spans that can be analyzed for performance. Note: Below is the exhaustive list, and most of these traces are not captured.
|
Thanks @MaduMitha-Ravi - just to be clear, by "most of these traces are not captured", do you mean that monolith is not capturing them, while IOx is capturing them? ...or do you mean that they are not captured in general by both monolith and IOx? |
@hiltontj Yes, most of those trace spans are captured in IOx but not with Monolith. I will try to get more details today. |
Got it - that is very helpful. I can look into how IOx uses those spans and see how we can replicate. I think that would be a good starting point to help surface useful information in the tracing for your team. Beyond that, we can start adding monolith-specific traces. |
Here is the sample from IOx from a latest run that returns data,
|
@hiltontj I do understand you have so many things that needs your focus, but wondering if there is an update on the Traces. |
@pauldix I've got to the point where I roughly understand how tracing spans work in IOx & monolith and I've got a setup where I can send traces to jaeger in OSS (see #26006) but with respect to your comment in Slack:
I think this might be tough to do if you want the cache/object store spans to inherit the parent span of the calling context since all object store access and caching is mediated through the ObjectStore trait, for example: The reason this makes things difficult is that we need to be able to pass either a Span (https://github.com/influxdata/influxdb_iox/blob/7419ce7dc877d9073344e6c77221780142753322/trace/src/span.rs#L54) or a SpanContext (https://github.com/influxdata/influxdb_iox/blob/7419ce7dc877d9073344e6c77221780142753322/trace/src/ctx.rs#L44) from the caller but the ObjectStore trait methods are more or less fixed as they are (ie they don't take a Span or SpanContext). I looked through the memory cache implementation in IOx and saw a similar lack of trace span reporting. Instead they seem to rely simply on reporting cache hit/miss/error metrics: There are two approaches i can think of to work around this: Separate Span Hierarchy for cached ObjectStore implementationWith this approach we would create a separate span hierarchy just for object store / cache accesses when initializing the memory cache object store. The main problem with this is that when we are troubleshooting a problematic query span there will be nothing (as far as i can tell) that would tie the objectstore/cache access to the problematic query. Use
|
Any chance we can update the object store trait to take an optional span? Seems like that's what should be happening. Tracking metrics seems less than ideal if we can actually get tracing on it. |
Maybe if we made the changes purely additive (ie new methods on the trait with default no-op/error implementations to avoid forcing other implementers of the trait to make supporting changes), but even if we could get buy-in from maintainers (many of whom work at InfluxData as well, so that could happen) there is the problem that the tracing spans we are talking about are very bespoke to InfluxData projects: https://github.com/influxdata/influxdb3_core/tree/main/trace I think this is a problem because the I think the simplest approach is to use the |
So I just realized it's likely that we're passing these Metrics might be the best thing we can get without a major refactor spanning several repos. |
Okay I spent some time looking around IOx code to try to understand how spans like Specifically, the more interesting A similar approach could work for the
|
By the way, I realized the |
@waynr Do we have an update on the Trace spans work? |
@MaduMitha-Ravi I just merged this PR which adds tracing spans to the compactor: https://github.com/influxdata/influxdb_pro/pull/538 The object_store crate also has a new extensions feature that will let us add properly-parented spans to queries that access the object store -- this will let us discern whether a query involves object store accesses over the network or the in-memory parquet file cache is used instead. |
To be clear with respect to the object_store work, I still need to update a datafusion PR to pass on extensions from the session config to the object store API calls that it makes (see apache/datafusion#14804) |
The purpose of this issue is to get a better understanding of how
influxdb3
uses trace spans in the query path.To better help diagnose and troubleshoot latency and performance issues in the query path, we may need to better leverage tracing spans in the query path, as what we have seen in results from the perf team, the information we are generating is sparse.
Objectives of this issue
iox_query
and which is generated from this codebaseThe text was updated successfully, but these errors were encountered: