Overview
This is a rewrite of user annotations that attempts to address some of the issues with the previous file/json approach. At a high level, this PR introduces the following changes:
Open Questions
There are still a number of details that need further consideration.
Annotations vs Traces
Do we need both annotations and traces, or are traces enough? We can generate annotations from traces, but we need to find the right mapping from traces to annotations (some ideas are discussed below as part of other topics).
Unstructured vs Hierarchical
Do we want to enforce any structure in our traces to make it easier to display results?
At the moment spans are totally independent, no particular structure is enforced, and they can overlap (so we can potentially trace async operations, although that's probably not our main goal). That said, it's still possible to display hierarchies by reusing the same markers/labels. For example, the following trace is displayed hierarchically by reusing the "EPOCH" marker for two different spans:
Unstructured is more flexible but also harder to visualize. The main motivation to enforce a particular structure would be to make visualization easier.
Flat vs Multi-level Traces
It's very challenging to display complex traces in system-wide mode because post-processing is very limited. For example, it's hard to sort the different spans in the trace, particularly when there's a hierarchy and/or multiple processes/ranks.
One option to address this issue in system-wide visualizations is to flatten the traces so we only display the most recent (or top) marker instead of the entire hierarchy. For example, we can display a flatter trace as follows:
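To make the flattening idea concrete, here is a minimal sketch of picking the most recently started span that is active at a given sample time. Everything in it is an assumption for illustration: the `Span` record, the `most_recent_marker` helper, and the "FORWARD"/"BACKWARD" labels are hypothetical and not part of this PR; only the "EPOCH" marker comes from the example above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    """Hypothetical span record: a label plus start/end timestamps (seconds)."""
    label: str
    start: float
    end: Optional[float] = None  # None means the span is still open


def most_recent_marker(spans, t):
    """Return the label of the most recently started span active at time t."""
    active = [s for s in spans if s.start <= t and (s.end is None or t < s.end)]
    if not active:
        return None
    # Among overlapping spans, the one that started last wins.
    return max(active, key=lambda s: s.start).label


# Illustrative data: one outer span with two nested spans.
spans = [
    Span("EPOCH", 0.0, 10.0),
    Span("FORWARD", 1.0, 4.0),
    Span("BACKWARD", 4.0, 9.0),
]

# One label per sample instead of the full hierarchy.
flat = [most_recent_marker(spans, t) for t in (0.5, 2.0, 5.0, 9.5, 11.0)]
```

Displaying the top marker instead would only require replacing `max` with `min` in the last line of the helper.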
Even if we simplify system-wide traces, we can keep the more complex/advanced visualizations for user-mode Grafana and/or reports.
Limit the number of trace messages per sample
At the moment the collector will attempt to receive all messages in the queue in every sample. This can be easily abused. I'd like to limit the number of messages that the collector will attempt to read per sample. It probably makes sense to make this number configurable.
We can limit by time or by number of messages. For reference, the first pull takes approximately 40-50us, and subsequent pulls in the same iteration/sample take ~4us each. In worst-case scenarios, that can translate to ~0.55ms for 100 messages, or ~5.05ms for 1000 messages.
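As a rough back-of-the-envelope check, the per-sample cost can be modeled as a fixed first-pull cost plus a per-message cost. The sketch below is only illustrative: `estimate_us` is a hypothetical helper, and the 5us per-message default is an assumption chosen because it reproduces the quoted ~0.55ms and ~5.05ms figures (with the measured ~4us the totals would come out somewhat lower).

```python
def estimate_us(messages, first_pull_us=50, per_message_us=5):
    """Estimated time in microseconds to pull `messages` messages in one sample.

    Simple model: fixed first-pull overhead plus a constant per-message cost.
    Both defaults are assumptions, not measurements from the collector.
    """
    if messages <= 0:
        return 0
    return first_pull_us + messages * per_message_us


for n in (100, 128, 256, 1000):
    print(f"{n:>5} messages: ~{estimate_us(n) / 1000:.2f} ms")
```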
It probably makes sense to allow at least 128-256 messages (or 1 message per core in a decent CPU).
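A possible shape for the limit, as a minimal sketch: drain at most `max_messages` per sample, with an optional time budget as a second guard. This uses a standard-library `queue.Queue` as a stand-in for the real transport, and the `drain` function and its parameter names are hypothetical, not the collector's actual code.

```python
import queue
import time


def drain(q, max_messages=256, budget_s=None):
    """Pull up to max_messages from q without blocking.

    Stops early if the queue is empty or the optional time budget is spent.
    Anything left in the queue is picked up by the next sample.
    """
    messages = []
    deadline = None if budget_s is None else time.monotonic() + budget_s
    while len(messages) < max_messages:
        if deadline is not None and time.monotonic() >= deadline:
            break
        try:
            messages.append(q.get_nowait())
        except queue.Empty:
            break
    return messages


# Illustrative usage: a flooded queue only costs one bounded batch per sample.
q = queue.Queue()
for i in range(1000):
    q.put(f"msg-{i}")
batch = drain(q, max_messages=256)
```

With a count limit the worst-case cost per sample is predictable; the time budget is there in case individual pulls are slower than expected.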
Configuration for user mode
The current approach will only send trace messages to a single Omnistat collector instance. That works fine if we are only working in either system mode or user mode. If we are using Omnistat user mode in a cluster that also runs Omnistat system mode, we likely want traces reported to the user mode instance. (We could try a PUB/SUB approach to communicate with multiple Omnistat instances, but I don't think it's worth the additional complexity for this particular use case.)
This raises the question of how to configure the socket path in the application. I don't think it makes sense to make applications aware of the Omnistat configuration file. I'm thinking of making this configuration available to applications as an environment variable, e.g. OMNISTAT_TRACE_PATH.
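On the application side, the lookup could be as simple as the sketch below. `OMNISTAT_TRACE_PATH` is the variable proposed above; the `trace_socket_path` helper and the default path are hypothetical placeholders, since the actual default is not decided here.

```python
import os

# Hypothetical fallback used when the variable is unset or empty.
DEFAULT_TRACE_PATH = "/tmp/omnistat-trace.sock"


def trace_socket_path(environ=os.environ):
    """Return the socket path the application should send trace messages to."""
    return environ.get("OMNISTAT_TRACE_PATH") or DEFAULT_TRACE_PATH


path = trace_socket_path()
```

In a mixed setup, a user mode launcher could then export `OMNISTAT_TRACE_PATH` for the job so that applications report to the user mode instance without reading the Omnistat configuration file.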
Tasks