Skip to content

[ENH]: add execution time to read/write tracing #4616

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 10 commits into
base: main
Choose a base branch
from

Conversation

c-gamble
Copy link

@c-gamble c-gamble commented May 23, 2025

Description of changes

This PR introduces execution time tracing to read and write operations on collections. We introduce a request_execution_time_ns field to any tracing event whose request performs a read or write operation on a collection. These are bundled and emitted on the frontend.

TODO:

  • Calculate request execution time
  • Bundle execution time measurements on frontend to flush to S3 as floating point arrays

Test plan

  • Tests pass locally with :
    - [x] pytest for python,
    - [x] yarn test for js
    - [x] cargo test for rust

Documentation Changes

N/A

@c-gamble c-gamble requested a review from drewkim May 23, 2025 00:05
Copy link

Reviewer Checklist

Please leverage this checklist to ensure your code review is thorough before approving

Testing, Bugs, Errors, Logs, Documentation

  • Can you think of any use case in which the code does not behave as intended? Have they been tested?
  • Can you think of any inputs or external events that could break the code? Is user input validated and safe? Have they been tested?
  • If appropriate, are there adequate property based tests?
  • If appropriate, are there adequate unit tests?
  • Should any logging, debugging, tracing information be added or removed?
  • Are error messages user-friendly?
  • Have all documentation changes needed been made?
  • Have all non-obvious changes been commented?

System Compatibility

  • Are there any potential impacts on other parts of the system or backward compatibility?
  • Does this change intersect with any items on our roadmap, and if so, is there a plan for fitting them together?

Quality

  • Is this code of a unexpectedly high quality (Readability, Modularity, Intuitiveness)

@@ -589,6 +597,7 @@ impl ServiceBasedFrontend {
database_name,
source_collection_id,
target_collection_name,
received_at_timestamp_ns: request_received_at_timestamp_ns,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm this feels a bit leaky to expose in all methods we call, I think we ought to think of a better abstraction/design here. Will think on this. cc @sanketkedia @Sicheng-Pan

Copy link
Contributor

@drewkim drewkim May 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator

@HammadB HammadB May 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an async runtime, so I doubt that works since tasks can move between threads. That also feels overly magic. I think we are just in need of some encapsulation or indirection.

Copy link
Contributor

@drewkim drewkim May 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tokio::task_local? https://docs.rs/tokio_wasi/latest/tokio/macro.task_local.html https://docs.rs/axum/0.7.7/axum/#using-tokios-task_local-macro

Edit:

I think we are just in need of some encapsulation or indirection.

Makes sense

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -762,6 +780,7 @@ impl ServiceBasedFrontend {
documents,
uris,
metadatas,
received_at_timestamp_ns: request_received_at_timestamp_ns,
Copy link
Collaborator

@HammadB HammadB May 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aside - I suggest the use of instant for this sort of a task - https://users.rust-lang.org/t/why-is-std-instant-opaque/74786

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

with the current implementation relying on passing the start time through the request object, we need whatever type we use to be serializable via utoipa. we can opt for chrono::DateTimechrono::Utc as we figure out a better long-term metering solution.

@@ -623,6 +624,14 @@ impl ServiceBasedFrontend {
database: database_name,
collection_id: source_collection_id.0,
latest_collection_logical_size_bytes,
request_execution_time_ns: Some(
Copy link
Collaborator

@HammadB HammadB May 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One potential option is for service based frontend to accept a partially built MeterEvent, and add the subsequent data to it and submit it. We can follow the https://cliffle.com/blog/rust-typestate/ and have partial types. This is a bit verbose so I don't love it but its an option.

@@ -631,6 +676,13 @@ impl Bindings {
database: String,
py: Python<'_>,
) -> ChromaPyResult<QueryResponse> {
let request_received_at_timestamp_ms = SystemTime::now()
.duration_since(UNIX_EPOCH)
// We use a `QueryError` here because by convention in the service implementations
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice comment

@c-gamble c-gamble marked this pull request as ready for review May 23, 2025 23:42
Copy link
Contributor

Add Execution Time Tracing to Read/Write Operations

This PR introduces tracking of execution time for read and write operations in the Chroma database, adding a request_execution_time_ns field to relevant tracing MeterEvent variants. This field is propagated and emitted on the frontend, using chrono::DateTimechrono::Utc for timestamp tracking, with extensive plumbing to ensure all affected API requests include the request start timestamp. The changes span request/response data types, core logic in the Rust frontend implementation, and corresponding updates to Python bindings, testing infrastructure, and dependencies.

Key Changes:
• Introduced request_execution_time_ns (optional u128, serialized as hex) on CollectionFork, CollectionRead, and CollectionWrite tracing events.
• Tracked request start time (received_at_timestamp; chrono::DateTime) in all major API request structs and methods.
• Calculated execution duration at end of request handling (using Utc::now - request_received_at_timestamp) and placed result in MeterEvent.
• Propagated and required received_at_timestamp throughout Rust API layer, Rust-Python bindings, axum (REST) server handlers, and test helpers.
• Added a utility module for (de)serializing optional u128 as hex.
• Upgraded chrono and serde dependencies, updated struct and constructor signatures accordingly.
• Enhanced tests for new fields and proper serialization logic.

Affected Areas:
• rust/frontend/src/impls/service_based_frontend.rs
• rust/types/src/api_types.rs
• rust/types/src/u128.rs
• rust/tracing/src/meter_event.rs
• rust/python_bindings/src/bindings.rs
• rust/frontend/src/server.rs
• API request/response object definitions and tests
• Cargo.toml and dependencies
• Python and Rust test harnesses

Potential Impact:

Functionality: Execution time for read/write operations is now traceable on a per-request basis and included in MeterEvent telemetry, allowing for latency monitoring and eventual dashboards.

Performance: Negligible runtime overhead from capturing and storing request timestamps/durations per operation.

Security: No direct security impact; only tracking timing information.

Scalability: Minimal impact, but adds small metadata to every traced request. No major scalability concern.

Review Focus:
• Correctness of request timestamp propagation across all layers and interfaces.
• Accuracy and robustness of execution time calculation.
• Proper handling and serialization of optional u128 fields.
• Backward compatibility for older clients or omitted fields.
• Potential design improvements to avoid plumbing timestamps through all API layers.

Testing Needed

• Verify that all read/write operations (Add, Update, Upsert, Delete, Fork, Query, Get, Count) correctly record request_execution_time_ns in the tracing event.
• Ensure request start timestamp is accurately propagated from entrypoint to handler.
• Test new serde logic for optional u128 serialization/deserialization.
• Check that existing and new tests (including proptests) pass.
• Validate proper functioning across both REST and Python interfaces.

Code Quality Assessment

rust/python_bindings/src/bindings.rs: Python interface updated to pass through current Utc::now, with related plumbing.

rust/types/src/u128.rs: New module is clear, tested, and follows Rust/Serde conventions for (de)serializing optional u128 values as hex.

rust/tracing/src/meter_event.rs: Expanded for new field, correct serialization attributes, and new tests for serialization cases.

rust/frontend/src/impls/service_based_frontend.rs: Touches many methods to accept and forward timestamps; comments and reviewer feedback suggest design concerns but implementation is methodical.

rust/types/src/api_types.rs: All major API requests augmented for received_at_timestamp; constructors and validation extended accordingly.

Best Practices

Observability:
• Operation telemetry for latency tracing
• Structured event logging

Serialization:
• Interop via custom serde logic for optional u128+hex

Testing:
• Added/updated tests for new data flows and serialization correctness

Dependency Management:
• Synchronized chrono and serde versions across crates

Potential Issues

• Current design requires explicit timestamp argument/field in many places, which can be error-prone or brittle long-term.
• Some reviewers expressed concern about leaking internal timing details into public method signatures.
• Async context (lack of thread/task-local storage use) may pose future maintainability or ergonomic challenges.
• If a client fails to provide a timestamp, potential for incorrect execution time measurements.
• No fallback if system clock is incorrect or skewed (edge case).

This summary was automatically generated by @propel-code-bot

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants