[0009] RFC for centralized metrics storage and visualization. #28

69 changes: 69 additions & 0 deletions 0009-metrics-storage-and-visualization.md
@@ -0,0 +1,69 @@
# RFC for centralized metrics storage and visualization

## Summary

This RFC proposes a standardized deployment stack for metrics storage and visualization using Docker Compose, Traefik, InfluxDB, and Grafana. The centralized system is intended as a one-stop solution for all teams in the company, providing uniformity, ease of access, and efficient management.

## Motivation

The growing number of teams and the diversity of metrics generated across projects have created a need for a unified system where metrics can be stored, accessed, and visualized effectively. The proposed solution seeks to:

- Simplify the deployment process across different teams.
- Offer customization options tailored to specific requirements.
- Optimize for ease of use, security, and scalability.

## Detailed design

### System Components

- **Traefik**: Reverse proxy that routes incoming traffic to the Docker containers.
- **InfluxDB**: Acts as the centralized database for metrics storage.

Member:

Why InfluxDB rather than the storage built into Prometheus, which we're already using in the protocol code? Should we instead try to use Prometheus in other services beyond the protocol code?

Member Author:

Prometheus and InfluxDB are both highly respected in the monitoring and metrics world. Each tool has its strengths and use cases. However, for our specific requirements, InfluxDB in tandem with Grafana is more suitable. Here is why, in my view:

Purpose and Design

Prometheus

Prometheus is primarily a monitoring and alerting toolkit. Its main strength is in the collection and real-time processing of metrics.
It is designed for reliability and can operate with a minimal setup.
Its pull-based model is optimized for service discovery and runtime monitoring, scraping metrics from predefined endpoints.
It's an excellent choice for capturing short-term metrics and provides a powerful querying language (PromQL).

InfluxDB

InfluxDB is a purpose-built Time Series Database (TSDB). Its primary strength is in storing, retrieving, and performing operations on time series data.
It is optimized for high write loads and storage efficiency.
It supports long-term storage and can handle vast amounts of time series data without a hiccup.
InfluxQL and Flux provide comprehensive querying capabilities tailored for time-based datasets.

Data Storage and Retention

Prometheus

While Prometheus can retain data for longer periods, it's not its primary use case. Over extended periods, this can lead to challenges in storage management and efficiency.
It's more suited for shorter retention periods (typically hours to weeks).

InfluxDB

Designed for efficient storage and querying of time series data over long periods, making it suitable for our requirement of long-term storage.
Provides more flexibility and efficiency in data retention policies, down-sampling, and data lifecycle management.

Integration with Grafana

Both Prometheus and InfluxDB seamlessly integrate with Grafana, but when using InfluxDB, you get the benefit of a database built for the specific needs of time series visualization.

Scalability

Prometheus

Scaling Prometheus involves federation, where you have multiple Prometheus servers scraping targets.
For long-term storage, you might integrate it with external solutions, adding complexity.

InfluxDB

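InfluxDB offers clustering for horizontal scalability and high availability in its commercial editions (InfluxDB Enterprise and InfluxDB Cloud); the open-source edition runs as a single node, so scaling it means scaling that node vertically or moving to a commercial offering.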

In conclusion

Prometheus excels in real-time metrics collection and alerting, especially when you need to actively monitor and respond to system behaviors. However, for our needs – which revolve around storing metrics for extended periods and analyzing historical data – InfluxDB is a better fit. Paired with Grafana, it provides a powerful, scalable, and efficient solution for long-term metrics storage and visualization.

- **Grafana**: Offers visualization tools for the stored metrics data.
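
As a rough illustration of what a team would send to the centralized InfluxDB instance, the sketch below formats a single data point in InfluxDB line protocol. The measurement, tag, and field names are hypothetical and not taken from the reference deployment, and the helper functions are illustrative rather than part of any InfluxDB client library.

```python
from typing import Mapping, Union

FieldValue = Union[bool, int, float, str]


def _escape(text: str, chars: str) -> str:
    """Backslash-escape each character of `chars` that occurs in `text`."""
    for ch in chars:
        text = text.replace(ch, "\\" + ch)
    return text


def _format_field(value: FieldValue) -> str:
    """Render a field value using line-protocol type conventions."""
    # bool must be checked before int, because bool is a subclass of int.
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"  # integers carry an "i" suffix
    if isinstance(value, float):
        return repr(value)
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'  # strings are double-quoted


def to_line_protocol(
    measurement: str,
    tags: Mapping[str, str],
    fields: Mapping[str, FieldValue],
    timestamp_ns: int,
) -> str:
    """Build one line: measurement,tag=value field=value timestamp."""
    if not fields:
        raise ValueError("at least one field is required")
    head = _escape(measurement, ", ")
    for key in sorted(tags):  # sorted tag keys are recommended for write performance
        head += f",{_escape(key, ',= ')}={_escape(tags[key], ',= ')}"
    body = ",".join(
        f"{_escape(key, ',= ')}={_format_field(value)}" for key, value in fields.items()
    )
    return f"{head} {body} {timestamp_ns}"


# Hypothetical benchmark measurement from a CI host.
print(
    to_line_protocol(
        "benchmark",
        {"team": "sdk", "host": "ci 1"},
        {"duration_ms": 12.5, "runs": 3},
        1700000000000000000,
    )
)
# prints: benchmark,host=ci\ 1,team=sdk duration_ms=12.5,runs=3i 1700000000000000000
```

Tags are indexed and suit low-cardinality labels such as the team name, while fields hold the measured values; agreeing on a shared tag scheme across teams is what would make cross-team dashboards in Grafana practical.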

#### Security implications

Traefik provides a single entry point for HTTP routing to the containers and can terminate TLS, while InfluxDB's token-based authentication restricts who can read and write data. All components should be updated regularly to their latest versions to mitigate known vulnerabilities.
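
To make the token-based authentication concrete, here is a minimal sketch of how a client could build an authenticated write request, assuming the deployment runs InfluxDB 2.x and exposes its `/api/v2/write` endpoint through Traefik. The base URL, organization, bucket, and environment variable name are hypothetical placeholders. The request is only constructed here, not sent.

```python
import os
from typing import Iterable
from urllib.parse import urlencode
from urllib.request import Request


def build_write_request(
    base_url: str, org: str, bucket: str, token: str, lines: Iterable[str]
) -> Request:
    """Build (but do not send) an InfluxDB 2.x write request."""
    query = urlencode({"org": org, "bucket": bucket, "precision": "ns"})
    return Request(
        url=f"{base_url.rstrip('/')}/api/v2/write?{query}",
        data="\n".join(lines).encode("utf-8"),  # one line-protocol record per line
        method="POST",
        headers={
            # InfluxDB 2.x expects the API token in this header format.
            "Authorization": f"Token {token}",
            "Content-Type": "text/plain; charset=utf-8",
        },
    )


# Hypothetical values; the token is read from the environment, never hard-coded.
request = build_write_request(
    base_url="https://influxdb.example.internal",
    org="example-org",
    bucket="team-metrics",
    token=os.environ.get("INFLUXDB_TOKEN", "placeholder-token"),
    lines=["benchmark,team=sdk duration_ms=12.5 1700000000000000000"],
)
print(request.get_method(), request.full_url)
```

Issuing one token per team, scoped to that team's bucket, would limit the impact of a leaked credential; whether the deployment should be organized that way is a design decision this sketch does not settle.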

#### Performance

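Each component of this stack is widely used in production for high-volume metrics workloads. The throughput and latency it achieves for our own workloads still need to be confirmed by the performance tests described in the test plan below.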

#### Impact on other systems

This centralized solution will reduce the need for multiple deployments across various teams, thereby reducing system overheads and potential conflicts.

### Edge Cases

- Teams might have specific metrics that aren't readily compatible with InfluxDB; these will need custom solutions.
- If one component fails, there's a risk of disruption across all teams. Regular maintenance and monitoring are crucial.
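
The monitoring mentioned in the last point could start with something as small as the following sketch, which polls the health endpoints of the components and reports which ones respond. The host names are hypothetical Docker Compose service names; `/health` (InfluxDB) and `/api/health` (Grafana) are the endpoints those tools expose on their default ports. Traefik is left out because its ping endpoint has to be enabled explicitly.

```python
from typing import Callable, Dict, Mapping
from urllib.request import urlopen

# Hypothetical service names and default ports; adjust to the actual deployment.
HEALTH_ENDPOINTS = {
    "influxdb": "http://influxdb:8086/health",
    "grafana": "http://grafana:3000/api/health",
}


def http_status(url: str, timeout: float = 3.0) -> int:
    """Return the HTTP status code for a GET request to `url`."""
    with urlopen(url, timeout=timeout) as response:
        return response.status


def check_stack(
    endpoints: Mapping[str, str],
    fetch_status: Callable[[str], int] = http_status,
) -> Dict[str, bool]:
    """Map each component name to True if its health endpoint returned 200."""
    results = {}
    for name, url in endpoints.items():
        try:
            results[name] = fetch_status(url) == 200
        except OSError:
            # URLError, HTTPError, timeouts and refused connections are all OSError subclasses.
            results[name] = False
    return results


if __name__ == "__main__":
    for component, healthy in check_stack(HEALTH_ENDPOINTS).items():
        print(f"{component}: {'ok' if healthy else 'UNREACHABLE'}")
```

Because `fetch_status` is injectable, the check can be exercised in tests without a running stack. A production setup would more likely rely on container health checks or an external uptime monitor than on a script like this.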

### Example

You can try the reference implementation at the following [link](https://o1labs-grafana.p42.xyz/d/H-L-ytP4k/snarkyjs-simple-benchmarking?orgId=1&refresh=10s&from=now-6M&to=now).
Alternatively, you can deploy the stack locally using the following [example deployment descriptor](https://github.com/o1-labs/e2e-tests/blob/ec611dde718c74210a6b8ab9ed9fdd4828141ceb/docker-compose.yaml#L1).

## Test plan and functional requirements

- **Testing goals and objectives**: Ensure the centralized system is robust, secure, and accessible by all teams.
- **Testing approach**: Conduct unit tests, integration tests, and performance tests using simulated metrics data.
- **Testing scope**: Focus on data storage, retrieval, visualization, and security features.
- **Testing requirements**: Ensure compatibility with diverse team metrics and compliance with the company's data security standards.
- **Testing resources**: A sandboxed environment resembling the production setup, dummy metrics data, and testing tools.
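
The simulated metrics data mentioned above could be produced by a small deterministic generator such as the sketch below. The measurement name, team labels, and value range are hypothetical and are meant only to give the sandbox something reproducible to ingest.

```python
import random
from typing import Dict, Iterator, Union

Point = Dict[str, Union[str, float, int]]


def simulated_points(
    count: int, start_ns: int, step_ns: int, seed: int = 0
) -> Iterator[Point]:
    """Yield `count` reproducible dummy data points, spaced `step_ns` apart."""
    rng = random.Random(seed)  # a fixed seed keeps test runs comparable
    for index in range(count):
        yield {
            "measurement": "benchmark",
            "team": rng.choice(["team-a", "team-b"]),
            "duration_ms": round(rng.uniform(5.0, 50.0), 3),
            "timestamp_ns": start_ns + index * step_ns,
        }


# One point per second, starting at a hypothetical epoch timestamp.
for point in simulated_points(count=3, start_ns=1_700_000_000_000_000_000, step_ns=1_000_000_000):
    print(point)
```

Seeding the generator means a failed performance or integration run can be replayed with exactly the same input.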

## Drawbacks

Having a single deployment can be a single point of failure if not managed and monitored effectively. Custom needs of specific teams might also require additional configurations or workarounds.

## Rationale and alternatives

The proposed design provides a seamless experience for all teams and centralizes metrics management. Alternatives might include using different tools or separate deployments for teams, but these can increase overheads and reduce uniformity. Not adopting this can lead to fragmented metrics storage, increased costs, and potential data silos.

## Prior art

Many organizations adopt centralized metrics systems for their teams to ensure streamlined processes. Such systems, using tools like Prometheus or the ELK stack, have proven efficient in managing metrics at scale. Our proposal combines the best features of these systems while optimizing for our company's unique needs.

## Unresolved questions

- How will the transition process for teams currently using other systems be managed?
- How will we handle potential system expansions or the addition of new tools in the future?
- Like any other tool, this one will need support and maintenance. Who will be formally responsible for it?