[0009] RFC for centralized metrics storage and visualization. #28
Open · shimkiv wants to merge 1 commit into `main` from `0009-metrics`
# RFC for centralized metrics storage and visualization

## Summary

Proposing a standardized deployment stack for metrics storage and visualization using Docker Compose, Traefik, InfluxDB, and Grafana. This centralized system aims to serve as a one-stop solution for all teams in the company, ensuring uniformity, ease of access, and efficient management.
## Motivation

The growing number of teams and the diverse metrics generated across projects have created a need for a unified system where metrics can be stored, accessed, and visualized effectively. The proposed solution seeks to:

- Simplify the deployment process across different teams.
- Offer customization options tailored to specific requirements.
- Optimize for ease of use, security, and scalability.
## Detailed design

### System Components

- **Traefik**: Manages connectivity for all Docker containers.
- **InfluxDB**: Acts as the centralized database for metrics storage.
- **Grafana**: Offers visualization tools for the stored metrics data.
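To make the wiring between these components concrete, a minimal Docker Compose sketch follows. The hostnames, credentials, org, and bucket names are placeholders for illustration; the actual deployment descriptor linked in the Example section below is authoritative.

```yaml
version: "3.8"

services:
  traefik:
    image: traefik:v2.10
    command:
      - --providers.docker=true
      - --providers.docker.exposedbydefault=false
      - --entrypoints.websecure.address=:443
    ports:
      - "443:443"
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro

  influxdb:
    image: influxdb:2.7
    environment:
      DOCKER_INFLUXDB_INIT_MODE: setup
      DOCKER_INFLUXDB_INIT_USERNAME: admin
      DOCKER_INFLUXDB_INIT_PASSWORD: change-me          # placeholder
      DOCKER_INFLUXDB_INIT_ORG: example-org             # placeholder
      DOCKER_INFLUXDB_INIT_BUCKET: metrics              # placeholder
      DOCKER_INFLUXDB_INIT_ADMIN_TOKEN: change-me-token # placeholder
    volumes:
      - influxdb-data:/var/lib/influxdb2
    labels:
      - traefik.enable=true
      - traefik.http.routers.influxdb.rule=Host(`influxdb.example.com`)

  grafana:
    image: grafana/grafana:10.0.0
    environment:
      GF_SECURITY_ADMIN_PASSWORD: change-me # placeholder
    volumes:
      - grafana-data:/var/lib/grafana
    labels:
      - traefik.enable=true
      - traefik.http.routers.grafana.rule=Host(`grafana.example.com`)

volumes:
  influxdb-data:
  grafana-data:
```

Traefik discovers the other two services via their Docker labels, so teams only need DNS entries pointing at the Traefik host.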
#### Security implications

Traefik provides secure HTTP routing to the containers, and InfluxDB's token-based authentication strengthens data security. All components should be updated to their latest versions regularly to mitigate potential vulnerabilities.
#### Performance

This stack is known for handling a large volume of metrics data efficiently. In practice, throughput and query latency depend on series cardinality, retention settings, and the resources allocated to InfluxDB, so these should be sized per deployment.
#### Impact on other systems

This centralized solution will reduce the need for multiple deployments across teams, thereby reducing system overhead and potential conflicts.
### Edge Cases

- Teams might have specific metrics that aren't readily compatible with InfluxDB; these will need custom solutions.
- If one component fails, there's a risk of disruption across all teams. Regular maintenance and monitoring are crucial.
### Example

You can play with the reference implementation using the following [link](https://o1labs-grafana.p42.xyz/d/H-L-ytP4k/snarkyjs-simple-benchmarking?orgId=1&refresh=10s&from=now-6M&to=now).
Alternatively, you can deploy the stack locally using this [example deployment descriptor](https://github.com/o1-labs/e2e-tests/blob/ec611dde718c74210a6b8ab9ed9fdd4828141ceb/docker-compose.yaml#L1).
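To illustrate how a team would push a data point into the stack, here is a hedged sketch that encodes a metric in InfluxDB line protocol and posts it to the v2 HTTP write API. The host, org, bucket, token, and measurement names are placeholders, not part of this proposal.

```python
# Sketch: writing one metric to InfluxDB 2.x over its HTTP write API.
# The URL, org, bucket, and token below are placeholders for illustration only.
import urllib.request

def to_line_protocol(measurement: str, tags: dict, fields: dict, ts_ns: int) -> str:
    """Encode one data point in InfluxDB line protocol."""
    tag_part = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    field_part = ",".join(f"{k}={v}" for k, v in sorted(fields.items()))
    return f"{measurement},{tag_part} {field_part} {ts_ns}"

def write_point(line: str, host="https://influxdb.example.com",
                org="example-org", bucket="metrics", token="REPLACE_ME"):
    """POST a line-protocol record to the /api/v2/write endpoint."""
    url = f"{host}/api/v2/write?org={org}&bucket={bucket}&precision=ns"
    req = urllib.request.Request(
        url, data=line.encode(), method="POST",
        headers={"Authorization": f"Token {token}"})
    return urllib.request.urlopen(req)  # raises on non-2xx responses

line = to_line_protocol(
    "benchmark", {"team": "snarkyjs", "suite": "simple"},
    {"duration_ms": 128.5}, 1_700_000_000_000_000_000)
# → "benchmark,suite=simple,team=snarkyjs duration_ms=128.5 1700000000000000000"
```

A real client would also escape spaces and commas in tag values and batch multiple lines per request; the official `influxdb-client` library handles both.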
## Test plan and functional requirements

- **Testing goals and objectives**: Ensure the centralized system is robust, secure, and accessible by all teams.
- **Testing approach**: Conduct unit tests, integration tests, and performance tests using simulated metrics data.
- **Testing scope**: Focus on data storage, retrieval, visualization, and security features.
- **Testing requirements**: Ensure compatibility with diverse team metrics and compliance with the company's data security standards.
- **Testing resources**: A sandboxed environment resembling the production setup, dummy metrics data, and testing tools.
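As one illustration of the "simulated metrics data" mentioned above, a deterministic generator along these lines could feed the integration and performance tests. The measurement names, tags, and value ranges are made up for the sketch and not prescribed by this RFC.

```python
# Sketch: a repeatable simulated-metrics generator for integration tests.
# Measurement names, tags, and value ranges are illustrative only.
import random
import time

def simulate_points(n: int, seed: int = 42) -> list[dict]:
    """Produce n synthetic data points spaced one second apart."""
    rng = random.Random(seed)            # fixed seed for repeatable test runs
    start_ns = int(time.time()) * 10**9  # nanosecond timestamps, as InfluxDB expects
    return [
        {
            "measurement": "benchmark",
            "tags": {"team": rng.choice(["core", "sdk", "infra"])},
            "fields": {"duration_ms": round(rng.uniform(10, 500), 2)},
            "time_ns": start_ns + i * 10**9,
        }
        for i in range(n)
    ]

points = simulate_points(1000)
assert len(points) == 1000
assert all(10 <= p["fields"]["duration_ms"] <= 500 for p in points)
```

A performance test would write such batches to the sandbox InfluxDB instance and then verify that Grafana queries over the same range return the expected point counts.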
## Drawbacks

A single deployment can be a single point of failure if not managed and monitored effectively. Custom needs of specific teams might also require additional configurations or workarounds.
## Rationale and alternatives

The proposed design provides a seamless experience for all teams and centralizes metrics management. Alternatives might include using different tools or separate deployments per team, but these can increase overhead and reduce uniformity. Not adopting this can lead to fragmented metrics storage, increased costs, and potential data silos.
## Prior art

Many organizations adopt centralized metrics systems for their teams to ensure streamlined processes. Such systems, using tools like Prometheus or the ELK stack, have proven efficient in managing metrics at scale. Our proposal combines the best features of these systems while optimizing for our company's unique needs.
## Unresolved questions

- How will the transition process for teams currently using other systems be managed?
- How will we handle potential system expansions or the addition of new tools in the future?
- Like any other tool, this one will need support and maintenance; these responsibilities need to be formally assigned.
Why InfluxDB over the storage built into Prometheus, which we're already using in the protocol code? Should we instead try to use Prometheus in other services beyond the protocol code?
Prometheus and InfluxDB are both highly respected in the monitoring and metrics world, and each has its strengths and use cases. However, for our specific requirements, InfluxDB in tandem with Grafana is more suitable. Here is why I think so:

**Purpose and design**

*Prometheus*

- Primarily a monitoring and alerting toolkit; its main strength is the collection and real-time processing of metrics.
- Designed for reliability and can operate with a minimal setup.
- Its pull-based model is optimized for service discovery and runtime monitoring, scraping metrics from predefined endpoints.
- An excellent choice for capturing short-term metrics, with a powerful query language (PromQL).

*InfluxDB*

- A purpose-built time series database (TSDB); its primary strength is storing, retrieving, and performing operations on time series data.
- Optimized for high write loads and storage efficiency.
- Supports long-term storage and can handle vast amounts of time series data without a hiccup.
- InfluxQL and Flux provide comprehensive querying capabilities tailored for time-based datasets.

**Data storage and retention**

*Prometheus*

- While Prometheus can retain data for longer periods, that is not its primary use case; over extended periods this can lead to challenges in storage management and efficiency.
- More suited to shorter retention periods (typically hours to weeks).

*InfluxDB*

- Designed for efficient storage and querying of time series data over long periods, matching our requirement for long-term storage.
- Provides more flexibility and efficiency in data retention policies, down-sampling, and data lifecycle management.
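To make the down-sampling point concrete, here is a toy sketch of the reduction an InfluxDB task would perform automatically under a retention/lifecycle policy: raw points are collapsed into windowed aggregates. Pure Python, no InfluxDB required; the window size and data are illustrative.

```python
# Toy illustration of down-sampling: collapse raw per-second points into
# one-hour averages, the kind of reduction a database lifecycle task performs.
from collections import defaultdict

def downsample_hourly(points: list[tuple[int, float]]) -> dict[int, float]:
    """points: (unix_seconds, value) pairs -> {hour_start_seconds: mean value}."""
    buckets: dict[int, list[float]] = defaultdict(list)
    for ts, value in points:
        buckets[ts - ts % 3600].append(value)  # snap timestamp to hour boundary
    return {hour: sum(vs) / len(vs) for hour, vs in buckets.items()}

raw = [(0, 1.0), (10, 3.0), (3600, 5.0)]  # two points in hour 0, one in hour 1
print(downsample_hourly(raw))             # {0: 2.0, 3600: 5.0}
```

Storing only the aggregates for older data is what keeps long-retention storage tractable.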
**Integration with Grafana**

Both Prometheus and InfluxDB integrate seamlessly with Grafana, but with InfluxDB you get the benefit of a database built for the specific needs of time series visualization.

**Scalability**

*Prometheus*

- Scaling Prometheus involves federation, where multiple Prometheus servers scrape targets.
- For long-term storage, you might integrate it with external solutions, adding complexity.

*InfluxDB*

- InfluxDB offers clustering for horizontal scalability and high availability (in its enterprise and cloud offerings), making it easier to scale as the dataset grows.

**In conclusion**

Prometheus excels in real-time metrics collection and alerting, especially when you need to actively monitor and respond to system behavior. However, for our needs, which revolve around storing metrics for extended periods and analyzing historical data, InfluxDB is a better fit. Paired with Grafana, it provides a powerful, scalable, and efficient solution for long-term metrics storage and visualization.