[0009] RFC for centralized metrics storage and visualization. #28
### System Components

- **Traefik**: Manages connectivity for all Docker containers.
- **InfluxDB**: Acts as the centralized database for metrics storage.
Why InfluxDB instead of the storage built into Prometheus, which we're already using in the protocol code? Should we instead try to use Prometheus in other services beyond the protocol code?
Prometheus and InfluxDB are both widely used for monitoring and metrics, and each has its own strengths and use cases. For our specific requirements, however, I think InfluxDB paired with Grafana is the better fit. Here is why:
Purpose and Design
Prometheus
Prometheus is primarily a monitoring and alerting toolkit. Its main strength is in the collection and real-time processing of metrics.
It is designed for reliability and can operate with a minimal setup.
Its pull-based model is optimized for service discovery and runtime monitoring, scraping metrics from predefined endpoints.
It's an excellent choice for capturing short-term metrics and provides a powerful querying language (PromQL).
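The pull model can be made concrete with a minimal, standard-library-only sketch: a scrape target renders its current values in the Prometheus text exposition format and serves them over HTTP at /metrics, and the Prometheus server fetches that page on its own schedule. The metric names, labels and the handler are hypothetical; a real service would more likely use an official client library such as prometheus_client than format the text by hand.

```python
from http.server import BaseHTTPRequestHandler


def _escape_help(text: str) -> str:
    # HELP text: escape backslashes and line feeds.
    return text.replace("\\", "\\\\").replace("\n", "\\n")


def _escape_label_value(value: str) -> str:
    # Label values: escape backslashes, double quotes and line feeds.
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_gauge(name, help_text, samples):
    """Render one gauge family in the Prometheus text exposition format.

    `samples` is a list of (labels_dict, value) pairs.
    """
    lines = [f"# HELP {name} {_escape_help(help_text)}", f"# TYPE {name} gauge"]
    for labels, value in samples:
        if labels:
            pairs = ",".join(
                f'{key}="{_escape_label_value(val)}"' for key, val in sorted(labels.items())
            )
            lines.append(f"{name}{{{pairs}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Serves /metrics so that a Prometheus server can scrape (pull) it."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        # Hypothetical metric; a real service would read its own state here.
        body = render_gauge(
            "example_queue_depth", "Items waiting in the queue.", [({"queue": "default"}, 3)]
        ).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
```

To try it, pass MetricsHandler to http.server.HTTPServer and add the chosen host and port as a target in a Prometheus scrape configuration; the service never pushes anything itself.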
InfluxDB
InfluxDB is a purpose-built Time Series Database (TSDB). Its primary strength is in storing, retrieving, and performing operations on time series data.
It is optimized for high write loads and storage efficiency.
It supports long-term storage and can handle large volumes of time series data.
InfluxQL and Flux provide comprehensive querying capabilities tailored for time-based datasets.
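To make InfluxDB's data model concrete, here is a minimal sketch of its write format, line protocol: each point is one text line holding a measurement, optional tags, at least one field and an optional timestamp (nanoseconds by default). The function below encodes a single point by hand; the names used in the test are hypothetical, and it skips edge cases that an official client library would handle.

```python
def _escape_measurement(name: str) -> str:
    # Measurement names: escape commas and spaces.
    return name.replace(",", "\\,").replace(" ", "\\ ")


def _escape_key(text: str) -> str:
    # Tag keys, tag values and field keys: escape commas, equals signs and spaces.
    return text.replace(",", "\\,").replace("=", "\\=").replace(" ", "\\ ")


def _format_field(value) -> str:
    # bool must be checked before int because bool is a subclass of int.
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"  # integer fields carry an "i" suffix
    if isinstance(value, float):
        return repr(value)
    if isinstance(value, str):
        # String field values are double-quoted; escape backslashes and quotes.
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        return f'"{escaped}"'
    raise TypeError(f"unsupported field type: {type(value).__name__}")


def to_line_protocol(measurement, tags, fields, timestamp_ns=None):
    """Encode one point as:
    measurement[,tag=value...] field=value[,field=value...] [timestamp]
    """
    if not fields:
        raise ValueError("line protocol requires at least one field")
    head = _escape_measurement(measurement)
    # Tags are sorted by key, which InfluxDB recommends for write performance.
    for key in sorted(tags):
        head += f",{_escape_key(key)}={_escape_key(tags[key])}"
    body = ",".join(
        f"{_escape_key(key)}={_format_field(val)}" for key, val in sorted(fields.items())
    )
    line = f"{head} {body}"
    if timestamp_ns is not None:
        line += f" {timestamp_ns}"
    return line
```

Tags are indexed and meant for values you filter or group by (for example, a node or scenario name), while fields hold the measured values themselves.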
Data Storage and Retention
Prometheus
While Prometheus can retain data for longer periods, that is not its primary use case, and over extended periods it can lead to challenges in storage management and efficiency.
It is better suited to shorter retention periods (typically hours to weeks; by default its local storage keeps 15 days of data).
InfluxDB
Designed for efficient storage and querying of time series data over long periods, making it suitable for our requirement of long-term storage.
Provides more flexibility and efficiency in data retention policies, down-sampling, and data lifecycle management.
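Down-sampling amounts to replacing raw points with one aggregate per time window. In InfluxDB 2.x this would typically run server-side as a scheduled task using Flux's aggregateWindow() function. The standard-library sketch below only illustrates the computation itself: a mean per fixed window, aligned to the Unix epoch and labeled by window start (the labeling convention is this sketch's choice), applied to hypothetical input data.

```python
from collections import defaultdict
from statistics import fmean


def downsample_mean(points, window_s):
    """Average (timestamp_s, value) points into fixed windows of `window_s` seconds.

    Windows are aligned to the Unix epoch and labeled by their start time.
    Returns a list of (window_start_s, mean_value) sorted by window start.
    """
    if window_s <= 0:
        raise ValueError("window_s must be positive")
    buckets = defaultdict(list)
    for ts, value in points:
        # Map each timestamp to the start of the window that contains it.
        buckets[ts - ts % window_s].append(value)
    return [(start, fmean(values)) for start, values in sorted(buckets.items())]
```

A common pattern is to keep raw points in a bucket with a short retention period and write the down-sampled series to a second bucket with a longer one.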
Integration with Grafana
Both Prometheus and InfluxDB integrate seamlessly with Grafana as built-in data sources. With InfluxDB, however, the database behind the dashboards is one built specifically for storing and querying time series over long ranges.
Scalability
Prometheus
Scaling Prometheus involves federation, where a higher-level Prometheus server scrapes selected time series from other Prometheus servers.
For long-term storage, you typically integrate it with external systems through remote write, which adds complexity.
InfluxDB
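InfluxDB offers clustering for horizontal scalability and high availability, making it easier to scale as your dataset grows. Note, though, that clustering is not part of the open-source edition; it is provided by InfluxDB Enterprise and InfluxDB Cloud.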
In conclusion
Prometheus excels at real-time metrics collection and alerting, especially when you need to actively monitor and respond to system behavior. Our needs, however, revolve around storing metrics for extended periods and analyzing historical data, and for that InfluxDB is a better fit. Paired with Grafana, it provides a powerful, scalable, and efficient solution for long-term metrics storage and visualization.
Can you add a section on developer experience? I think it's important to have a concrete idea of how easy/hard it will be to integrate with this system for our engineers, and it would be good to re-analyse the alternatives with that aspect in mind too.
I've updated the PR description so that we're all on the same page about the scope of this RFC.
It covers storage and visualization of results from various experiments and benchmarks.
It does not cover changes to, for example, how the existing Mina Daemon exposes metrics.
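To give the developer-experience question a concrete shape for benchmark results, the sketch below shows roughly what an integration could look like if results were pushed directly to the InfluxDB 2.x HTTP write endpoint (POST /api/v2/write with org, bucket and precision query parameters and a Token authorization header). The base URL, organization, bucket and token are hypothetical placeholders, the lines are assumed to already be in line protocol, and in practice an official InfluxDB client library would likely be preferred over raw HTTP.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_write_request(base_url, org, bucket, token, lines, precision="ns"):
    """Build (but do not send) a write request carrying line-protocol `lines`."""
    query = urlencode({"org": org, "bucket": bucket, "precision": precision})
    body = "\n".join(lines).encode("utf-8")
    return Request(
        f"{base_url.rstrip('/')}/api/v2/write?{query}",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Token {token}",
            "Content-Type": "text/plain; charset=utf-8",
        },
    )


def push_results(request, timeout_s=10):
    """Send the request; InfluxDB answers a successful write with 204 No Content."""
    with urlopen(request, timeout=timeout_s) as response:
        return response.status
```

Keeping request construction separate from sending makes the payload easy to inspect in tests or CI runs without a live InfluxDB instance.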