o1-labs · shimkiv · Sep 18, 2023 · bkase · Sep 18, 2023 · shimkiv
@@ -0,0 +1,69 @@
+# RFC for centralized metrics storage and visualization
+
+## Summary
+
+This RFC proposes a standardized deployment stack for metrics storage and visualization built on Docker Compose, Traefik, InfluxDB, and Grafana. The centralized system is intended as a one-stop solution for all teams in the company, providing uniformity, ease of access, and efficient management.
+
+## Motivation
+
+The growing number of teams and diverse metrics generated across various projects has created a need for a unified system where metrics can be stored, accessed, and visualized effectively. This proposed solution seeks to:
+
+- Simplify the deployment process across different teams.
+- Offer customization options tailored to specific requirements.
+- Optimize for ease of use, security, and scalability.
+
+## Detailed design
+
+### System Components
+
+- **Traefik**: Manages connectivity for all Docker containers.
+- **InfluxDB**: Acts as the centralized database for metrics storage.
+- **Grafana**: Offers visualization tools for the stored metrics data.
+
+#### Security implications
+
+Traefik gives the stack a single HTTP entry point that routes requests to the containers, and InfluxDB's token-based authentication restricts who can read and write data. All components should be updated regularly to their latest versions to mitigate potential vulnerabilities.
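
As a minimal sketch of what token-based authentication looks like from a client, the Python snippet below builds (without sending) a write request. It assumes the InfluxDB 2.x HTTP API (`/api/v2/write` with an `Authorization: Token ...` header); the RFC does not state the InfluxDB version, and the host, organization, bucket, and token values are hypothetical placeholders.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_write_request(base_url: str, org: str, bucket: str, token: str, body: str) -> Request:
    """Build, but do not send, an InfluxDB 2.x write request (assumed API version)."""
    # Target organization, bucket, and timestamp precision go in the query string.
    query = urlencode({"org": org, "bucket": bucket, "precision": "ns"})
    return Request(
        url=f"{base_url.rstrip('/')}/api/v2/write?{query}",
        data=body.encode("utf-8"),  # request body is line protocol text
        method="POST",
        headers={
            # The token is sent on every request; requests without it are rejected.
            "Authorization": f"Token {token}",
            "Content-Type": "text/plain; charset=utf-8",
        },
    )


# Hypothetical values, for illustration only.
request = build_write_request(
    "https://influx.example.test/", "my-org", "benchmarks", "s3cret", "m v=1i"
)
print(request.full_url)
```

In a real deployment the token would come from a secret store or an environment variable rather than from source code.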
+
+#### Performance
+
+The stack is expected to handle a large volume of metrics data without notable lag or downtime. This expectation should be confirmed by the performance tests described in the test plan.
+
+#### Impact on other systems
+
+This centralized solution reduces the need for separate deployments by individual teams, which in turn reduces system overhead and potential conflicts.
+
+### Edge Cases
+
+- Teams might have specific metrics that aren't readily compatible with InfluxDB; these will need custom solutions.
+- If one component fails, there's a risk of disruption across all teams. Regular maintenance and monitoring are crucial.
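
For metrics that do not map directly onto InfluxDB, one possible custom solution is a small adapter that converts a team's records into InfluxDB line protocol. The sketch below is illustrative only: the function names are hypothetical, and it covers the common escaping and type rules rather than every corner of the format (for example, it does not reject NaN or infinite floats).

```python
from typing import Mapping, Optional, Union

FieldValue = Union[bool, int, float, str]


def _escape(value: str, chars: str) -> str:
    """Backslash-escape each character in `chars` that occurs in `value`."""
    for ch in chars:
        value = value.replace(ch, "\\" + ch)
    return value


def _format_field(value: FieldValue) -> str:
    if isinstance(value, bool):  # must be checked before int: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"  # integers carry an "i" suffix
    if isinstance(value, float):
        return repr(value)
    # Anything else is written as a quoted string; escape backslashes first.
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_line_protocol(
    measurement: str,
    tags: Mapping[str, str],
    fields: Mapping[str, FieldValue],
    timestamp_ns: Optional[int] = None,
) -> str:
    """Encode one point as a single line of line protocol."""
    if not fields:
        raise ValueError("line protocol requires at least one field")
    head = _escape(measurement, ", ")
    for key in sorted(tags):  # sorted tag keys give a stable, predictable output
        head += f",{_escape(key, ',= ')}={_escape(str(tags[key]), ',= ')}"
    body = ",".join(f"{_escape(k, ',= ')}={_format_field(v)}" for k, v in fields.items())
    line = f"{head} {body}"
    return line if timestamp_ns is None else f"{line} {timestamp_ns}"


print(to_line_protocol(
    "proof time",
    {"team": "snarkyjs", "env": "ci"},
    {"seconds": 12.5, "runs": 3, "ok": True},
    1695000000000000000,
))
# prints: proof\ time,env=ci,team=snarkyjs seconds=12.5,runs=3i,ok=true 1695000000000000000
```

Keeping the conversion in one small adapter per team means the central stack itself does not need to change when a new metric shape appears.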
+
+### Example
+
+You can try the reference implementation at the following [link](https://o1labs-grafana.p42.xyz/d/H-L-ytP4k/snarkyjs-simple-benchmarking?orgId=1&refresh=10s&from=now-6M&to=now).  
+Alternatively, you can deploy the stack locally using the following [example deployment descriptor](https://github.com/o1-labs/e2e-tests/blob/ec611dde718c74210a6b8ab9ed9fdd4828141ceb/docker-compose.yaml#L1).
+
+## Test plan and functional requirements
+
+- **Testing goals and objectives**: Ensure the centralized system is robust, secure, and accessible by all teams.
+- **Testing approach**: Conduct unit tests, integration tests, and performance tests using simulated metrics data.
+- **Testing scope**: Focus on data storage, retrieval, visualization, and security features.
+- **Testing requirements**: Ensure compatibility with diverse team metrics and compliance with the company's data security standards.
+- **Testing resources**: A sandboxed environment resembling the production setup, dummy metrics data, and testing tools.
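
The simulated metrics data mentioned above could be produced by a small seeded generator, so that test runs are reproducible. This is a sketch under assumptions: the measurement name, team names, field name, and value ranges are all made up for illustration and are not part of the proposal.

```python
import random
from typing import Any, Dict, Iterator


def simulated_points(
    count: int,
    seed: int = 42,
    start_ns: int = 1_695_000_000_000_000_000,
    step_ns: int = 10_000_000_000,  # 10 seconds between points
) -> Iterator[Dict[str, Any]]:
    """Yield `count` dummy metric points; the same seed always yields the same data."""
    rng = random.Random(seed)  # local generator, does not touch global random state
    for i in range(count):
        yield {
            "measurement": "benchmark",  # hypothetical measurement name
            "tags": {"team": rng.choice(["team-a", "team-b"])},
            "fields": {"duration_ms": round(rng.uniform(50.0, 500.0), 3)},
            "timestamp_ns": start_ns + i * step_ns,
        }


for point in simulated_points(3):
    print(point)
```

Because the generator is seeded, integration and performance tests can compare what was written against what is read back from storage.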
+
+## Drawbacks
+
+A single deployment can become a single point of failure if it is not managed and monitored effectively. The custom needs of specific teams might also require additional configuration or workarounds.
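
One way to reduce the single-point-of-failure risk is a periodic health probe across all three components. The sketch below assumes the health endpoints each tool provides (`/ping` for Traefik when its ping feature is enabled, `/health` for InfluxDB, `/api/health` for Grafana); the hostnames and ports are hypothetical and would depend on the actual deployment.

```python
from typing import Callable, Dict, Mapping
from urllib.request import urlopen

# Hypothetical internal addresses; adjust to the real deployment.
HEALTH_URLS = {
    "traefik": "http://traefik:8080/ping",
    "influxdb": "http://influxdb:8086/health",
    "grafana": "http://grafana:3000/api/health",
}


def http_status(url: str, timeout: float = 3.0) -> int:
    """Return the HTTP status code of a GET request to `url`."""
    with urlopen(url, timeout=timeout) as response:
        return response.status


def check_stack(
    urls: Mapping[str, str],
    fetch: Callable[[str], int] = http_status,
) -> Dict[str, bool]:
    """Map each component name to True if its health endpoint answered 200."""
    results = {}
    for name, url in urls.items():
        try:
            results[name] = fetch(url) == 200
        except OSError:  # covers connection errors, timeouts, and HTTP error responses
            results[name] = False
    return results
```

The `fetch` parameter exists so the check can be exercised in tests without a network; in production the default `http_status` would be used and any `False` result would trigger an alert.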
+
+## Rationale and alternatives
+
+The proposed design provides a seamless experience for all teams and centralizes metrics management. Alternatives include using different tools or separate deployments for each team, but these can increase overhead and reduce uniformity. Not adopting this proposal can lead to fragmented metrics storage, increased costs, and potential data silos.
+
+## Prior art
+
+Many organizations adopt centralized metrics systems for their teams to ensure streamlined processes. Such systems, using tools like Prometheus or the ELK stack, have proven efficient in managing metrics at scale. Our proposal combines the best features of these systems while optimizing for our company's unique needs.
+
+## Unresolved questions
+
+- How will the transition process for teams currently using other systems be managed?
+- How will we handle potential system expansions or the addition of new tools in the future?
+- Who will own support and maintenance? Like any other tool, this stack will need both, so responsibilities must be formally assigned.