Request: Enhance Monitoring for User-specific CPU and Disk Usage #187

Open
bendichter opened this issue Aug 6, 2024 · 1 comment
@bendichter
Member

Description:

We need a functional system for monitoring usage and cost by user, ideally with a no-code dashboard. This feature would empower us to manage resource allocation and open up registrations to new users more confidently.

Requirements:

  1. CPU and Disk Usage Monitoring by User:

    • Monitor disk usage and CPU usage (or relevant cost factors) for individual users.
    • While disk usage can be monitored with du checks, we need a way to generate reports over time, not just at an instant.
  2. Reporting and Analytics:

    • Provide reports on server options used and duration by user.
    • Create a system to monitor incremental and shared costs: report the incremental cost of a node to the user who created it, and split the shared costs equally among that node's users.
  3. Dashboard:

    • Develop a no-code dashboard to visualize usage and cost data.
    • Include functionality to pre-set usage limits for users from this dashboard.
  4. Integration and Metrics:

    • Integrate with Grafana and Prometheus for improved metrics collection from AWS and other cloud vendors.
    • Ensure the system can handle cost anomaly detection.
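
To make the "reports over time, not just at an instant" point concrete, here is a minimal sketch of a periodic snapshot that could be run from cron or a scheduled job. It assumes one home directory per user under a common root; the function names, the CSV layout, and the log path are illustrative and not part of any existing tooling here. It sums apparent file sizes, so the numbers will differ somewhat from du, which counts allocated blocks.

```python
import csv
import os
import time
from pathlib import Path
from typing import List, Optional, Tuple


def dir_size_bytes(root: Path) -> int:
    """Sum apparent file sizes under root (symlinks are not followed)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                total += os.lstat(os.path.join(dirpath, name)).st_size
            except OSError:
                # File vanished or is unreadable; skip it.
                pass
    return total


def record_snapshot(
    homes_root: Path, log_path: Path, now: Optional[float] = None
) -> List[Tuple[str, int]]:
    """Append one timestamped row per user directory to a CSV log.

    Assumes (hypothetically) that each subdirectory of homes_root
    belongs to exactly one user and is named after that user.
    """
    ts = int(time.time() if now is None else now)
    rows = [
        (p.name, dir_size_bytes(p))
        for p in sorted(homes_root.iterdir())
        if p.is_dir()
    ]
    new_file = not log_path.exists()
    with log_path.open("a", newline="") as f:
        writer = csv.writer(f)
        if new_file:
            writer.writerow(["timestamp", "user", "bytes"])
        for user, size in rows:
            writer.writerow([ts, user, size])
    return rows
```

Running record_snapshot on a schedule builds up a time series per user that a dashboard could plot, or that could be exported as a metric instead of being written to CSV.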

Challenges:

  • Calculating "cost per user" is complex due to the shared nature of resources (e.g., multiple profiles on a single node).
  • Obtaining live cost information from AWS is challenging.
  • Supporting multiple cloud vendors adds another layer of complexity.
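
One possible reading of the incremental/shared split from the requirements, as a rough sketch: each node's cost is assumed to already be separated into an incremental part (charged to whoever created the node) and a shared part (divided equally among the users on it). How to derive those two numbers from a real bill is exactly the open problem listed above, so NodeCost and allocate_costs are hypothetical names used for illustration only.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class NodeCost:
    """Cost record for one node (hypothetical structure)."""
    creator: str            # user whose request caused the node to start
    users: List[str]        # everyone who ran a server on the node
    incremental_cost: float
    shared_cost: float


def allocate_costs(nodes: List[NodeCost]) -> Dict[str, float]:
    """Charge incremental cost to the creator, split shared cost equally."""
    totals: Dict[str, float] = defaultdict(float)
    for node in nodes:
        totals[node.creator] += node.incremental_cost
        if node.users:
            share = node.shared_cost / len(node.users)
            for user in node.users:
                totals[user] += share
        else:
            # No recorded users: fall back to charging the creator.
            totals[node.creator] += node.shared_cost
    return dict(totals)
```

For example, a node created by alice and used by alice and bob, with an incremental cost of 1.0 and a shared cost of 4.0, would come out as 3.0 for alice and 2.0 for bob. The per-user totals always add up to the total node cost, which makes the result easy to check against the overall bill.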

Proposed MVP:

  1. Metrics Collection:

    • Enhance AWS metrics collection to include hourly data instead of just daily totals.
  2. Disk Usage Monitoring:

    • Implement a disk usage monitoring and cleanup procedure.
  3. Cost Anomaly Detection:

    • Use existing tools (e.g., @satra's anomaly detection system) for total cost anomaly detection.
  4. Grafana and Prometheus Integration:

    • Integrate with Grafana and Prometheus for comprehensive monitoring and alerting.

References:

This is a rough outline based on a convo with @asmacdo. Input and collaboration from the team will be crucial to refining the requirements to meet our needs.

@yarikoptic
Member

Might be worth investigating how Nebari handles this (#186).
