Fix and improve group status metrics graphs and dashboard URLs #1694

Merged: 4 commits merged into master from touyang/group_graphs on Aug 14, 2024

@tylerwowen tylerwowen commented Aug 13, 2024

The old links are broken, so this PR builds new statsboard graphs using the new metrics and new APIs.

The old graphs did not reference the right metrics, so the metrics source is updated (Count -> Rate).
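As an illustration of what a Count -> Rate switch can mean for a graph, here is a minimal sketch that turns cumulative counter samples into per-second rates. It is not the statsboard API or the query used in this PR; the function name `counter_to_rate`, the `(timestamp, count)` sample shape, and the counter-reset handling are all assumptions made for the example.

```python
from typing import List, Tuple

# Hypothetical sample shape: (timestamp in seconds, cumulative count).
Sample = Tuple[float, float]


def counter_to_rate(samples: List[Sample]) -> List[Sample]:
    """Convert cumulative counter samples into per-second rates.

    Each output point is stamped with the later timestamp of the pair
    it was computed from.
    """
    rates: List[Sample] = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        dt = t1 - t0
        if dt <= 0:
            continue  # skip duplicate or out-of-order timestamps
        delta = c1 - c0
        if delta < 0:
            # Assumption: a drop means the counter restarted from zero.
            delta = c1
        rates.append((t1, delta / dt))
    return rates
```

A rate graph built this way stays comparable across time windows and host restarts, whereas a raw count keeps growing and resets whenever the emitter restarts.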

Statsboard links are preferred over rendered graphs, so they are moved above the graphs.

[Screenshot, 2024-08-12 17:46:31]

Graphs: Group size, Provision Latency, Deploy Latency, Deploy success rate

@tylerwowen tylerwowen requested a review from a team as a code owner August 13, 2024 00:42
@github-actions github-actions bot added the deploy-board Includes changes to deploy-board label Aug 13, 2024
osoriano previously approved these changes Aug 14, 2024
@tylerwowen tylerwowen merged commit 5fcda63 into master Aug 14, 2024
6 checks passed
@tylerwowen tylerwowen deleted the touyang/group_graphs branch August 14, 2024 21:29
@tylerwowen tylerwowen mentioned this pull request Aug 15, 2024
5 tasks
tylerwowen added a commit that referenced this pull request Aug 16, 2024
## Changes

## TAS
Previously launch latency was emitted by a Rodimus worker. This PR implements the same measure in the TAS. 

Launch latency is defined as the duration from host launch to the completion of the first deployment. It is measured for all environments deployed on new hosts. The **Launch grace period** configuration is the user-set threshold for this latency, and it is used for AgentJanitor and the Rodimus health check.
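The definition above can be sketched in a few lines. This is only an illustration of the arithmetic, not the TAS implementation; the function names, parameter names, and the choice of seconds as the unit for the grace period are assumptions.

```python
from datetime import datetime, timedelta


def launch_latency(host_launched_at: datetime,
                   first_deploy_completed_at: datetime) -> timedelta:
    """Duration from host launch to completion of the first deployment."""
    if first_deploy_completed_at < host_launched_at:
        raise ValueError("first deploy cannot complete before the host launches")
    return first_deploy_completed_at - host_launched_at


def exceeds_grace_period(latency: timedelta, launch_grace_period_sec: int) -> bool:
    """Compare a latency against a user-set launch grace period.

    Assumption: the grace period is expressed in seconds.
    """
    return latency > timedelta(seconds=launch_grace_period_sec)
```

For example, a host launched at 12:00:00 whose first deployment completes at 12:07:30 has a launch latency of 450 seconds, which is within a 600-second grace period but exceeds a 300-second one.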

Another fix: I found that all first-deploy metrics were reported as success, so I updated the condition for when the `first_deploy` flag should be turned off.
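The commit message does not show the changed condition, so the following is only one way such a rule could look, written to match the test plan's `success=true` and `success=false` expectations. The status strings, the `HostDeployState` class, and the counter key are hypothetical and do not come from the Teletraan code base.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical terminal statuses; the real status names may differ.
TERMINAL_SUCCESS = "SUCCEEDED"
TERMINAL_FAILURE = "FAILED"


@dataclass
class HostDeployState:
    # True until the host's first deployment reaches a terminal state.
    first_deploy: bool = True


def record_deploy_status(state: HostDeployState, status: str, counters: Counter) -> None:
    """Count the first deploy exactly once, tagged with its real outcome."""
    if not state.first_deploy:
        return  # later deploys are not first-deploy events
    if status not in (TERMINAL_SUCCESS, TERMINAL_FAILURE):
        return  # still in progress: keep the flag on
    counters[("first_deploy", status == TERMINAL_SUCCESS)] += 1
    # Turn the flag off only after a terminal outcome has been counted.
    state.first_deploy = False
```

With a rule like this, a failed first deployment is counted with `success=false` instead of being dropped or later overwritten by a successful retry.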

## UI
While updating the group status page to include the launch latency, I realized that the original plan was to create a dashboard. So I removed the links added in #1694 and created a new Teletraan user dashboard that includes those changes.

Fixed the launch failure rate graph and updated some metrics calculations.

<img width="1641" alt="image" src="https://github.com/user-attachments/assets/ef62bb3b-7748-4afa-b66a-f0a89111f8b6">

## Test plan
### TAS
1. Deploy this PR to TAS dev1
2. Launch a new host in tyler/test
    - [x] Launch latency should be emitted 
    - [x] First deploy counter should increment with success=true.
3. Deploy a bad build to tyler/test
4. Launch a new host
    - [x] Launch latency should be emitted
    - [x] First deploy counter should increment with success=false.

### UI
1. Deploy this PR to deploy-board dev1
2. Visit group status page /groups/tyler-test/
    - [x] verify the link is updated and working