P1 - Success rate metric only calculated based on NoOrchestrator transcode errors #2674
Comments
@mjh1 promise this is the last one, but it's quite a good one for you to take since it involves understanding some of the Broadcaster logs and how we record metrics
Success rate metrics. Why? Improves our visibility and lets us compare better with Evan’s graph. Difficult to investigate right now
@yondonfu do you think rather than making sure we emit the metric for every failure scenario we could introduce a metric for "transcode started" (if it doesn't exist) then use e.g. abd1334
@mjh1 Does that mean that if B starts multiple transcodes for segments concurrently, during the period of time until the transcodes complete we could see that success rate metric drop since
ah yep you're totally right, will forget that idea
Reopen until change deployed
Describe the bug
The success rate Grafana graph here shows success rate consistently >= 100% even though we know that there have been transcode failures.
From reviewing the metrics code, it appears that the success rate metric is updated whenever there is a call to census.sendSuccess() in monitor/census.go. I see this method being called in at least two places, one of which is SegmentTranscodeFailed(), which is called whenever a transcode error is encountered.
For SegmentTranscodeFailed(), there is a concept of a "permanent" vs. "non-permanent" transcode error, indicated via the permanent bool passed to SegmentTranscodeFailed(). We can see non-permanent errors being recorded here. The only place where a permanent error is recorded is here, for NoOrchestrator transcode errors. This seems problematic because only permanent errors will trigger a call to census.sendSuccess() here when recording a transcode error. As a result, I don't think we are properly updating the success rate metric in at least two places:
To Reproduce
Steps to reproduce the behavior:
We could trigger the aforementioned errors that are not being factored in right now to see if the success rate is not affected. A solution should demonstrate that these errors cause the success rate to drop.
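As a rough illustration of the behavior described under "Describe the bug" and of the check proposed here, the sketch below models a success rate that is only recomputed for permanent errors. It is not the code in monitor/census.go: the census struct, its fields, the countAllFailures switch, segmentTranscoded() and simulate() are hypothetical, and segmentTranscodeFailed() is simplified to take only the permanent flag.

```go
package main

// census is a hypothetical, simplified stand-in for the metrics state.
// It is NOT the real type from monitor/census.go.
type census struct {
	countAllFailures  bool    // assumed switch: false models the reported behavior
	succeeded, failed int     // segments counted toward the success rate
	rate              float64 // last value produced by sendSuccess
}

// sendSuccess recomputes the success rate from the counters.
func (c *census) sendSuccess() {
	total := c.succeeded + c.failed
	if total == 0 {
		c.rate = 1
		return
	}
	c.rate = float64(c.succeeded) / float64(total)
}

// segmentTranscoded records a successful transcode (hypothetical helper).
func (c *census) segmentTranscoded() {
	c.succeeded++
	c.sendSuccess()
}

// segmentTranscodeFailed records a failed transcode. In the reported
// behavior, only permanent errors reach sendSuccess.
func (c *census) segmentTranscodeFailed(permanent bool) {
	if !permanent && !c.countAllFailures {
		return // non-permanent error: success rate is left untouched
	}
	c.failed++
	c.sendSuccess()
}

// simulate feeds two successes and two non-permanent failures through the
// model and returns the resulting success rate.
func simulate(countAllFailures bool) float64 {
	c := &census{countAllFailures: countAllFailures, rate: 1}
	c.segmentTranscoded()
	c.segmentTranscodeFailed(false)
	c.segmentTranscodeFailed(false)
	c.segmentTranscoded()
	return c.rate
}
```

Under these assumptions, simulate(false) leaves the rate at 1 despite two failed segments, while simulate(true) drops it to 0.5, which is the kind of drop a fix should be able to demonstrate.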
Expected behavior
I expect the success rate metric to properly factor in all transcode errors that result in no renditions for a segment that is passed in.
Generally, I see at least these categories of transcode errors that should cause success rate to drop: