P1 - Success rate metric only calculated based on NoOrchestrator transcode errors #2674

yondonfu · 2022-11-30T15:18:32Z

Describe the bug
A clear and concise description of what the bug is.

The success rate Grafana graph here shows success rate consistently >= 100% even though we know that there have been transcode failures.

From reviewing the metrics code, it appears that the success rate metric is updated whenever there is a call to census.sendSuccess() in monitor/census.go.

I see this method being called in at least two places:

In SegmentFullyTranscoded() which is called here after B finishes downloading all results from O
In segmentTranscodeFailed() which is called in SegmentTranscodeFailed() which is called whenever a transcode error is encountered

For 2, there is a concept of a "permanent" vs. "non-permanent" transcode error indicated via the permanent bool passed to SegmentTranscodeFailed(). We can see non-permanent errors being recorded here. The only place where there is a permanent error recorded is here for NoOrchestrator transcode errors. This seems problematic because only permanent errors will trigger a call to census.sendSuccess() here when recording a transcode error. As a result, I don't think we are properly updating the success rate metric in at least two places:

When we hit the max # of transcode attempts which prompts B to give up on a segment here
When we hit a non-retryable transcode error which prompts B to give up on a segment here
When the caller (i.e. HTTP push client) gives up on the transcode resulting in a context cancellation here

To Reproduce
Steps to reproduce the behavior:

Go to '...'
Click on '....'
Scroll down to '....'
See error

We could trigger the aforementioned errors that are not being factored in right now to see if the success rate is not affected. A solution should demonstrate that these errors cause the success rate to drop.

Expected behavior
A clear and concise description of what you expected to happen.

I expect the success rate metric to properly factor in all transcode errors that result in no renditions for a segment that is passed in.

Generally, I see at least these categories of transcode errors that should cause success rate to drop:

If B hits the max # of transcode attempts - B should give up on the segment b/c it tried enough times already
If B hits a non-retryable error - B should give up on the segment b/c it knows that this segment likely just cannot be transcoded
If B knows that the caller (i.e. HTTP push client) is no longer waiting for a result - B should give up on the segment b/c it knows no one cares about the results anymore because the transcode was too slow

Screenshots
If applicable, add screenshots to help explain your problem.

Desktop (please complete the following information):

OS: [e.g. iOS]
Browser [e.g. chrome, safari]
Version [e.g. 22]

Smartphone (please complete the following information):

Device: [e.g. iPhone6]
OS: [e.g. iOS8.1]
Browser [e.g. stock browser, safari]
Version [e.g. 22]

Additional context
Add any other context about the problem here.

The text was updated successfully, but these errors were encountered:

thomshutt · 2022-12-07T14:39:57Z

@mjh1 promise this is the last one, but it's quite a good one for you to take since it involves understanding some of the Broadcaster logs and how we record metrics

Quintendos · 2022-12-07T16:10:31Z

success rate metrics Why? Improves our visibility and can compare better with Evan’s graph. Difficult to investigate right now

mjh1 · 2022-12-12T11:10:50Z

@yondonfu do you think rather than making sure we emit the metric for every failure scenario we could introduce a metric for "transcode started" (if it doesn't exist) then use transcode_success/transcode_start for our success rate? That way we wouldn't have to worry about potentially missing any obscure failure scenarios?

e.g. abd1334
but then I see in some cases the SegmentTranscoded metric is fired from the different transcoder implementations so maybe we would need to fire the transcode start metric there too.

yondonfu · 2023-01-04T18:14:47Z

do you think rather than making sure we emit the metric for every failure scenario we could introduce a metric for "transcode started" (if it doesn't exist) then use transcode_success/transcode_start for our success rate? That way we wouldn't have to worry about potentially missing any obscure failure scenarios?

@mjh1 Does that mean that if B starts multiple transcodes for segments concurrently, during the period of time until the transcodes complete we could see that success rate metric drop since transcode_start would be incremented, but B is still waiting for the transcodes to complete? That's the main concern that comes to mind and reason for updating the success metric only when a transcode is complete (either with a success or a legit failure).

mjh1 · 2023-01-06T16:49:43Z

ah yep you're totally right, will forget that idea

mjh1 · 2023-01-13T10:06:59Z

Reopen until change deployed

github-actions bot added the status: triage this issue has not been evaluated yet label Nov 30, 2022

yondonfu added area: broadcasting area: metrics and removed status: triage this issue has not been evaluated yet labels Nov 30, 2022

Quintendos mentioned this issue Dec 5, 2022

Success rate metric only calculated based on NoOrchestrator transcode errors livepeer/catalyst-api#261

Closed

thomshutt assigned mjh1 Dec 7, 2022

Quintendos changed the title ~~Success rate metric only calculated based on NoOrchestrator transcode errors~~ P1 - Success rate metric only calculated based on NoOrchestrator transcode errors Dec 7, 2022

mjh1 mentioned this issue Dec 9, 2022

Transcode success rate metric fixes #2684

Merged

5 tasks

mjh1 closed this as completed in #2684 Jan 13, 2023

mjh1 reopened this Jan 13, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

P1 - Success rate metric only calculated based on NoOrchestrator transcode errors #2674

P1 - Success rate metric only calculated based on NoOrchestrator transcode errors #2674

yondonfu commented Nov 30, 2022 •

edited

Loading

thomshutt commented Dec 7, 2022

Quintendos commented Dec 7, 2022

mjh1 commented Dec 12, 2022 •

edited

Loading

yondonfu commented Jan 4, 2023

mjh1 commented Jan 6, 2023

mjh1 commented Jan 13, 2023

P1 - Success rate metric only calculated based on NoOrchestrator transcode errors #2674

P1 - Success rate metric only calculated based on NoOrchestrator transcode errors #2674

Comments

yondonfu commented Nov 30, 2022 • edited Loading

thomshutt commented Dec 7, 2022

Quintendos commented Dec 7, 2022

mjh1 commented Dec 12, 2022 • edited Loading

yondonfu commented Jan 4, 2023

mjh1 commented Jan 6, 2023

mjh1 commented Jan 13, 2023

yondonfu commented Nov 30, 2022 •

edited

Loading

mjh1 commented Dec 12, 2022 •

edited

Loading