Switch to image pull if delta failure #2396

Open
cywang117 wants to merge 2 commits into master from switch-to-image-pull-if-delta-failure

Conversation

@cywang117 (Contributor) commented Jan 7, 2025

  • Revert to a regular pull immediately on an HTTP 4xx delta server response when initially requesting the delta resource.
  • Retry the delta pull DELTA_APPLY_RETRY_COUNT (3) times when there's a delta failure during delta apply, such as Error: layers from manifest don't match image configuration (balena-engine#244). Once the retry count is exhausted, fall back to a regular pull. Delta apply happens as part of the image pull, after the delta resource has been successfully requested from the delta server; delta apply errors can occur before or after the image pull (see the cases which raise the relevant Engine error here). A simplified sketch of the fallback flow follows this description.

Change-type: patch
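
A simplified sketch of the fallback flow described above, with names modeled on the diff excerpts later in this conversation (DeltaServerError, the fetch helpers); the signatures and bodies here are stand-ins, not the PR's actual code:

export class DeltaServerError extends Error {
	constructor(
		public statusCode: number,
		public statusMessage?: string,
	) {
		super(`Delta server responded with HTTP ${statusCode}`);
	}
}

// Stand-in for the real delta fetch; throws DeltaServerError on HTTP 4xx.
async function fetchDelta(imgDest: string): Promise<string> {
	throw new DeltaServerError(404, 'Not Found');
}

// Stand-in for a regular (non-delta) image pull.
async function fetchImage(imgDest: string): Promise<string> {
	return `pulled:${imgDest}`;
}

async function fetchWithFallback(imgDest: string): Promise<string> {
	try {
		return await fetchDelta(imgDest);
	} catch (e) {
		if (e instanceof DeltaServerError) {
			// The server cannot supply the delta resource: fall back immediately.
			console.log(`Falling back to regular pull (${e.statusCode})`);
			return await fetchImage(imgDest);
		}
		throw e;
	}
}

// Usage: fetchWithFallback('my-app:latest').then(console.log);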

@cywang117 requested a review from pipex on January 7, 2025 01:01
@flowzone-app (bot) enabled auto-merge on January 7, 2025 01:35
src/lib/docker-utils.ts (outdated)
return await fetchImageWithProgress(imgDest, deltaOpts, onProgress);
} else if (e instanceof DeltaServerError) {
logFn(
`Falling back to regular pull due to delta server error (HTTP code ${e.statusCode})${e.statusMessage ? `: ${e.statusMessage}` : ''}`,
@pipex (Contributor) commented Jan 13, 2025

It might be better to be more concise with errors. The only thing I could think of is:

Suggested change:
- `Falling back to regular pull due to delta server error (HTTP code ${e.statusCode})${e.statusMessage ? `: ${e.statusMessage}` : ''}`,
+ `Falling back to regular pull due to delta server error (${e.statusCode})${e.statusMessage ? `: ${e.statusMessage}` : ''}`,

src/lib/docker-utils.ts (outdated)
@pipex (Contributor) commented Jan 13, 2025

Nice work!

Commit messages:

If the delta server responds immediately with HTTP 4xx upon requesting a delta image, this means the server is not able to supply the resource, so fall back to a regular pull immediately.

Change-type: patch
Signed-off-by: Christina Ying Wang <[email protected]>

…e reverting to regular pull

This prevents an image download error loop where the delta image on the delta server is present,
but some aspect of the delta image or the base image on the device does not match up, causing
the delta to fail to be applied to the base image.

Upon delta apply errors exceeding DELTA_APPLY_RETRY_COUNT, revert to a regular pull.

Change-type: patch
Signed-off-by: Christina Ying Wang <[email protected]>
@cywang117 force-pushed the switch-to-image-pull-if-delta-failure branch from dad6eee to 37b88f7 on January 13, 2025 18:39
@cywang117 (Contributor, Author) commented Jan 13, 2025

Changes made, @pipex, thank you!

id = await applyBalenaDelta(name, token, onProgress, logFn);
// Try to apply delta DELTA_APPLY_RETRY_COUNT times, then throw DeltaApplyError
let lastError: StatusError | undefined = undefined;
for (
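
For orientation, a sketch of what the truncated retry loop above plausibly looks like, reconstructed from the inline comment and the fragments quoted later in this thread. This is a reconstruction continuing the excerpt's context, not the actual diff, and DeltaApplyError's constructor is assumed:

let lastError: StatusError | undefined;
let id: string | undefined;
for (let attempt = 0; attempt < DELTA_APPLY_RETRY_COUNT; attempt++) {
	try {
		id = await applyBalenaDelta(name, token, onProgress, logFn);
		lastError = undefined;
		break;
	} catch (e) {
		// Non-status errors are not delta apply failures; propagate them.
		if (!isStatusError(e)) {
			throw e;
		}
		lastError = e;
	}
}
if (lastError != null) {
	// Retries exhausted: signal the caller to fall back to a regular pull.
	throw new DeltaApplyError(lastError.message);
}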
@pipex (Contributor) commented:

Actually, I thought about it a bit more. There is already retry behavior built into the state engine, so this adds another retry loop on top of the existing one. The counter should live somewhere in memory, be incremented each time a pull is retried, and trigger the fallback to a regular pull once the count reaches DELTA_APPLY_RETRY_COUNT.

That code can probably live alongside the imageFetchFailures counter we already have here:

if (imageFetchFailures[image.name] != null) {
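
A minimal sketch of the suggested approach, assuming a module-level map next to imageFetchFailures; everything here other than DELTA_APPLY_RETRY_COUNT and the general shape is illustrative, not the Supervisor's actual code:

const DELTA_APPLY_RETRY_COUNT = 3;
const deltaApplyFailures: { [imageName: string]: number } = {};

// The state engine already retries failed pulls; instead of a nested retry
// loop, count delta failures here and let the next scheduled retry switch
// to a regular pull once the count reaches DELTA_APPLY_RETRY_COUNT.
async function fetchImage(
	imageName: string,
	fetchDelta: () => Promise<string>,
	regularPull: () => Promise<string>,
): Promise<string> {
	const useDelta =
		(deltaApplyFailures[imageName] ?? 0) < DELTA_APPLY_RETRY_COUNT;
	try {
		const id = await (useDelta ? fetchDelta() : regularPull());
		delete deltaApplyFailures[imageName]; // success resets the counter
		return id;
	} catch (e) {
		if (useDelta) {
			deltaApplyFailures[imageName] =
				(deltaApplyFailures[imageName] ?? 0) + 1;
		}
		throw e; // let the state engine schedule the retry
	}
}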

@pipex (Contributor) left a comment:

I had approved but I had some additional thoughts this morning. See https://github.com/balena-os/balena-supervisor/pull/2396/files#r1914822271

if (!isStatusError(e)) {
throw e;
}
lastError = e;
@pipex (Contributor) commented:

This will also revert to regular pulls if the delta pull fails for any reason, including a network timeout. Do we know if there is something particular in the status returned by the engine when the 'layers from manifest...' error happens?

@cywang117 (Contributor, Author) commented Jan 14, 2025

Short of running a regex match on the error message, I don't see how we would differentiate a layers from manifest error from a network issue (context exceeded / Client.Timeout / etc.), as neither raises an HTTP status code. (A sketch of such a match follows below.)

The pattern for the layers from manifest error doesn't include a status code.

I couldn't find any good examples of network error messages during pull; here's the closest one, but there's no status code there either.
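
A minimal sketch of such a message-based match, assuming the Engine surfaces the raw message on a standard Error; the helper name is illustrative:

// The message text is the only discriminator available, since neither the
// rootfs-mismatch error nor a network failure carries an HTTP status code.
const ROOTFS_MISMATCH = /layers from manifest don't match image configuration/;

function isRootFSMismatch(e: unknown): boolean {
	return e instanceof Error && ROOTFS_MISMATCH.test(e.message);
}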

@pipex (Contributor) commented:

We might need to take a look at the engine code to check for a list of possible HTTP responses when a pull happens.


@cywang117 (Contributor, Author) commented Jan 14, 2025

> We might need to take a look at the engine code to check for a list of possible HTTP responses when a pull happens.

I've already taken a look while making this PR; the Engine code isn't specific about error codes when raising an errRootFSMismatch / layers from manifest error. The best approach I can think of is to reproduce this specific issue and observe the HTTP codes, or to do so the next time a user hits this issue; however, we don't know how to reproduce it.

My assumption is that an errRootFSMismatch would raise a 4xx error, whereas a network issue should be propagated by the Engine as a 5xx error, but we'd be guessing, even with the link to the docs, which doesn't cover this case specifically: deltas are only part of our Engine fork, after all.

@pipex (Contributor) commented:

This is a similar error that happens on delta pull, and it can be replicated: https://balena.fibery.io/favorites/Inputs/Pattern/2172

@cywang117 (Contributor, Author) commented:

Thanks @pipex. I used this example to make the Supervisor log a more descriptive error. There isn't an HTTP status; the error is thrown from applyBalenaDelta, which is a function in docker-progress. It looks like a registry stream error should get propagated from docker-progress here, but it wraps the error from the registry stream in a generic Error before propagating it, which I believe swallows any other props the error might have had. For example:

import { TypedError } from 'typed-error';

export class StatusCodeError extends TypedError {
    public statusCode: number;
    constructor(message: string, statusCode: number) {
        super(message);
        this.name = "StatusCodeError";
        this.statusCode = statusCode;
    }
}

const error = new StatusCodeError("Not found", 404);
const wrapped = new Error(error as any);

console.log({
    message: wrapped.message,
    statusCode: (wrapped as any).statusCode, // undefined: statusCode was not carried over to the wrapping Error
    stack: wrapped.stack
});

I'm going to PR docker-progress once I get access, to propagate the error directly if it's an instance of a Node.js Error. Maybe that'll reveal more info, maybe it won't. Will report back. (A sketch of the shape of that change is below.)
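
A minimal sketch of the shape that docker-progress change might take; the helper and its placement are assumptions, not docker-progress's actual API:

// Pass a registry stream error through unchanged when it's already a
// Node.js Error, so custom props such as statusCode survive, instead of
// wrapping it in a generic Error (which stringifies it and drops props).
function toPropagatableError(err: unknown): Error {
	return err instanceof Error ? err : new Error(String(err));
}

// e.g. in the pull stream handling (illustrative):
// stream.on('error', (err) => reject(toPropagatableError(err)));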
