-
Notifications
You must be signed in to change notification settings - Fork 64
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Switch to image pull if delta failure #2396
base: master
Are you sure you want to change the base?
Conversation
src/lib/docker-utils.ts
Outdated
return await fetchImageWithProgress(imgDest, deltaOpts, onProgress); | ||
} else if (e instanceof DeltaServerError) { | ||
logFn( | ||
`Falling back to regular pull due to delta server error (HTTP code ${e.statusCode})${e.statusMessage ? `: ${e.statusMessage}` : ''}`, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be better to be more concise with errors? The only thing I could think about is
`Falling back to regular pull due to delta server error (HTTP code ${e.statusCode})${e.statusMessage ? `: ${e.statusMessage}` : ''}`, | |
`Falling back to regular pull due to delta server error (${e.statusCode})${e.statusMessage ? `: ${e.statusMessage}` : ''}`, |
Nice work! |
If the delta server responds immediately with HTTP 4xx upon requesting a delta image, this means the server is not able to supply the resource, so fall back to a regular pull immediately. Change-type: patch Signed-off-by: Christina Ying Wang <[email protected]>
…e reverting to regular pull This prevents an image download error loop where the delta image on the delta server is present, but some aspect of the delta image or the base image on the device does not match up, causing the delta to fail to be applied to the base image. Upon delta apply errors exceeding DELTA_APPLY_RETRY_COUNT, revert to a regular pull. Change-type: patch Signed-off-by: Christina Ying Wang <[email protected]>
dad6eee
to
37b88f7
Compare
Changes made @pipex , thank you! |
id = await applyBalenaDelta(name, token, onProgress, logFn); | ||
// Try to apply delta DELTA_APPLY_RETRY_COUNT times, then throw DeltaApplyError | ||
let lastError: StatusError | undefined = undefined; | ||
for ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I thought about it a bit more. There is already retry behavior builtin on the state engine. This is adding another retry loop on top of the existing one. The counter should live somewhere in memory and be increased when a new pull is retried and revert back to regular pull when the count has reached DELTA_APPLY_RETRY_COUNT
That code can probably live along the imageFetchFailures
counter we already have here
balena-supervisor/src/compose/images.ts
Line 172 in dad6eee
if (imageFetchFailures[image.name] != null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had approved but I had some additional thoughts this morning. See https://github.com/balena-os/balena-supervisor/pull/2396/files#r1914822271
if (!isStatusError(e)) { | ||
throw e; | ||
} | ||
lastError = e; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will also revert to regular pulls if the delta pull fails for any reason, including a network timeout. Do we know if there is something particular in the status returned by the engine when the 'layers from manifest...` error happens?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Short of running a regex match on the error message, I don't see how we would differentiate a layers from manifest
error from a network issue (context exceeded
/ Client.Timeout
/ etc), as neither raise an HTTP status code.
Pattern for layers from manifest error doesn't include status code
Couldn't find any good examples of network error messages during pull, here's the closest one but there's no status code either
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We might need to take a look at the engine code to check for a list of possible HTTP responses when a pull happens
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This might be a good place to start from https://docs.docker.com/reference/api/engine/version/v1.47/#tag/Image/operation/ImageCreate
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We might need to take a look at the engine code to check for a list of possible HTTP responses when a pull happens
I've taken a look already when making this PR, the Engine code isn't specific about error codes when raising an errRootFSMismatch
/ layers from manifest
error. The best way I can think of is to reproduce this specific issue and observe the HTTP codes, or do so next time a user is facing this issue, however we don't know how to reproduce.
My assumption that an errRootFSMismatch
would raise a 4xx error whereas a network issue should be propagated by the Engine as a 5xx error, but we'd be guessing, even with the link to the docs which doesn't cover this case specifically -- deltas are only part of our Engine fork, after all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a similar error that happens on delta pull and it can be replicated https://balena.fibery.io/favorites/Inputs/Pattern/2172
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @pipex . I used this example to make the Supervisor log a more descriptive error. There isn't an HTTP status, the error is thrown from applyBalenaDelta
, which is a function in docker-progress. Looks like a registry stream error should get propagated from docker-progress here, but it wraps the error from the registry stream in a generic error before propagating the error, which I believe swallows any other props the error might have had. For example:
export class StatusCodeError extends TypedError {
public statusCode: number;
constructor(message: string, statusCode: number) {
super(message);
this.name = "MyError";
this.statusCode = statusCode;
}
}
const error = new StatusCodeError("Not found", 404);
const wrapped = new Error(error as any);
console.log({
message: wrapped.message,
statusCode: (wrapped as any).statusCode, // statusCode was not added to the error
stack: wrapped.stack
});
I'm going to PR docker-progress once i get access to propagate the error directly if it's an instanceof NodeJS Error. Maybe that'll reveal more info, maybe that won't. Will report back
Change-type: patch