Switch to image pull if delta failure #2396

Open
cywang117 wants to merge 2 commits into master from switch-to-image-pull-if-delta-failure

Conversation

@cywang117 (Contributor) commented Jan 7, 2025

  • Revert to a regular pull immediately on an HTTP 4xx delta server response when initially requesting the delta resource.
  • Retry the delta pull DELTA_APPLY_RETRY_COUNT (3) times when there's a delta failure during delta apply, such as Error: layers from manifest don't match image configuration (balena-engine#244). Once the retry count is exhausted, fall back to a regular pull. Delta apply happens as part of the image pull, after the delta resource has been successfully requested from the delta server; delta apply errors can occur before or after the image pull (see the cases which raise the relevant Engine error here). A simplified sketch of the fallback flow follows this description.

Change-type: patch
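
A simplified sketch of the fallback flow described above, with names modeled on the diff excerpts later in this conversation (DeltaServerError, the fetch helpers); the signatures and bodies here are stand-ins, not the PR's actual code:

export class DeltaServerError extends Error {
	constructor(
		public statusCode: number,
		public statusMessage?: string,
	) {
		super(`Delta server responded with HTTP ${statusCode}`);
	}
}

// Stand-in for the real delta fetch; throws DeltaServerError on HTTP 4xx.
async function fetchDelta(imgDest: string): Promise<string> {
	throw new DeltaServerError(404, 'Not Found');
}

// Stand-in for a regular (non-delta) image pull.
async function fetchImage(imgDest: string): Promise<string> {
	return `pulled:${imgDest}`;
}

async function fetchWithFallback(imgDest: string): Promise<string> {
	try {
		return await fetchDelta(imgDest);
	} catch (e) {
		if (e instanceof DeltaServerError) {
			// The server cannot supply the delta resource: fall back immediately.
			console.log(`Falling back to regular pull (${e.statusCode})`);
			return await fetchImage(imgDest);
		}
		throw e;
	}
}

// Usage: fetchWithFallback('my-app:latest').then(console.log);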

@cywang117 requested a review from pipex on January 7, 2025 01:01
@flowzone-app (bot) enabled auto-merge on January 7, 2025 01:35
src/lib/docker-utils.ts (outdated)
return await fetchImageWithProgress(imgDest, deltaOpts, onProgress);
} else if (e instanceof DeltaServerError) {
logFn(
`Falling back to regular pull due to delta server error (HTTP code ${e.statusCode})${e.statusMessage ? `: ${e.statusMessage}` : ''}`,
@pipex (Contributor) commented Jan 13, 2025

It might be better to be more concise with errors. The only thing I could think of is:

Suggested change:
- `Falling back to regular pull due to delta server error (HTTP code ${e.statusCode})${e.statusMessage ? `: ${e.statusMessage}` : ''}`,
+ `Falling back to regular pull due to delta server error (${e.statusCode})${e.statusMessage ? `: ${e.statusMessage}` : ''}`,

src/lib/docker-utils.ts (outdated)
@pipex (Contributor) commented Jan 13, 2025

Nice work!

Commit messages:

If the delta server responds immediately with HTTP 4xx upon requesting a delta image, this means the server is not able to supply the resource, so fall back to a regular pull immediately.

Change-type: patch
Signed-off-by: Christina Ying Wang <[email protected]>

…e reverting to regular pull

This prevents an image download error loop where the delta image on the delta server is present,
but some aspect of the delta image or the base image on the device does not match up, causing
the delta to fail to be applied to the base image.

Upon delta apply errors exceeding DELTA_APPLY_RETRY_COUNT, revert to a regular pull.

Change-type: patch
Signed-off-by: Christina Ying Wang <[email protected]>
@cywang117 force-pushed the switch-to-image-pull-if-delta-failure branch from dad6eee to 37b88f7 on January 13, 2025 18:39
@cywang117 (Contributor, Author) commented Jan 13, 2025

Changes made, @pipex, thank you!

id = await applyBalenaDelta(name, token, onProgress, logFn);
// Try to apply delta DELTA_APPLY_RETRY_COUNT times, then throw DeltaApplyError
let lastError: StatusError | undefined = undefined;
for (
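
For orientation, a sketch of what the truncated retry loop above plausibly looks like, reconstructed from the inline comment and the fragments quoted later in this thread. This is a reconstruction continuing the excerpt's context, not the actual diff, and DeltaApplyError's constructor is assumed:

let lastError: StatusError | undefined;
let id: string | undefined;
for (let attempt = 0; attempt < DELTA_APPLY_RETRY_COUNT; attempt++) {
	try {
		id = await applyBalenaDelta(name, token, onProgress, logFn);
		lastError = undefined;
		break;
	} catch (e) {
		// Non-status errors are not delta apply failures; propagate them.
		if (!isStatusError(e)) {
			throw e;
		}
		lastError = e;
	}
}
if (lastError != null) {
	// Retries exhausted: signal the caller to fall back to a regular pull.
	throw new DeltaApplyError(lastError.message);
}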
@pipex (Contributor) commented:

Actually, I thought about it a bit more. There is already retry behavior built into the state engine, so this adds another retry loop on top of the existing one. The counter should live somewhere in memory, be incremented each time a pull is retried, and trigger the fallback to a regular pull once the count reaches DELTA_APPLY_RETRY_COUNT.

That code can probably live alongside the imageFetchFailures counter we already have here:

if (imageFetchFailures[image.name] != null) {
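
A minimal sketch of the suggested approach, assuming a module-level map next to imageFetchFailures; everything here other than DELTA_APPLY_RETRY_COUNT and the general shape is illustrative, not the Supervisor's actual code:

const DELTA_APPLY_RETRY_COUNT = 3;
const deltaApplyFailures: { [imageName: string]: number } = {};

// The state engine already retries failed pulls; instead of a nested retry
// loop, count delta failures here and let the next scheduled retry switch
// to a regular pull once the count reaches DELTA_APPLY_RETRY_COUNT.
async function fetchImage(
	imageName: string,
	fetchDelta: () => Promise<string>,
	regularPull: () => Promise<string>,
): Promise<string> {
	const useDelta =
		(deltaApplyFailures[imageName] ?? 0) < DELTA_APPLY_RETRY_COUNT;
	try {
		const id = await (useDelta ? fetchDelta() : regularPull());
		delete deltaApplyFailures[imageName]; // success resets the counter
		return id;
	} catch (e) {
		if (useDelta) {
			deltaApplyFailures[imageName] =
				(deltaApplyFailures[imageName] ?? 0) + 1;
		}
		throw e; // let the state engine schedule the retry
	}
}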

@pipex (Contributor) left a comment:

I had approved but I had some additional thoughts this morning. See https://github.com/balena-os/balena-supervisor/pull/2396/files#r1914822271

if (!isStatusError(e)) {
throw e;
}
lastError = e;
@pipex (Contributor) commented:

This will also revert to regular pulls if the delta pull fails for any reason, including a network timeout. Do we know if there is something particular in the status returned by the engine when the 'layers from manifest...' error happens?

@cywang117 (Contributor, Author) commented Jan 14, 2025

Short of running a regex match on the error message, I don't see how we would differentiate a layers from manifest error from a network issue (context exceeded / Client.Timeout / etc.), as neither raises an HTTP status code. (A sketch of such a match follows below.)

The pattern for the layers from manifest error doesn't include a status code.

I couldn't find any good examples of network error messages during pull; here's the closest one, but there's no status code there either.
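
A minimal sketch of such a message-based match, assuming the Engine surfaces the raw message on a standard Error; the helper name is illustrative:

// The message text is the only discriminator available, since neither the
// rootfs-mismatch error nor a network failure carries an HTTP status code.
const ROOTFS_MISMATCH = /layers from manifest don't match image configuration/;

function isRootFSMismatch(e: unknown): boolean {
	return e instanceof Error && ROOTFS_MISMATCH.test(e.message);
}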

@pipex (Contributor) commented:

We might need to take a look at the engine code to check for a list of possible HTTP responses when a pull happens.


@cywang117 (Contributor, Author) commented Jan 14, 2025

> We might need to take a look at the engine code to check for a list of possible HTTP responses when a pull happens.

I've already taken a look while making this PR; the Engine code isn't specific about error codes when raising an errRootFSMismatch / layers from manifest error. The best approach I can think of is to reproduce this specific issue and observe the HTTP codes, or to do so the next time a user hits this issue; however, we don't know how to reproduce it.

My assumption is that an errRootFSMismatch would raise a 4xx error, whereas a network issue should be propagated by the Engine as a 5xx error, but we'd be guessing, even with the link to the docs, which doesn't cover this case specifically: deltas are only part of our Engine fork, after all.

@pipex (Contributor) commented:

This is a similar error that happens on delta pull, and it can be replicated: https://balena.fibery.io/favorites/Inputs/Pattern/2172

@cywang117 (Contributor, Author) commented:

Thanks @pipex. I used this example to make the Supervisor log a more descriptive error. There isn't an HTTP status; the error is thrown from applyBalenaDelta, which is a function in docker-progress. It looks like a registry stream error should get propagated from docker-progress here, but it wraps the error from the registry stream in a generic Error before propagating it, which I believe swallows any other props the error might have had. For example:

import { TypedError } from 'typed-error';

export class StatusCodeError extends TypedError {
    public statusCode: number;
    constructor(message: string, statusCode: number) {
        super(message);
        this.name = "StatusCodeError";
        this.statusCode = statusCode;
    }
}

const error = new StatusCodeError("Not found", 404);
const wrapped = new Error(error as any);

console.log({
    message: wrapped.message,
    statusCode: (wrapped as any).statusCode, // undefined: statusCode was not carried over to the wrapping Error
    stack: wrapped.stack
});

I'm going to PR docker-progress once I get access, to propagate the error directly if it's an instance of a Node.js Error. Maybe that'll reveal more info, maybe it won't. Will report back. (A sketch of the shape of that change is below.)
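
A minimal sketch of the shape that docker-progress change might take; the helper and its placement are assumptions, not docker-progress's actual API:

// Pass a registry stream error through unchanged when it's already a
// Node.js Error, so custom props such as statusCode survive, instead of
// wrapping it in a generic Error (which stringifies it and drops props).
function toPropagatableError(err: unknown): Error {
	return err instanceof Error ? err : new Error(String(err));
}

// e.g. in the pull stream handling (illustrative):
// stream.on('error', (err) => reject(toPropagatableError(err)));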
