Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

perf: Parallelize downloadContainerdWasmShims function / fix download error bug #4126

Merged

Conversation

zachary-bailey
Copy link
Collaborator

@zachary-bailey zachary-bailey commented Feb 29, 2024

What type of PR is this?

/kind performance

What this PR does / why we need it:

This PR changes the execution of downloadContainerdWasmShims to make use of parallel processing. It increases the speed of the function by 5 minutes on average.

Which issue(s) this PR fixes:

This is one of a variety of incremental performance improvements that will ultimately increase the speed of VHD Build times.

Requirements:

@coveralls
Copy link

coveralls commented Feb 29, 2024

Pull Request Test Coverage Report for Build 8653558851

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 79.266%

Totals Coverage Status
Change from base Build 8652187564: 0.0%
Covered Lines: 2355
Relevant Lines: 2971

💛 - Coveralls

fi
done
wait ${pids[@]}
for shim_version in $CONTAINERD_WASM_VERSIONS; do
chmod 755 "$containerd_wasm_filepath/containerd-shim-wws-${binary_version}-v1"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

out of curiosity, why can't this part be parallelized too? also i think you want to redeclare binary_version in the for loop here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lilypan26 Hey! Thank you for the tip on binary_version. I initially took it out because I read the variable would still be in scope, but didn't take into account that its value needs to change as the for loop iterates so it can set permissions on each version.

For the permissions part, the first two downloads (spin / slight) are running sequentially and complete in about a second, so they will definitely be downloaded by the time the program reaches the code that sets permissions for their directories. For the last permission, I had to ensure that the 3 background processes (downloading shim_wws x 3) were complete before reaching that code, otherwise it could attempt to set permissions on a directory that doesn't exist on the VHD yet. I currently only have wws parallelized because downloading the first two and setting their permissions only takes about a second for each binary version.

If you think it makes more sense to parallelize everything for speed or stylistic reasons, please let me know! And thanks again.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah i see, then yea for safety we should only access the directory after we are sure it is created.

Although for spin and slight, even if they are much faster than wws i think it wouldn't hurt to parallelize those are well and then move all of the chmods to the for loop afterwards. This might be more consistent and readable, but up to you. If not, i would at least add a comment where you are downloading wws and explain that you are parallelizing wws download and only changing the permissions after you are sure the file is there.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lilypan26 I implemented your suggestions, thanks!

That being said, I noticed a flakey little error that has been happening for some time now (before my changes) regarding setting permissions. Im looking into it then will submit another PR.

@zachary-bailey zachary-bailey force-pushed the zacharybailey/downloadContainerdWasmShimsOptimization branch from 991b2e8 to f27b93e Compare March 11, 2024 16:02
if [[ $(isARM64) == 1 ]]; then
containerd_wasm_url="https://acs-mirror.azureedge.net/containerd-wasm-shims/${shim_version}/linux/arm64"
fi

if [ ! -f "$containerd_wasm_filepath/containerd-shim-spin-${shim_version}" ] || [ ! -f "$containerd_wasm_filepath/containerd-shim-slight-${shim_version}" ]; then
retrycmd_if_failure 30 5 60 curl -fSLv -o "$containerd_wasm_filepath/containerd-shim-spin-${binary_version}-v1" "$containerd_wasm_url/containerd-shim-spin-v1" 2>&1 | tee $CURL_OUTPUT >/dev/null | grep -E "^(curl:.*)|([eE]rr.*)$" && (cat $CURL_OUTPUT && exit $ERR_KRUSTLET_DOWNLOAD_TIMEOUT)
retrycmd_if_failure 30 5 60 curl -fSLv -o "$containerd_wasm_filepath/containerd-shim-slight-${binary_version}-v1" "$containerd_wasm_url/containerd-shim-slight-v1" 2>&1 | tee $CURL_OUTPUT >/dev/null | grep -E "^(curl:.*)|([eE]rr.*)$" && (cat $CURL_OUTPUT && exit $ERR_KRUSTLET_DOWNLOAD_TIMEOUT)
retrycmd_if_failure 30 5 60 curl -fSLv -o "$containerd_wasm_filepath/containerd-shim-wws-${binary_version}-v1" "$containerd_wasm_url/containerd-shim-wws-v1" 2>&1 | tee $CURL_OUTPUT >/dev/null | grep -E "^(curl:.*)|([eE]rr.*)$" && (cat $CURL_OUTPUT && exit $ERR_KRUSTLET_DOWNLOAD_TIMEOUT)
if [ "$shim_version" = "v0.8.0" ]; then
# Only download the wws shim for v0.8.0, wws shim for v0.3.0 and v0.5.1 fail
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we remove those two versions in a separate PR? And share it with @Mossaka for a review?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Separate PR for parallelization vs removal of component downloads that 404.
The latter might need a cleanup of the containerd config too.

Copy link
Collaborator Author

@zachary-bailey zachary-bailey Mar 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So it seems that in baker.go, it actually accounts for the fact that wws is not available for v0.3.0 and v0.5.1. It makes no mention of wws until v0.8.0.

Relevant code in lines 1219 - 1237 & 1331 - 1349
https://github.com/Azure/AgentBaker/blob/master/pkg/agent/baker.go

Could this mean that the bug is confined to the downloadContainerdWasmShims function in cse_install.sh, since it doesn't account for wws not being available on v0.3.0 and v0.5.1 in the for loop, but baker.go seems to for the config?

If so, I can change the for loop to treat v0.8.0 as a special case and download wws for it. Or I can delete the other two versions entirely as originally requested. The parallelization PR is in the VHD Build Test Pipeline. @ganeshkumarashok

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deleting the versions that fail during the download would be cleaner as those versions will not be used

@zachary-bailey
Copy link
Collaborator Author

@Mossaka Good morning,

I am the engineer working with @ganeshkumarashok to decrease VHD build times in our pipelines.

Can you review this PR and let me know what you think? I removed v0-3-0 and v0-5-1 in baker.go, baker_test.go, and the cse_install.sh file. This is meant to be a fix until the appropriate versions can be put in. Additionally, I am running the downloadContainerdWasmShims function in parallel now in order to increase speed. It passed in the VHD Build Test pipeline with no errors.

The way the function and configs are currently written, the function fails for 5 minutes while it attempts to download wws-shims for v0-3-0 and v0-5-1. Please let me know if you have an input or suggestions! Thanks.

Copy link
Member

@Mossaka Mossaka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a few comments.

@@ -16,7 +16,7 @@ SECURE_TLS_BOOTSTRAP_KUBELET_EXEC_PLUGIN_DOWNLOAD_DIR="/opt/azure/tlsbootstrap"
SECURE_TLS_BOOTSTRAP_KUBELET_EXEC_PLUGIN_VERSION="v0.1.0-alpha.2"
TELEPORTD_PLUGIN_DOWNLOAD_DIR="/opt/teleportd/downloads"
TELEPORTD_PLUGIN_BIN_DIR="/usr/local/bin"
CONTAINERD_WASM_VERSIONS="v0.3.0 v0.5.1 v0.8.0"
CONTAINERD_WASM_VERSIONS="v0.8.0"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we have a follow-up PR to upgrade the shim to the latest version?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want v0-8-0 switched to v0-11-1? Or do you want us to bump all versions up like below:

v0-3-0 ==> vX-X-X
v0-5-1 ==> vX-X-X
v0-8-0 ==> vX-X-X

If so, which versions would map to the older ones?

We are just looking to get the failing curl command to wws for v0-3-0 and v0-5-1 out of the code, as its adding 5 minutes to our build time due to retries.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the review

@@ -16,7 +16,7 @@ SECURE_TLS_BOOTSTRAP_KUBELET_EXEC_PLUGIN_DOWNLOAD_DIR="/opt/azure/tlsbootstrap"
SECURE_TLS_BOOTSTRAP_KUBELET_EXEC_PLUGIN_VERSION="v0.1.0-alpha.2"
TELEPORTD_PLUGIN_DOWNLOAD_DIR="/opt/teleportd/downloads"
TELEPORTD_PLUGIN_BIN_DIR="/usr/local/bin"
CONTAINERD_WASM_VERSIONS="v0.3.0 v0.5.1 v0.8.0"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have any users on the older versions of the shim, say 0.3.0? By removing the older versions, are we going to break their existing workflows? If so, should we be more cautious about deprecating older versions and probably notife customers beforehand?

I am fine with deprecation in general but I prefer putting a notification to customers first.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These older versions are getting a 404 on downloads during the VHD builds. Customers won't be affected/already must be affected in that case right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only case I can think of is users migrating from an older version working earlier to a new nodepool/VHD image that has the newer version. What would the expected behavior for that be, before this change (as that path existed, because the older version was not available for a while)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the notification to users, adding a note in the AKS release notes should be sufficient right?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The spin and slight are downloading successfully for v0-3-0 and v0-5-1, but not wws, which makes me think that wws isn't available for those.

Is it possible they are still using spin / slight for v0-3-0 and v0-5-1 without wws? @Mossaka

If not, then I don't think any customers would have been able to use those versions for quite some time.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible they are still using spin / slight for v0-3-0 and v0-5-1 without wws?

Yes, we should remove wws from v0.3.0 and v0.5.0 because it wasn't introduced for these versions

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe what you need is a condition logic saying that if the version if 0-3-0 or 0-5-0, don't install wws shim

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alrighty thanks!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Mossaka Good morning! I have implemented your suggestions, adding the v3 and v5 versions back in the baker.go/baker_test.go files and creating the conditional in cse_install. Would you mind giving this PR another look?

@ganeshkumarashok
Copy link
Contributor

Also @zachary-bailey, could you change the PR title to mention the deletion of the older version too, if you decide to keep both deletion and parallelization in the same PR?

@zachary-bailey zachary-bailey changed the title perf: Parallelize downloadContainerdWasmShims function perf: Parallelize downloadContainerdWasmShims function / delete older wasm shim versions Mar 13, 2024
@zachary-bailey zachary-bailey force-pushed the zacharybailey/downloadContainerdWasmShimsOptimization branch from fd7eb16 to 1fe6c04 Compare April 11, 2024 21:33
@zachary-bailey zachary-bailey merged commit 1211092 into master Apr 11, 2024
20 checks passed
@zachary-bailey zachary-bailey deleted the zacharybailey/downloadContainerdWasmShimsOptimization branch April 11, 2024 22:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants