-
Notifications
You must be signed in to change notification settings - Fork 212
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
perf: Parallelize downloadContainerdWasmShims function / fix download error bug #4126
perf: Parallelize downloadContainerdWasmShims function / fix download error bug #4126
Conversation
Pull Request Test Coverage Report for Build 8653558851Details
💛 - Coveralls |
fi | ||
done | ||
wait ${pids[@]} | ||
for shim_version in $CONTAINERD_WASM_VERSIONS; do | ||
chmod 755 "$containerd_wasm_filepath/containerd-shim-wws-${binary_version}-v1" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
out of curiosity, why can't this part be parallelized too? also i think you want to redeclare binary_version
in the for loop here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lilypan26 Hey! Thank you for the tip on binary_version. I initially took it out because I read the variable would still be in scope, but didn't take into account that its value needs to change as the for loop iterates so it can set permissions on each version.
For the permissions part, the first two downloads (spin / slight) are running sequentially and complete in about a second, so they will definitely be downloaded by the time the program reaches the code that sets permissions for their directories. For the last permission, I had to ensure that the 3 background processes (downloading shim_wws x 3) were complete before reaching that code, otherwise it could attempt to set permissions on a directory that doesn't exist on the VHD yet. I currently only have wws parallelized because downloading the first two and setting their permissions only takes about a second for each binary version.
If you think it makes more sense to parallelize everything for speed or stylistic reasons, please let me know! And thanks again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah i see, then yea for safety we should only access the directory after we are sure it is created.
Although for spin and slight, even if they are much faster than wws i think it wouldn't hurt to parallelize those are well and then move all of the chmods to the for loop afterwards. This might be more consistent and readable, but up to you. If not, i would at least add a comment where you are downloading wws and explain that you are parallelizing wws download and only changing the permissions after you are sure the file is there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lilypan26 I implemented your suggestions, thanks!
That being said, I noticed a flakey little error that has been happening for some time now (before my changes) regarding setting permissions. Im looking into it then will submit another PR.
991b2e8
to
f27b93e
Compare
if [[ $(isARM64) == 1 ]]; then | ||
containerd_wasm_url="https://acs-mirror.azureedge.net/containerd-wasm-shims/${shim_version}/linux/arm64" | ||
fi | ||
|
||
if [ ! -f "$containerd_wasm_filepath/containerd-shim-spin-${shim_version}" ] || [ ! -f "$containerd_wasm_filepath/containerd-shim-slight-${shim_version}" ]; then | ||
retrycmd_if_failure 30 5 60 curl -fSLv -o "$containerd_wasm_filepath/containerd-shim-spin-${binary_version}-v1" "$containerd_wasm_url/containerd-shim-spin-v1" 2>&1 | tee $CURL_OUTPUT >/dev/null | grep -E "^(curl:.*)|([eE]rr.*)$" && (cat $CURL_OUTPUT && exit $ERR_KRUSTLET_DOWNLOAD_TIMEOUT) | ||
retrycmd_if_failure 30 5 60 curl -fSLv -o "$containerd_wasm_filepath/containerd-shim-slight-${binary_version}-v1" "$containerd_wasm_url/containerd-shim-slight-v1" 2>&1 | tee $CURL_OUTPUT >/dev/null | grep -E "^(curl:.*)|([eE]rr.*)$" && (cat $CURL_OUTPUT && exit $ERR_KRUSTLET_DOWNLOAD_TIMEOUT) | ||
retrycmd_if_failure 30 5 60 curl -fSLv -o "$containerd_wasm_filepath/containerd-shim-wws-${binary_version}-v1" "$containerd_wasm_url/containerd-shim-wws-v1" 2>&1 | tee $CURL_OUTPUT >/dev/null | grep -E "^(curl:.*)|([eE]rr.*)$" && (cat $CURL_OUTPUT && exit $ERR_KRUSTLET_DOWNLOAD_TIMEOUT) | ||
if [ "$shim_version" = "v0.8.0" ]; then | ||
# Only download the wws shim for v0.8.0, wws shim for v0.3.0 and v0.5.1 fail |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we remove those two versions in a separate PR? And share it with @Mossaka for a review?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Separate PR for parallelization vs removal of component downloads that 404.
The latter might need a cleanup of the containerd config too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So it seems that in baker.go, it actually accounts for the fact that wws is not available for v0.3.0 and v0.5.1. It makes no mention of wws until v0.8.0.
Relevant code in lines 1219 - 1237 & 1331 - 1349
https://github.com/Azure/AgentBaker/blob/master/pkg/agent/baker.go
Could this mean that the bug is confined to the downloadContainerdWasmShims function in cse_install.sh, since it doesn't account for wws not being available on v0.3.0 and v0.5.1 in the for loop, but baker.go seems to for the config?
If so, I can change the for loop to treat v0.8.0 as a special case and download wws for it. Or I can delete the other two versions entirely as originally requested. The parallelization PR is in the VHD Build Test Pipeline. @ganeshkumarashok
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Deleting the versions that fail during the download would be cleaner as those versions will not be used
616c5b5
to
635fd14
Compare
@Mossaka Good morning, I am the engineer working with @ganeshkumarashok to decrease VHD build times in our pipelines. Can you review this PR and let me know what you think? I removed v0-3-0 and v0-5-1 in baker.go, baker_test.go, and the cse_install.sh file. This is meant to be a fix until the appropriate versions can be put in. Additionally, I am running the downloadContainerdWasmShims function in parallel now in order to increase speed. It passed in the VHD Build Test pipeline with no errors. The way the function and configs are currently written, the function fails for 5 minutes while it attempts to download wws-shims for v0-3-0 and v0-5-1. Please let me know if you have an input or suggestions! Thanks. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a few comments.
@@ -16,7 +16,7 @@ SECURE_TLS_BOOTSTRAP_KUBELET_EXEC_PLUGIN_DOWNLOAD_DIR="/opt/azure/tlsbootstrap" | |||
SECURE_TLS_BOOTSTRAP_KUBELET_EXEC_PLUGIN_VERSION="v0.1.0-alpha.2" | |||
TELEPORTD_PLUGIN_DOWNLOAD_DIR="/opt/teleportd/downloads" | |||
TELEPORTD_PLUGIN_BIN_DIR="/usr/local/bin" | |||
CONTAINERD_WASM_VERSIONS="v0.3.0 v0.5.1 v0.8.0" | |||
CONTAINERD_WASM_VERSIONS="v0.8.0" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we have a follow-up PR to upgrade the shim to the latest version?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you want v0-8-0 switched to v0-11-1? Or do you want us to bump all versions up like below:
v0-3-0 ==> vX-X-X
v0-5-1 ==> vX-X-X
v0-8-0 ==> vX-X-X
If so, which versions would map to the older ones?
We are just looking to get the failing curl command to wws for v0-3-0 and v0-5-1 out of the code, as its adding 5 minutes to our build time due to retries.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the review
@@ -16,7 +16,7 @@ SECURE_TLS_BOOTSTRAP_KUBELET_EXEC_PLUGIN_DOWNLOAD_DIR="/opt/azure/tlsbootstrap" | |||
SECURE_TLS_BOOTSTRAP_KUBELET_EXEC_PLUGIN_VERSION="v0.1.0-alpha.2" | |||
TELEPORTD_PLUGIN_DOWNLOAD_DIR="/opt/teleportd/downloads" | |||
TELEPORTD_PLUGIN_BIN_DIR="/usr/local/bin" | |||
CONTAINERD_WASM_VERSIONS="v0.3.0 v0.5.1 v0.8.0" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have any users on the older versions of the shim, say 0.3.0? By removing the older versions, are we going to break their existing workflows? If so, should we be more cautious about deprecating older versions and probably notife customers beforehand?
I am fine with deprecation in general but I prefer putting a notification to customers first.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These older versions are getting a 404 on downloads during the VHD builds. Customers won't be affected/already must be affected in that case right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only case I can think of is users migrating from an older version working earlier to a new nodepool/VHD image that has the newer version. What would the expected behavior for that be, before this change (as that path existed, because the older version was not available for a while)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the notification to users, adding a note in the AKS release notes should be sufficient right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The spin and slight are downloading successfully for v0-3-0 and v0-5-1, but not wws, which makes me think that wws isn't available for those.
Is it possible they are still using spin / slight for v0-3-0 and v0-5-1 without wws? @Mossaka
If not, then I don't think any customers would have been able to use those versions for quite some time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible they are still using spin / slight for v0-3-0 and v0-5-1 without wws?
Yes, we should remove wws from v0.3.0 and v0.5.0 because it wasn't introduced for these versions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe what you need is a condition logic saying that if the version if 0-3-0 or 0-5-0, don't install wws shim
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alrighty thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Mossaka Good morning! I have implemented your suggestions, adding the v3 and v5 versions back in the baker.go/baker_test.go files and creating the conditional in cse_install. Would you mind giving this PR another look?
Also @zachary-bailey, could you change the PR title to mention the deletion of the older version too, if you decide to keep both deletion and parallelization in the same PR? |
fd7eb16
to
1fe6c04
Compare
What type of PR is this?
/kind performance
What this PR does / why we need it:
This PR changes the execution of downloadContainerdWasmShims to make use of parallel processing. It increases the speed of the function by 5 minutes on average.
Which issue(s) this PR fixes:
This is one of a variety of incremental performance improvements that will ultimately increase the speed of VHD Build times.
Requirements: