[ONNXRuntime_CUDA] Rebuild v1.10.0 #10476

Merged

@stemann
Contributor

stemann commented Feb 7, 2025

No description provided.

@giordano
Member

giordano commented Feb 7, 2025

Can you try to figure out what the issue is instead of reopening multiple PRs?

@giordano
Member

giordano commented Feb 7, 2025

For example, if I were you I'd start by looking at where the message "Incomplete JLL release! Could not find tarball for" is coming from. Hint (I simply did a stupid search): https://github.com/JuliaPackaging/BinaryBuilder.jl/blob/edecf34fb8c7e8f7791b5e79116e701eb7a9274b/src/AutoBuild.jl#L1198. Can you think about under which conditions you get to that code path? And given those conditions, what's the likely cause for that to happen? And given that cause, how to address it?

@giordano
Member

giordano commented Feb 7, 2025

The point is that I'm not going to merge this PR without understanding why previous registrations have failed. The failure seems quite deterministic and reproducible, yet no one has explained what's going on or how to solve it, and previous attempts at "just retrying" clearly haven't worked, so I don't see why this time should be any different.

I honestly have no clue what's going on, nor do I have the time to look into this, but you'll probably be the most motivated person to investigate the issue, so it'd be most useful if you did this work.

@stemann
Contributor Author

stemann commented Feb 7, 2025

Sure - the reason I just opened a new PR was my assumption that the CI system was somehow removing the binaries between the last push and the merge, so that rebuilding and merging sooner would help.

@stemann
Contributor Author

stemann commented Feb 7, 2025

OK, I find it a little suspicious that the ordering of the platform tags cuda and cuda_platform in the triplet is not completely consistent:

  1. The PR pipeline builds for e.g. aarch64-linux-gnu-cxx03-cuda+10.2-cuda_platform+jetson, cf. Generating meta.json
  2. The PR pipeline's build for platform aarch64-linux-gnu-cxx03-cuda+10.2-cuda_platform+jetson, however, reported Building for aarch64-linux-gnu-cxx03-cuda_platform+jetson-cuda+10.2 and generated the artifact ONNXRuntime_CUDA.v1.10.0.aarch64-linux-gnu-cxx03-cuda_platform+jetson-cuda+10.2.tar.gz (note the reversed order of the cuda and cuda_platform tags)

[ Info: Cloning wrapper code repo from https://github.com/JuliaBinaryWrappers/ONNXRuntime_CUDA_jll.jl into /cache/julia-buildkite-plugin/depots/e2fd9734-29d8-45cd-b0eb-59f7104f3131/dev/ONNXRuntime_CUDA_jll
2025-02-07 16:43:40 INFO Searching for artifacts: "O/ONNXRuntime/ONNXRuntime_CUDA/products/ONNXRuntime_CUDA*.tar.gz"
2025-02-07 16:43:40 INFO Found 10 artifacts. Starting to download to: /tmp/jl_ZXTlSk
...
2025-02-07 16:43:41 INFO Successfully downloaded "O/ONNXRuntime/ONNXRuntime_CUDA/products/ONNXRuntime_CUDA.v1.10.0.aarch64-linux-gnu-cxx11-cuda_platform+jetson-cuda+10.2.tar.gz" 21 MiB
2025-02-07 16:43:41 INFO Successfully downloaded "O/ONNXRuntime/ONNXRuntime_CUDA/products/ONNXRuntime_CUDA.v1.10.0.aarch64-linux-gnu-cxx03-cuda_platform+jetson-cuda+10.2.tar.gz" 21 MiB
2025-02-07 16:43:41 INFO Successfully downloaded "O/ONNXRuntime/ONNXRuntime_CUDA/products/ONNXRuntime_CUDA.v1.10.0.x86_64-linux-gnu-cxx03-cuda+11.3.tar.gz" 42 MiB
2025-02-07 16:43:41 INFO Successfully downloaded "O/ONNXRuntime/ONNXRuntime_CUDA/products/ONNXRuntime_CUDA.v1.10.0.x86_64-linux-gnu-cxx11-cuda+11.3.tar.gz" 42 MiB
2025-02-07 16:43:42 INFO Successfully downloaded "O/ONNXRuntime/ONNXRuntime_CUDA/products/ONNXRuntime_CUDA.v1.10.0.x86_64-w64-mingw32-cuda+11.3.tar.gz" 136 MiB
ERROR: LoadError: Incomplete JLL release! Could not find tarball for aarch64-linux-gnu-cxx03-cuda+10.2-cuda_platform+jetson

@giordano
Member

giordano commented Feb 7, 2025

Ok, that's a decent lead. Parsing triplets into Platform objects doesn't care about the order of tags, but if there's textual matching somewhere, that can be a problem, because of course the strings wouldn't match.
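To make the distinction concrete, here is a minimal Python sketch of comparing the two triplets from the logs as raw text versus as parsed tags. The `parse_triplet` helper is hypothetical and much simpler than the real Platform parsing in Julia; it only assumes that extra tags appear as `name+value` fields separated by `-`.

```python
def parse_triplet(triplet):
    """Split a triplet into (base fields, tag dict).

    Hypothetical helper for illustration: every dash-separated field that
    contains '+' is treated as a name+value tag, everything else as part
    of the base platform description.
    """
    fields = triplet.split("-")
    base = tuple(f for f in fields if "+" not in f)
    tags = {}
    for f in fields:
        if "+" in f:
            name, value = f.split("+", 1)
            tags[name] = value
    return base, tags


# The two spellings seen in the PR pipeline and in the tarball name.
expected = "aarch64-linux-gnu-cxx03-cuda+10.2-cuda_platform+jetson"
built = "aarch64-linux-gnu-cxx03-cuda_platform+jetson-cuda+10.2"

# Text comparison is sensitive to tag order.
same_text = expected == built

# Comparing parsed tags is not, since dict equality ignores insertion order.
same_platform = parse_triplet(expected) == parse_triplet(built)

print(same_text, same_platform)
```

Under these assumptions the two strings describe the same platform once parsed, even though they differ as text.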

@giordano
Member

giordano commented Feb 7, 2025

I guess the next step is to find out why/where the platform tags are different and why that matters in the registration pipeline.

@stemann
Contributor Author

stemann commented Feb 7, 2025

I believe the latter (the triplet string matching in the registration pipeline) is due to the occursin(".$(triplet(platform)).tar", f), where f is the tarball filename, in https://github.com/JuliaPackaging/BinaryBuilder.jl/blob/edecf34fb8c7e8f7791b5e79116e701eb7a9274b/src/AutoBuild.jl#L1175

Trying out BB-deploy to my own GitHub repo to pinpoint the other issues...
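The failure mode can be reproduced outside the pipeline. The Python sketch below mimics the substring test with the tarball basenames from the log above, then shows one possible order-insensitive lookup. `canonical_triplet`, `triplet_of`, and `find_tarball` are hypothetical helpers written for this illustration; they are not BinaryBuilder.jl functions and are not meant to describe what JuliaPackaging/BinaryBuilder.jl#1363 changes.

```python
PREFIX = "ONNXRuntime_CUDA.v1.10.0."
SUFFIX = ".tar.gz"

# Basenames of the downloaded tarballs listed in the log.
tarballs = [
    PREFIX + "aarch64-linux-gnu-cxx11-cuda_platform+jetson-cuda+10.2" + SUFFIX,
    PREFIX + "aarch64-linux-gnu-cxx03-cuda_platform+jetson-cuda+10.2" + SUFFIX,
    PREFIX + "x86_64-linux-gnu-cxx03-cuda+11.3" + SUFFIX,
    PREFIX + "x86_64-linux-gnu-cxx11-cuda+11.3" + SUFFIX,
    PREFIX + "x86_64-w64-mingw32-cuda+11.3" + SUFFIX,
]


def tarball_matches(filename, triplet):
    """Substring test analogous to occursin(".$(triplet(platform)).tar", f)."""
    return f".{triplet}.tar" in filename


def canonical_triplet(triplet):
    """Hypothetical normalisation: keep base fields, sort the name+value tags."""
    fields = triplet.split("-")
    base = [f for f in fields if "+" not in f]
    tags = sorted(f for f in fields if "+" in f)
    return "-".join(base + tags)


def triplet_of(filename):
    """Extract the triplet from a tarball basename, or None if it does not fit."""
    if not (filename.startswith(PREFIX) and filename.endswith(SUFFIX)):
        return None
    return filename[len(PREFIX):-len(SUFFIX)]


def find_tarball(filenames, triplet):
    """Return the first tarball whose canonicalised triplet equals the wanted one."""
    wanted = canonical_triplet(triplet)
    for f in filenames:
        t = triplet_of(f)
        if t is not None and canonical_triplet(t) == wanted:
            return f
    return None


wanted = "aarch64-linux-gnu-cxx03-cuda+10.2-cuda_platform+jetson"

# Plain substring matching finds nothing, mirroring "Incomplete JLL release!".
substring_hits = [f for f in tarballs if tarball_matches(f, wanted)]

# Comparing canonicalised triplets finds the cxx03 jetson tarball.
found = find_tarball(tarballs, wanted)

print(substring_hits)
print(found)
```

Sorting the tags on both sides is only one option; parsing both strings into platform objects and comparing those would serve the same purpose.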

@stemann
Contributor Author

stemann commented Feb 10, 2025

@giordano Can we tag a new patch version of BB with JuliaPackaging/BinaryBuilder.jl#1363 ?

@giordano
Member

#10498

@stemann stemann marked this pull request as ready for review February 10, 2025 16:34
@giordano giordano enabled auto-merge (squash) February 10, 2025 17:02
@giordano giordano merged commit 081828f into JuliaPackaging:master Feb 10, 2025
5 checks passed
@giordano
Member

giordano commented Feb 10, 2025

You marked this as ready for review before the other PR was merged, right? That wasn't wise: now it has to build everything from scratch.

@stemann
Contributor Author

stemann commented Feb 10, 2025

Ah - OK, I didn't know there were timing issues to be aware of in that case.

@giordano
Member

That's documented: https://github.com/JuliaPackaging/Yggdrasil/blob/d0298176a7c9d6af659ecb18fd534da14442654c/CONTRIBUTING.md#understanding-build-cache-on-yggdrasil. Changing the manifest completely invalidates the cache that we use to avoid rebuilding on master.

@stemann stemann deleted the feature/onnxruntime_cuda_v1.10.0 branch February 10, 2025 17:48
@giordano
Member

Anyway, this time it worked, thanks for fixing it! However, I'm confused by https://buildkite.com/julialang/yggdrasil/builds/17508#0194f0dd-354a-46a6-94c0-7783debf209f/574-588

┌ Warning: Tarball filename does not match expected pattern: O

@stemann
Contributor Author

stemann commented Feb 10, 2025

OK.

Thanks for pointing me in the right direction 😊

Yeah, I noticed those too - I guess somehow we are feeding a bit more than expected to the filter function.
