ci: use self-hosted runners #1258

Open: avik-pal wants to merge 4 commits into base: main from ap/self-hosted


Conversation

avik-pal
Collaborator

@avik-pal avik-pal commented May 5, 2025

No description provided.

@avik-pal
Collaborator Author

avik-pal commented May 5, 2025

if this works I will disable the buildkite runners for now

@avik-pal avik-pal force-pushed the ap/self-hosted branch 2 times, most recently from c12e96c to ae6675a Compare May 5, 2025 19:28
@avik-pal avik-pal requested a review from wsmoses May 5, 2025 19:32
@avik-pal avik-pal force-pushed the ap/self-hosted branch 2 times, most recently from 1ab278a to 530511c Compare May 5, 2025 19:41
@avik-pal
Collaborator Author

avik-pal commented May 5, 2025

clearly env vars were ignored...

@avik-pal
Collaborator Author

avik-pal commented May 5, 2025

ok the runners are fully functional now!

@wsmoses wsmoses requested a review from giordano May 5, 2025 23:39
@wsmoses
Member

wsmoses commented May 5, 2025

[2359728] signal 2: Interrupt
in expression starting at /home/wmoses/actions-runner/runner_7/_work/_temp/127d02ee-480f-471a-b54a-2a3f1a55d480:3
epoll_pwait at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
uv__io_poll at /workspace/srcdir/libuv/src/unix/linux.c:1404
uv_run at /workspace/srcdir/libuv/src/unix/core.c:430
ijl_task_get_next at /cache/build/tester-amdci5-12/julialang/julia-release-1-dot-11/src/scheduler.c:522
poptask at ./task.jl:1012
wait at ./task.jl:1021
#wait#731 at ./condition.jl:130
wait at ./condition.jl:125 [inlined]
wait at ./process.jl:694
wait at ./process.jl:687
unknown function (ip: 0x7fdd72df2f22)
subprocess_handler at /cache/build/tester-amdci5-12/julialang/julia-release-1-dot-11/usr/share/julia/stdlib/v1.11/Pkg/src/Operations.jl:2146
#131 at /cache/build/tester-amdci5-12/julialang/julia-release-1-dot-11/usr/share/julia/stdlib/v1.11/Pkg/src/Operations.jl:2086
withenv at ./env.jl:265
#118 at /cache/build/tester-amdci5-12/julialang/julia-release-1-dot-11/usr/share/julia/stdlib/v1.11/Pkg/src/Operations.jl:1935
with_temp_env at /cache/build/tester-amdci5-12/julialang/julia-release-1-dot-11/usr/share/julia/stdlib/v1.11/Pkg/src/Operations.jl:1793
#116 at /cache/build/tester-amdci5-12/julialang/julia-release-1-dot-11/usr/share/julia/stdlib/v1.11/Pkg/src/Operations.jl:1902
#mktempdir#28 at ./file.jl:819
unknown function (ip: 0x7fdd72dd5dfd)
mktempdir at ./file.jl:815
mktempdir at ./file.jl:815 [inlined]
#sandbox#115 at /cache/build/tester-amdci5-12/julialang/julia-release-1-dot-11/usr/share/julia/stdlib/v1.11/Pkg/src/Operations.jl:1849 [inlined]
sandbox at /cache/build/tester-amdci5-12/julialang/julia-release-1-dot-11/usr/share/julia/stdlib/v1.11/Pkg/src/Operations.jl:1841
unknown function (ip: 0x7fdd72dc1ac6)
#test#128 at /cache/build/tester-amdci5-12/julialang/julia-release-1-dot-11/usr/share/julia/stdlib/v1.11/Pkg/src/Operations.jl:2067
test at /cache/build/tester-amdci5-12/julialang/julia-release-1-dot-11/usr/share/julia/stdlib/v1.11/Pkg/src/Operations.jl:2011 [inlined]
#test#146 at /cache/build/tester-amdci5-12/julialang/julia-release-1-dot-11/usr/share/julia/stdlib/v1.11/Pkg/src/API.jl:481
test at /cache/build/tester-amdci5-12/julialang/julia-release-1-dot-11/usr/share/julia/stdlib/v1.11/Pkg/src/API.jl:460
unknown function (ip: 0x7fdd72d6405d)
#test#77 at /cache/build/tester-amdci5-12/julialang/julia-release-1-dot-11/usr/share/julia/stdlib/v1.11/Pkg/src/API.jl:159
unknown function (ip: 0x7fdd72d5717d)
test at /cache/build/tester-amdci5-12/julialang/julia-release-1-dot-11/usr/share/julia/stdlib/v1.11/Pkg/src/API.jl:148
#test#79 at /cache/build/tester-amdci5-12/julialang/julia-release-1-dot-11/usr/share/julia/stdlib/v1.11/Pkg/src/API.jl:174
test at /cache/build/tester-amdci5-12/julialang/julia-release-1-dot-11/usr/share/julia/stdlib/v1.11/Pkg/src/API.jl:165
unknown function (ip: 0x7fdd792f5b72)
jl_apply at /cache/build/tester-amdci5-12/julialang/julia-release-1-dot-11/src/julia.h:2157 [inlined]
do_call at /cache/build/tester-amdci5-12/julialang/julia-release-1-dot-11/src/interpreter.c:126
eval_value at /cache/build/tester-amdci5-12/julialang/julia-release-1-dot-11/src/interpreter.c:223
eval_stmt_value at /cache/build/tester-amdci5-12/julialang/julia-release-1-dot-11/src/interpreter.c:174 [inlined]
eval_body at /cache/build/tester-amdci5-12/julialang/julia-release-1-dot-11/src/interpreter.c:666
jl_interpret_toplevel_thunk at /cache/build/tester-amdci5-12/julialang/julia-release-1-dot-11/src/interpreter.c:824
jl_toplevel_eval_flex at /cache/build/tester-amdci5-12/julialang/julia-release-1-dot-11/src/toplevel.c:943
jl_toplevel_eval_flex at /cache/build/tester-amdci5-12/julialang/julia-release-1-dot-11/src/toplevel.c:886
ijl_toplevel_eval_in at /cache/build/tester-amdci5-12/julialang/julia-release-1-dot-11/src/toplevel.c:994
eval at ./boot.jl:430 [inlined]
include_string at ./loading.jl:2734
_include at ./loading.jl:2794
include at ./Base.jl:557
jfptr_include_46879.1 at /home/wmoses/actions-runner/runner_7/_work/_tool/julia/1.11.5/x64/lib/julia/sys.so (unknown line)
exec_options at ./client.jl:323
_start at ./client.jl:531
jfptr__start_73430.1 at /home/wmoses/actions-runner/runner_7/_work/_tool/julia/1.11.5/x64/lib/julia/sys.so (unknown line)
jl_apply at /cache/build/tester-amdci5-12/julialang/julia-release-1-dot-11/src/julia.h:2157 [inlined]
true_main at /cache/build/tester-amdci5-12/julialang/julia-release-1-dot-11/src/jlapi.c:900
jl_repl_entrypoint at /cache/build/tester-amdci5-12/julialang/julia-release-1-dot-11/src/jlapi.c:1059
main at /cache/build/tester-amdci5-12/julialang/julia-release-1-dot-11/cli/loader_exe.c:58
__libc_start_main at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
unknown function (ip: 0x4010b8)
unknown function (ip: (nil))
Allocations: 18978437 (Pool: 18971320; Big: 7117); GC: 38
val already in a list
atexit hook threw an error: ErrorException("schedule: Task not runnable")
error at ./error.jl:35
#schedule#761 at ./task.jl:884
schedule at ./task.jl:876 [inlined]
uv_writecb_task at ./stream.jl:1200
jfptr_uv_writecb_task_66985.1 at /home/wmoses/actions-runner/runner_7/_work/_tool/julia/1.11.5/x64/lib/julia/sys.so (unknown line)
jlcapi_uv_writecb_task_67448.1 at /home/wmoses/actions-runner/runner_7/_work/_tool/julia/1.11.5/x64/lib/julia/sys.so (unknown line)
uv__write_callbacks at /workspace/srcdir/libuv/src/unix/stream.c:926
uv__stream_io at /workspace/srcdir/libuv/src/unix/stream.c:1227
uv__run_pending at /workspace/srcdir/libuv/src/unix/core.c:824
uv_run at /workspace/srcdir/libuv/src/unix/core.c:420
ijl_task_get_next at /cache/build/tester-amdci5-12/julialang/julia-release-1-dot-11/src/scheduler.c:522
poptask at ./task.jl:1012
wait at ./task.jl:1021
uv_write at ./stream.jl:1081
unsafe_write at ./stream.jl:1154
unsafe_write at ./io.jl:452 [inlined]
write at ./strings/io.jl:248 [inlined]
print at ./strings/io.jl:250
unknown function (ip: 0x7fdd72df3566)
showerror at ./errorshow.jl:152
unknown function (ip: 0x7fdd72df34f6)
_atexit at ./initdefs.jl:462
2025-05-05 23:21:35.532624: E external/xla/xla/service/slow_operation_alarm.cc:140] The operation took 2.101807905s

we really need to find and fix this

@giordano
Member

giordano commented May 5, 2025

That's simply hanging and the job is being terminated because it timed out.
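
For context on why a hang shows up as `signal 2: Interrupt`: as far as I know, when a job is cancelled or runs past its time limit, the Actions runner interrupts the job's process before killing it, which matches the trace above. A minimal sketch of bounding a hanging job with `timeout-minutes` follows; the runner labels, the step list, and both timeout values are illustrative assumptions, not taken from this PR.

```yaml
jobs:
  test-cuda:                        # job name from this PR's diff; the body is illustrative
    runs-on: [self-hosted, linux]   # assumed runner labels
    timeout-minutes: 60             # assumed value; GitHub's default job limit is 360
    steps:
      - uses: actions/checkout@v4
      - uses: julia-actions/setup-julia@v2
        with:
          version: "1.11"
      - uses: julia-actions/julia-runtest@v1
        timeout-minutes: 45         # optional per-step limit (assumed value)
```

A tighter limit does not fix the hang itself, but it frees the self-hosted runner sooner and makes the failure show up faster.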

JULIA_PKG_SERVER_REGISTRY_PREFERENCE: eager
ENABLE_PJRT_COMPATIBILITY: 1
REACTANT_TEST_GROUP: ${{ matrix.test_group }}
XLA_FLAGS: "--xla_force_host_platform_device_count=12"
Member

Do you need this? It wasn't set in the buildkite pipeline, as far as I can tell.

@@ -160,3 +160,86 @@ jobs:
      - uses: codecov/codecov-action@v5
        with:
          files: lcov.info

  test-cuda:
Member

Overall I'm not a huge fan of repeating the same workflow all over the place; a reusable workflow like what we have in the GB-25 repo would reduce duplication quite a lot. But this can be done in a follow-up. I guess for now it's more important to make sure the workflow actually works.

Collaborator Author

I have created a common reusable workflow now
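
For readers unfamiliar with the pattern discussed in this thread, here is a minimal sketch of a reusable workflow and a caller. It is not the workflow added in this PR: the file name `CommonCI.yml`, the input names, the runner label, and the test group names are hypothetical. Only `REACTANT_TEST_GROUP` and `matrix.test_group` come from the diff above.

```yaml
# .github/workflows/CommonCI.yml (hypothetical file name): the reusable part
name: Common CI
on:
  workflow_call:
    inputs:
      runner:                 # hypothetical input: label of the runner to use
        type: string
        required: true
      test_group:             # hypothetical input: forwarded to the test suite
        type: string
        required: true
jobs:
  test:
    runs-on: ${{ inputs.runner }}
    steps:
      - uses: actions/checkout@v4
      - uses: julia-actions/setup-julia@v2
        with:
          version: "1.11"
      - uses: julia-actions/julia-buildpkg@v1
      - uses: julia-actions/julia-runtest@v1
        env:
          REACTANT_TEST_GROUP: ${{ inputs.test_group }}
---
# .github/workflows/CI.yml (hypothetical caller): each job shrinks to a `uses:` call
name: CI
on: [push, pull_request]
jobs:
  test-cuda:
    strategy:
      matrix:
        test_group: [core, integration]   # assumed group names
    uses: ./.github/workflows/CommonCI.yml
    with:
      runner: self-hosted                  # assumed runner label
      test_group: ${{ matrix.test_group }}
```

With this layout, adding another backend is one more caller job with different `with:` values rather than a copy of the whole step list.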

@avik-pal
Collaborator Author

avik-pal commented May 6, 2025

seems like XLA is being super slow for some reason

@giordano
Member

giordano commented May 6, 2025

Bus error on dlopen sounds fun:

[2887384] signal 7 (2): Bus error
in expression starting at /home/wmoses/.julia/packages/CUDA_Runtime_jll/iU7vK/.pkg/platform_augmentation.jl:95
unknown function (ip: 0x7f52d640a21f)
unknown function (ip: 0x7f52d63f151e)
unknown function (ip: 0x7f52d63f3274)
unknown function (ip: 0x7f52d63fdd41)
_dl_catch_exception at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
unknown function (ip: 0x7f52d63fd8f9)
unknown function (ip: 0x7f52d63d3257)
_dl_catch_exception at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
_dl_catch_error at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
unknown function (ip: 0x7f52d63d3a64)
dlopen at /lib/x86_64-linux-gnu/libdl.so.2 (unknown line)
ijl_load_dynamic_library at /cache/build/tester-amdci5-12/julialang/julia-release-1-dot-11/src/dlload.c:365
#dlopen#3 at ./libdl.jl:120
jfptr_YY.dlopenYY.3_50348.1 at /home/wmoses/actions-runner/runner_0/_work/_tool/julia/1.11.5/x64/lib/julia/sys.so (unknown line)
