Redesign CI #9

richiejp · 2024-04-02T14:37:42Z

There are a number of issues to break out here.

GPU testing
Speed up pulling containers
New control plane?

These may or may not share a single solution; that needs to be investigated.

richiejp · 2024-04-03T06:16:11Z

Using GitHub's GPU beta runners, the cost is $0.07 per minute ($4.20 per hour). Our e2e tests used to take 40 minutes, so that is $2.80 per job. These machines are also based on Tesla T4s, which are quite old now, although that is perhaps a good thing. Other CI providers have similar or greater costs. For comparison, an A16 machine on Vultr is $0.50 per hour, although we may have to manage creating and destroying instances ourselves.

That 40-minute figure is for a CPU-only run as well; once we bring all the GPU dependencies into the mix, a full run could take longer.

richiejp · 2024-04-03T07:52:02Z

V100 on DataCrunch (on demand) is $0.88 per hour
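
To put the prices quoted so far side by side, here is a rough sketch of the per-job arithmetic. It assumes the same 40-minute run on every provider, which is only a guess: the hardware differs, and the on-demand instances would also bill for boot and teardown time that is not counted here.

```python
# Hourly rates taken from the comments above (USD).
RATES_PER_HOUR = {
    "GitHub GPU runner": 0.07 * 60,  # quoted as $0.07 per minute
    "Vultr A16": 0.50,
    "DataCrunch V100": 0.88,
}

# Assumption: every provider needs the same 40 minutes as the old e2e run.
JOB_MINUTES = 40


def job_cost(rate_per_hour: float, minutes: float) -> float:
    """Cost of a single job billed pro rata by the minute."""
    return rate_per_hour * minutes / 60


for provider, rate in RATES_PER_HOUR.items():
    print(f"{provider}: ${job_cost(rate, JOB_MINUTES):.2f} per {JOB_MINUTES}-minute job")
```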

richiejp · 2024-04-03T15:37:48Z

One option would be to take a QEMU VM snapshot of a running K3s cluster with the NVIDIA operator installed, then load that snapshot at the start of each test and do the install, etc.

Problems:

Requires nested virtualization unless we are already running on bare metal
Host to guest communication could require some networking magic (in the past I have used MACVTAP/MACVLAN)

Pros:

Not provider-specific, apart from the virtualization and networking requirements
Could be done locally
No boot or installation time for the OS or K3s
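
On the nested virtualization point, a quick way to see whether a given Linux machine could host the VM is to check for a usable `/dev/kvm` and, on a host we control, the `nested` parameter of the KVM module. This is a minimal sketch that assumes a Linux runner with the standard `kvm_intel` or `kvm_amd` modules; the function names are made up for illustration.

```python
import os
from pathlib import Path


def kvm_available(dev_path: str = "/dev/kvm") -> bool:
    """True if the KVM device exists and the current user can open it."""
    return os.access(dev_path, os.R_OK | os.W_OK)


def nested_virt_enabled(sysfs_root: str = "/sys/module") -> bool:
    """True if the loaded KVM module allows guests to run their own VMs."""
    for module in ("kvm_intel", "kvm_amd"):
        param = Path(sysfs_root) / module / "parameters" / "nested"
        try:
            value = param.read_text().strip()
        except OSError:
            continue  # module not loaded on this machine
        if value in ("1", "Y", "y"):
            return True
    return False


if __name__ == "__main__":
    print("KVM usable here:", kvm_available())
    print("Nested virt enabled for guests:", nested_virt_enabled())
```

Inside a CI runner that is itself a VM, the first check is the one that matters; the second is only meaningful on the machine hosting that runner.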

richiejp · 2024-04-04T08:20:04Z

QEMU supports port forwarding (I never knew)
K3sup could be used for initial image creation, or for the whole setup if we can't snapshot
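
As a rough idea of what restoring a snapshot with port forwarding could look like, the sketch below only builds the QEMU command line and prints it. The image name `k3s.qcow2` and snapshot tag `k3s-ready` are hypothetical, and it assumes an x86_64 host with KVM, a qcow2 image holding an internal snapshot made with `savevm`, and the Kubernetes API on its default port 6443.

```python
import shlex


def qemu_restore_args(
    disk_image: str,
    snapshot_tag: str,
    host_port: int = 6443,
    guest_port: int = 6443,
    memory_mb: int = 4096,
) -> list[str]:
    """Build a QEMU command that resumes a saved VM and forwards one TCP port."""
    return [
        "qemu-system-x86_64",
        "-enable-kvm",  # assumption: KVM is available on the host
        "-m", str(memory_mb),
        "-drive", f"file={disk_image},format=qcow2,if=virtio",
        # User-mode networking: host 127.0.0.1:host_port -> guest:guest_port
        "-netdev", f"user,id=net0,hostfwd=tcp:127.0.0.1:{host_port}-:{guest_port}",
        "-device", "virtio-net-pci,netdev=net0",
        # Resume from the internal snapshot instead of booting the OS
        "-loadvm", snapshot_tag,
        "-display", "none",
    ]


if __name__ == "__main__":
    # Hypothetical image and snapshot names, for illustration only.
    print(shlex.join(qemu_restore_args("k3s.qcow2", "k3s-ready")))
```

The machine options (memory, devices) need to match the ones used when the snapshot was saved, otherwise the restore is likely to fail.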

richiejp · 2024-04-04T13:04:42Z

It seems there is only limited overlap between GPUs that support virtualization/passthrough and GPUs supported by the operator. It is pretty much limited to data center GPUs with passive cooling.

It's not workable at the moment see: #9

richiejp added this to the Post Open Source milestone Apr 2, 2024

richiejp mentioned this issue Apr 2, 2024

Convert to public CI #2

Closed

richiejp added the enhancement New feature or request label Apr 3, 2024

richiejp self-assigned this Apr 3, 2024

richiejp mentioned this issue Apr 5, 2024

Start of QEMU K3s based CI #13

Draft

5 tasks

richiejp added a commit that referenced this issue Apr 11, 2024

Remove e2e test for now

f52f865

It's not workable at the moment see: #9
