Parallelize the provision step #2468

Merged · 1 commit merged into main on Jan 5, 2024

Conversation


@happz happz commented Nov 9, 2023

This change requires refactoring the current queue internals to better support a third kind of parallelization & queueing. A custom queue task is needed to run provisioning in parallel, while the step itself takes care of ordering the provision phases and actions.
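
For illustration, a minimal sketch of that queueing idea, assuming a plain thread pool in place of tmt's real queue and a hypothetical provision_guest placeholder for a single provision phase (this is not tmt's actual internals):

from concurrent.futures import ThreadPoolExecutor, as_completed

def provision_guest(name: str) -> str:
    # Hypothetical placeholder for provisioning one guest (e.g. booting a VM).
    return f"{name}: provisioned"

def run_provision_task(guest_names: list[str]) -> list[str]:
    # One queue task provisions all guests of the step in parallel;
    # ordering of the surrounding provision phases stays with the step.
    results = []
    with ThreadPoolExecutor(max_workers=len(guest_names)) as pool:
        futures = [pool.submit(provision_guest, name) for name in guest_names]
        for future in as_completed(futures):
            results.append(future.result())
    return results

print(run_provision_task(["default-0", "default-1", "default-2"]))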

Implements #2244

Pull Request Checklist

  • implement the feature
  • include a release note

@happz happz added the step | provision and area | multihost labels Nov 9, 2023
@happz happz added this to the 1.30 milestone Nov 9, 2023
@happz happz force-pushed the parallelize-provision branch 2 times, most recently from 3c87330 to 72c73fd on November 14, 2023 14:02
@happz happz marked this pull request as ready for review November 14, 2023 14:02
@lukaszachy
Collaborator

Hm, I've tried with virtual and it doesn't work well.

/test:
    test: echo
    
/plan:
    execute:
        how: tmt
    discover:
        how: fmf
        
    provision:
        - how: virtual
        - how: virtual
        - how: virtual
$ tmt run --until report
/var/tmp/tmt/run-117

/p/plan
    discover
        how: fmf
        directory: /tmp/r_173145_H6I
        summary: 1 test selected
    provision
        queued provision task #1: default-0, default-1 and default-2
        
        provision task #1: default-0, default-1 and default-2
[default-0]         started
[default-0]         how: virtual
[default-1]         started
[default-0]         memory: 2048 megabyte
[default-0]         disk: 40 gigabyte
[default-1]         how: virtual
[default-2]         started
[default-1]         memory: 2048 megabyte
[default-1]         disk: 40 gigabyte
[default-2]         how: virtual
[default-2]         memory: 2048 megabyte
[default-2]         disk: 40 gigabyte
[default-0]         progress: booting...
[default-1]         progress: booting...
[default-2]         progress: booting...
libvirt: QEMU Driver error : internal error: QEMU unexpectedly closed the monitor (vm='tmt-117-idTZSMZI'): 2023-11-21T16:35:32.338285Z qemu-system-x86_64: -netdev user,id=testcloud_net.10042,hostfwd=tcp::10042-:22: Could not set up host forwarding rule 'tcp::10042-:22'
Instance startup failed, retrying in 5 seconds...
libvirt: QEMU Driver error : internal error: QEMU unexpectedly closed the monitor (vm='tmt-117-idTZSMZI'): 2023-11-21T16:35:37.575292Z qemu-system-x86_64: -netdev user,id=testcloud_net.10042,hostfwd=tcp::10042-:22: Could not set up host forwarding rule 'tcp::10042-:22'
[default-0]         finished
[default-0]         fail: Failed to boot testcloud instance (internal error: QEMU unexpectedly closed the monitor (vm='tmt-117-idTZSMZI'): 2023-11-21T16:35:37.575292Z qemu-system-x86_64: -netdev user,id=testcloud_net.10042,hostfwd=tcp::10042-:22: Could not set up host forwarding rule 'tcp::10042-:22').
[default-2]         finished
[default-2]         multihost name: default-2
[default-2]         arch: x86_64
[default-2]         distro: Fedora Linux 39 (Cloud Edition)
[default-1]         finished
[default-1]         multihost name: default-1
[default-1]         arch: x86_64
[default-1]         distro: Fedora Linux 39 (Cloud Edition)

plan failed

The exception was caused by 1 earlier exceptions

Cause number 1:

    provision step failed

    The exception was caused by 1 earlier exceptions

    Cause number 1:

        Failed to boot testcloud instance (internal error: QEMU unexpectedly closed the monitor (vm='tmt-117-idTZSMZI'): 2023-11-21T16:35:37.575292Z qemu-system-x86_64: -netdev user,id=testcloud_net.10042,hostfwd=tcp::10042-:22: Could not set up host forwarding rule 'tcp::10042-:22').

There should be 3 machines running but virsh shows that one of them is paused.

@happz
Collaborator Author

happz commented Nov 21, 2023

Hm, I've tried with virtual and it doesn't work well.

...

There should be 3 machines running but virsh shows that one of them is paused.

Worked like a charm on my laptop :/ Can you check whether there's some libvirt/qemu/testcloud logging related to this? I'm not versed in Qemu things, but it looks like some kind of conflict. @frantisekz any idea what we might be hitting?

@lukaszachy
Collaborator

@happz I'd say https://pagure.io/testcloud/blob/master/f/testcloud/util.py#_60 is not ready for parallel requests.

@frantisekz
Collaborator

I'll take a look early December (currently on PTO without laptop).

@happz
Collaborator Author

happz commented Nov 22, 2023

@happz I'd say https://pagure.io/testcloud/blob/master/f/testcloud/util.py#_60 is not ready for parallel requests.

Right, that could easily be the case.

Can you try once more with the following patch? It should prevent concurrent calls into the testcloud library:

Never mind, added it as a commit to this PR.
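
(A minimal sketch of the serialization idea, assuming a module-level lock; the boot_instance wrapper and the instance.boot() call are hypothetical, not the testcloud API:)

import threading

# One lock shared by all provisioning threads in the process.
TESTCLOUD_LOCK = threading.Lock()

def boot_instance(instance) -> None:
    # Hypothetical wrapper: serialize every call into the non-thread-safe
    # library so that, e.g., port allocation cannot race between guests.
    with TESTCLOUD_LOCK:
        instance.boot()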

@psss psss changed the title Parallelize provisioning Parallelize the provision step Nov 22, 2023
@qcheng-redhat
Contributor

I tried the latest patch with the libvirt provision plugin and it works. I got three guests provisioned in parallel.

My plan:

execute:
    how: tmt

/plan:
    provision:
      - how: libvirt
        guestargs:
            --cpu host-model
            --ram=4096
            --vcpus=2
            --graphics vnc,listen=0.0.0.0
        distroname: xxxxxx
      - how: libvirt
        guestargs:
            --cpu host-model
            --ram=4096
            --vcpus=2
            --graphics vnc,listen=0.0.0.0
        distroname: xxxxxx
      - how: libvirt
        guestargs:
            --cpu host-model
            --ram=4096
            --vcpus=2
            --graphics vnc,listen=0.0.0.0
        distroname: xxxxxx
    discover:
        how: fmf

log:
04:47:17 queued provision task #1: default-0, default-1 and default-2
04:47:17
04:47:17 provision task #1: default-0, default-1 and default-2
04:47:17 [default-0] started
04:47:17 [default-0] how: libvirt
04:47:17 [default-1] started
04:47:17 [default-0] order: 50
04:47:17 [default-1] how: libvirt
04:47:17 [default-2] started
04:47:17 [default-1] order: 50
04:47:17 [default-2] how: libvirt
04:47:17 [default-2] order: 50
...
04:54:53
04:54:53 summary: 3 guests provisioned

@happz
Collaborator Author

happz commented Nov 24, 2023

I tried the latest patch with the libvirt provision plugin and it works. I got three guests provisioned in parallel.

Glad to hear that, thanks!


On another note: please consider supporting the hardware field, https://tmt.readthedocs.io/en/stable/spec/hardware.html, at least for the already supported requirements, cpu and memory. Plugins shipped with tmt would express part of the guestargs field from your example like this:

hardware:
  cpu:
    cores: 2
  memory: 4096 MB

I know it's probably of no immediate benefit to you now, and I would fully understand the hesitation or a focus on features your use cases need more, but in the long run it would help end-users if plugins spoke the same language. The field is supported by tmt's own plugins and by Testing Farm, and we're slowly extending the requirements supported by individual plugins. I suppose your plugin is very similar to virtual in the tmt repo, so it would transform the requirements from the hardware field into options; https://github.com/teemtee/tmt/blob/main/tmt/steps/provision/testcloud.py#L489 demonstrates how we did this for the virtual plugin. I'd be happy to provide any guidance if you decide to extend your plugin.
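
(A hedged sketch of such a transformation, with a plain dict standing in for tmt's parsed hardware object; the option mapping is an assumption for illustration, not taken from any real plugin:)

def hardware_to_guestargs(hardware: dict) -> list[str]:
    # Map generic hardware requirements onto virt-install style options.
    args = []
    cores = hardware.get("cpu", {}).get("cores")
    if cores is not None:
        args.append(f"--vcpus={cores}")
    memory = hardware.get("memory")  # e.g. "4096 MB"
    if memory is not None:
        args.append(f"--ram={memory.split()[0]}")
    return args

print(hardware_to_guestargs({"cpu": {"cores": 2}, "memory": "4096 MB"}))
# ['--vcpus=2', '--ram=4096']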

@qcheng-redhat
Contributor


Hi @happz,

Thanks for the suggestion of using the TMT hardware specification and for sharing the virtual plugin example. Let's discuss this topic offline.

Thanks,
Qinghua

@psss psss linked an issue Nov 28, 2023 that may be closed by this pull request
@psss
Collaborator

psss commented Nov 28, 2023

I've tried the parallel provision with the /tests/multihost/web test several times and it always failed with:

Failed to connect in 60s.

Twice it was just a single box which failed and once even two of them. @happz, have you seen anything like this?

@happz
Collaborator Author

happz commented Nov 28, 2023

I've tried the parallel provision with the /tests/multihost/web test several times and it always failed with:

Failed to connect in 60s.

Twice it was just a single box which failed and once even two of them. @happz, have you seen anything like this?

Nope, I haven't seen this error.

Well, I'm afraid we should push this one out into 1.31. The PR is changing the whole step, you're hitting issues I haven't seen, there's barely any review, and it's Nov 28th :/

@psss psss modified the milestones: 1.30, 1.31 Nov 28, 2023
@happz
Collaborator Author

happz commented Dec 5, 2023

/packit test --identifier full

@psss
Collaborator

psss commented Jan 4, 2024

I've tried the parallel provision with the /tests/multihost/web test several times and it always failed with:

Failed to connect in 60s.

Twice it was just a single box which failed and once even two of them. @happz, have you seen anything like this?

Nope, I haven't seen this error.

Interesting. Tried today again and encountered similar behaviour. Seems that the problem happens randomly. Ran four times:

  • httpd-server failed to connect in 60s.
  • httpd-server failed to connect in 60s.
  • curl-client failed to connect in 60s.
  • everything fine & tests passed

Looking through the log I see:

err: Warning: Permanently added '192.168.124.110' (ED25519) to the list of known hosts.
err: ssh_askpass: exec(/usr/libexec/openssh/ssh-askpass): No such file or directory
err: Permission denied, please try again.
err: ssh_askpass: exec(/usr/libexec/openssh/ssh-askpass): No such file or directory
err: Permission denied, please try again.
err: ssh_askpass: exec(/usr/libexec/openssh/ssh-askpass): No such file or directory
err: Received disconnect from 192.168.124.110 port 22:2: Too many authentication failures
err: Disconnected from 192.168.124.110 port 22

Which is weird because the command line is the same as for the other guests:

Run command: ssh -oForwardX11=no -oStrictHostKeyChecking=no -oUserKnownHostsFile=/dev/null -oServerAliveInterval=60 -oServerAliveCountMax=5 -oIdentitiesOnly=yes -p22 -i /var/tmp/tmt/run-024/plan/provision/curl-client/id_ecdsa -S/run/user/12559/tmt/tmpyjjy9wtt [email protected] 'export TMT_PLAN_DATA=/var/tmp/tmt/run-024/plan/data; export TMT_PLAN_ENVIRONMENT_FILE=/var/tmp/tmt/run-024/plan/data/variables.env; export TMT_TREE=/var/tmp/tmt/run-024/plan/tree; export TMT_VERSION=1.30.0.dev64+g9df4fee9.d20240103; whoami'

Any idea what might be wrong?

@happz
Collaborator Author

happz commented Jan 4, 2024

Interesting. Tried today again and encountered similar behaviour. Seems that the problem happens randomly.

...

Any idea what might be wrong?

Stinks like a race condition, threads contending over shared resources or something similar. I'll try to debug it in the afternoon.

@psss
Collaborator

psss commented Jan 4, 2024

Stinks like a race condition, threads contending over shared resources or something similar. I'll try to debug it in the afternoon.

Thanks. The detailed debug output from ssh looks like this:

debug1: Offering public key: /home/psss/.ssh/id_rsa RSA SHA256:OY7RDp7mZHS6OzXvc9hnQk2qaDCcOuFM2iaGD8GOrKc explicit agent
debug3: send packet: type 50
debug2: we sent a publickey packet, wait for reply
debug3: receive packet: type 51
debug1: Authentications that can continue: publickey,gssapi-keyex,gssapi-with-mic,password
debug1: Offering public key: /home/psss/.ssh/1minutetip_rsa RSA SHA256:9j1blwt3wcrRiGYZQ7ZGu9axm3cDklH6/z4c+Ee8CzE explicit agent
debug3: send packet: type 50
debug2: we sent a publickey packet, wait for reply
debug3: receive packet: type 51
debug1: Authentications that can continue: publickey,gssapi-keyex,gssapi-with-mic,password
debug1: Offering public key: /var/tmp/tmt/run-031/plan/provision/wget-client/id_ecdsa ECDSA SHA256:i3ivQgMXnx8vh5PJ/L1K0LKqIs5RToHLZqA+d0QCbtM explicit
debug3: send packet: type 50
debug2: we sent a publickey packet, wait for reply
debug3: receive packet: type 51
debug1: Authentications that can continue: publickey,gssapi-keyex,gssapi-with-mic,password
debug2: we did not send a packet, disable method
debug3: authmethod_lookup password
debug3: remaining preferred: ,password
debug3: authmethod_is_enabled password
debug1: Next authentication method: password
...

Seems like the provided key is not accepted.

@happz
Collaborator Author

happz commented Jan 4, 2024

@psss hopefully addressed in 87ec95f: added more locking in the testcloud plugin (not all calls into the testcloud library were protected) and added a _thread_safe flag for provisioning plugins. Since we don't control all the libraries our provisioning plugins use, this gives us more time to verify they work when used in parallel provisioning.
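
(A minimal sketch of how such an opt-in flag can gate parallel execution; the class and method names here are illustrative, not tmt's real API:)

from concurrent.futures import ThreadPoolExecutor

class ProvisionPlugin:
    # Plugins flip this to True once their backing library is verified.
    _thread_safe = False

    def __init__(self, name: str) -> None:
        self.name = name

    def go(self) -> None:
        print(f"{self.name}: provisioned")

def provision_all(plugins: list[ProvisionPlugin]) -> None:
    parallel = [p for p in plugins if p._thread_safe]
    serial = [p for p in plugins if not p._thread_safe]
    # Verified plugins provision concurrently...
    with ThreadPoolExecutor() as pool:
        list(pool.map(ProvisionPlugin.go, parallel))
    # ...while unverified ones keep the old sequential behaviour.
    for plugin in serial:
        plugin.go()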

Collaborator

@psss psss left a comment


Thanks much for implementing this! Looks good and works nicely. Added just some minor questions and suggestions.

@psss
Collaborator

psss commented Jan 4, 2024

This definitely deserves a mention in the release notes. Also, what do you think about adding a short paragraph with the overview of steps which support parallel execution to the Guide?

@psss psss self-assigned this Jan 4, 2024
@happz
Collaborator Author

happz commented Jan 4, 2024

This definitely deserves a mention in the release notes.

Sure, added one in a8d369a.

Also, what do you think about adding a short paragraph with the overview of steps which support parallel execution to the Guide?

I mentioned compatible plugins in the release note, and I was hoping to make this information available on plugin pages once #2549 lands.

@psss
Collaborator

psss commented Jan 5, 2024

Also, what do you think about adding a short paragraph with the overview of steps which support parallel execution to the Guide?

I mentioned compatible plugins in the release note, and I was hoping to make this information available on plugin pages once #2549 lands.

Makes sense, thanks.

Collaborator

@psss psss left a comment


Thanks for addressing all issues. Looks good now. I've just slightly modified the release note formatting in 59681ab.

@psss
Collaborator

psss commented Jan 5, 2024

Many tests are failing because of:

PullTask.__init__() got an unexpected keyword argument 'result'

@happz
Collaborator Author

happz commented Jan 5, 2024

/packit test --identifier full

Many tests are failing because of:

PullTask.__init__() got an unexpected keyword argument 'result'

Adding b6a3250, let's see whether it helps.

@psss
Collaborator

psss commented Jan 5, 2024

/packit test --identifier full

@psss
Collaborator

psss commented Jan 5, 2024

/packit test --identifier full

@psss psss merged commit 8196422 into main Jan 5, 2024
18 checks passed
@psss psss deleted the parallelize-provision branch January 5, 2024 20:02
@guoguojenna
Contributor

Seems beaker provisioning still runs in sequence, any reason?

Thanks.

@@ -680,6 +680,8 @@ class ProvisionBeaker(tmt.steps.provision.ProvisionPlugin[ProvisionBeakerData]):
     _data_class = ProvisionBeakerData
     _guest_class = GuestBeaker
 
+    # _thread_safe = True
Contributor


If I uncomment it and use _thread_safe = True, beaker parallel provisioning works for me. Wondering why you commented it out?

[client-1] guest: has been requested
[client-1] job id: 8784929
[client-1] wait: waiting for condition 'get_new_state' with timeout 1:00:00, deadline in 3600.0 seconds, checking every 60.00 seconds
[server-1] guest: has been requested
[server-1] job id: 8784930
[server-1] wait: waiting for condition 'get_new_state' with timeout 1:00:00, deadline in 3600.0 seconds, checking every 60.00 seconds
[client-2] guest: has been requested
[client-2] job id: 8784931
...
summary: 3 guests provisioned
client-2
client-1
server-1

Collaborator Author


Because we are not sure whether the library tmt uses for Beaker provisioning, mrack, is thread-safe. We did run into issues with the virtual plugin, where the problem was eventually solved by adding an extra lock to provide serialization, but there was no time invested into doing the same for the Beaker plugin. If it's required, that is.

Filed #2607 to track this.
