
flaky test in debci #1076

Open
siretart opened this issue Sep 7, 2024 · 5 comments
Comments

siretart commented Sep 7, 2024

I'm afraid this is very similar to #433, but I can see it on amd64, more specifically on Debian's CI system, debci. Here, the tests run in a qemu VM with 4 GB of RAM and 2 cores.

It seems to break quite reliably there, but the failures themselves are fairly inconsistent. Example runs I've observed so far:

I've managed to trigger it once on my laptop by running the tests on four qemu machines in parallel. I kept going with five machines in parallel, but could not reproduce it anymore.

Any ideas/suggestions? Maybe I should increase the sleep statements in helpers.bash?

siretart commented Sep 7, 2024

I've been informed that the debci logs are being recycled. To preserve them, I've captured them here: https://gist.github.com/siretart/19cfe578df59aa7a0158d9488ca0ce25

siretart commented Sep 8, 2024

So, I've been staring at helpers.bash, and would like to run a thought by you:

All tests fail while running run_nc_test(). That function, in simple terms, opens two nc processes, one listening on a port and the other writing to another port, potentially in different network namespaces, to test whether netavark sets up the connection properly.
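For illustration, here is a greatly simplified, hypothetical sketch of that flow. A fifo stands in for the TCP socket and a plain cat for the listening nc, so it runs without network namespaces or ncat; only the nc-out file name is taken from helpers.bash, everything else is made up for the sketch:

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the run_nc_test() flow, with a fifo in place of the
# TCP socket and cat in place of the listening nc process.
set -eu
tmpdir=$(mktemp -d)
mkfifo "$tmpdir/sock"

# "listener": background job writing whatever it receives to the temp file,
# standing in for: nc ... -l -p $container_port &>"$NETAVARK_TMPDIR/nc-out" &
cat "$tmpdir/sock" >"$tmpdir/nc-out" &
listener=$!

# "client": send the test data, standing in for the second nc invocation;
# the real helper uses data=$(random_string)
data=MtMvuBHY3k
printf '%s' "$data" >"$tmpdir/sock"

# helpers.bash reads nc-out at this point WITHOUT waiting for the background
# job; this wait is what makes the read below deterministic
wait "$listener"

got=$(cat "$tmpdir/nc-out")
if [ "$got" = "$data" ]; then echo match; else echo "FAIL: got '$got'"; fi
rm -rf "$tmpdir"
```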

In all failing cases, we observe that the assertion fails because the random string (generated by `data=$(random_string)`) does not match the actual output, which is the empty string. Here is an example output:

# #/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
# #|     FAIL: ncat received data
# #| expected: 'MtMvuBHY3k'
# #|   actual: ''
# #\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The assertion does not look at the output directly; that is hard to do in bash. Instead, it reads from a temporary file (cf. `got=$(cat "$NETAVARK_TMPDIR/nc-out")`). Evidently, this file exists but is empty (otherwise we would expect a different output above).

So how can this be? Evidently the file got created when the first nc process, the one that receives the data, was started:

netavark/test/helpers.bash, lines 647 to 648 in 4d896f6:

nsenter -n -t "${CONTAINER_NS_PIDS[$container_ns]}" timeout --foreground -v --kill=10 5 \
    nc $nc_common_args -l -p $container_port &>"$NETAVARK_TMPDIR/nc-out" <&$stdin &

This bash invocation starts the process and runs it in the background using bash's "job control" feature, which allows the two nc processes to execute concurrently.

My theory/thought is this: the listening nc process may have received the string, may even have written it to the file descriptor associated with the temporary file, and may have issued a close() system call to have the associated buffers flushed; however, this flushing of the I/O buffer and closing of the file descriptor did not complete before the file is read in line 660.

If this theory is correct, then the issue can be solved by ensuring that the "job" started in line 647 terminates before the temporary file is read in line 660. Maybe moving up the closing of the file descriptor,

netavark/test/helpers.bash, lines 666 to 667 in 4d896f6:

# close the fd
exec {stdin}>&-

would help? -- not sure. Ideally, we could signal the nc invocation that listens on the port to exit and make sure it gets the chance to write out and close the temporary file properly before the test reads it in line 663.

This theory could explain why this happens on test VMs more often than on developer laptops: those worker VMs tend to have a lot of load on their I/O systems, so buffers are flushed more slowly than on unloaded developer systems. Moreover, I suspect that those test VMs don't run on local SSDs but on virtualized storage that may not even be on the local machine: more opportunity for the I/O buffer to see delays before the data becomes visible to other processes.

cc: @Luap99 @flouthoc

Luap99 commented Sep 8, 2024

Yeah, all the ncat usage in bash is really a PITA to get synced up; the port-listening part alone is already complex, and then we kept getting broken ncat updates that broke our tests.

First, is /tmp a tmpfs? We use a tmpfs and do not see such flakes here, so if I/O is the problem then I think that should fix it. I doubt you actually need to close/sync the file on a tmpfs.

> My theory/thought is this: the listening nc process may have received the string, may even have written it to the file descriptor associated with the temporary file, and may have issued a close() system call to have the associated buffers flushed; however, this flushing of the I/O buffer and closing of the file descriptor did not complete before the file is read in line 660.

I think the race might be even worse: just because the client writes to the socket doesn't mean the server side has read it yet. At least in the UDP case, the client exits before getting any confirmation from the server; since this is not a stream protocol, we just send one packet of data and then exit. That means that by the time cat reads the file, the server may not yet have read the data or written it to the file, so yes, I think you are right.

> If this theory is correct, then the issue can be solved by ensuring that the "job" started in line 647 terminates before the temporary file is read in line 660. Maybe moving up the closing of the file descriptor would help?

Closing alone would not help; we would close and then wait for the background job to finish.
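To make the ordering concrete, here is a hypothetical, self-contained reproduction of the problem, with no ncat involved; the sleep stands in for slow I/O on a loaded debci worker:

```shell
#!/usr/bin/env bash
# Hypothetical sketch: a background job that is slow to produce its output,
# read too early (the flake) vs. after wait (the fix).
set -eu
tmp=$(mktemp -d)

: >"$tmp/nc-out"             # the file exists but is empty, as in the failing runs

# background "listener" job that takes a while to write its output
( sleep 0.2; printf 'MtMvuBHY3k' >"$tmp/nc-out" ) &
job=$!

# racy=$(cat "$tmp/nc-out")  # reading here can still see '' -- the flake

wait "$job"                  # the fix: let the background job finish first
got=$(cat "$tmp/nc-out")     # now deterministic
echo "$got"
rm -rf "$tmp"
```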

Another way is to make the server bidirectional and send the message back, so we can compare the stdout of the ncat client. The nice thing about that is we actually verify that bidirectional traffic works.
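A hypothetical sketch of that idea, with two fifos standing in for the two directions of the TCP connection (with ncat this would be the -l side echoing its input back over the socket); since the client compares its own received data, no shared temp file is needed:

```shell
#!/usr/bin/env bash
# Hypothetical echo-server sketch: the "server" sends the payload back, and
# the client checks what it got on its own side of the connection.
set -eu
tmp=$(mktemp -d)
mkfifo "$tmp/c2s" "$tmp/s2c"   # client->server and server->client directions

# "server": read one message and send it straight back
( IFS= read -r msg <"$tmp/c2s"; printf '%s' "$msg" >"$tmp/s2c" ) &
server=$!

data=MtMvuBHY3k
printf '%s\n' "$data" >"$tmp/c2s"   # client sends ...
got=$(cat "$tmp/s2c")               # ... and reads the echo back

wait "$server"
if [ "$got" = "$data" ]; then echo match; fi
rm -rf "$tmp"
```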

Long term, I was thinking of rewriting the tests in some higher-level tool or language (e.g. Python or Go), because really we just need to set up a listener in a namespace, and the whole background check for whether ncat has bound the port is just fragile. The JSON output checking isn't nice to use either.
But I guess, given other priorities, this is not going to happen anytime soon, if at all.

siretart commented Sep 8, 2024

> Long term, I was thinking of rewriting the tests in some higher-level tool or language (e.g. Python or Go), because really we just need to set up a listener in a namespace, and the whole background check for whether ncat has bound the port is just fragile. The JSON output checking isn't nice to use either.

Just thinking out loud here: instead of completely rewriting the tests in Python or Go, maybe a more specialized tool that can run the sender and receiver in two different network namespaces given on the command line could be plugged into the run_nc_test() helper, minimizing the changes needed to the tests?

Luap99 commented Sep 20, 2024

> Just thinking out loud here: instead of completely rewriting the tests in Python or Go, maybe a more specialized tool that can run the sender and receiver in two different network namespaces given on the command line could be plugged into the run_nc_test() helper, minimizing the changes needed to the tests?

Sure, but this is the least of my concerns. The JSON string matching we do in bash is just not very sane, and most of what we do would be much better in a sane language where you can actually manipulate JSON in a decent way.

Luap99 added a commit to Luap99/netavark that referenced this issue Sep 20, 2024
The background process may not yet have written the output to the file,
so we need to make sure it does first.

Fixes containers#1076

Signed-off-by: Paul Holzinger <[email protected]>