
flaky test in debci #1076

Open
siretart opened this issue Sep 7, 2024 · 5 comments
Comments

siretart commented Sep 7, 2024

I'm afraid this is very similar to #433, but I can see it on amd64, more specifically on Debian's CI system, debci. Here, the tests run in a qemu VM with 4 GB of RAM and 2 cores.

It seems to break quite reliably there, but the failures themselves are fairly inconsistent. Example runs I've observed so far:

I've managed to trigger it once on my laptop by running the tests on four qemu machines in parallel. I kept going with five machines in parallel, but could not reproduce it anymore.

Any ideas/suggestions? Maybe I should increase the sleep statements in helpers.bash?

siretart commented Sep 7, 2024

I've been informed that the debci logs are being recycled. To preserve them, I've captured them here: https://gist.github.com/siretart/19cfe578df59aa7a0158d9488ca0ce25

siretart commented Sep 8, 2024

So, I've been staring at helpers.bash, and would like to run a thought by you:

All tests fail while running run_nc_test(). That function, in simple terms, opens two nc processes, one listening on a port and the other writing to another port, potentially in different network namespaces, to test whether netavark sets up the connection properly.
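For illustration, here is a greatly simplified, hypothetical sketch of that flow. A fifo stands in for the TCP socket and a plain cat for the listening nc, so it runs without network namespaces or ncat; only the nc-out file name is taken from helpers.bash, everything else is made up for the sketch:

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the run_nc_test() flow, with a fifo in place of the
# TCP socket and cat in place of the listening nc process.
set -eu
tmpdir=$(mktemp -d)
mkfifo "$tmpdir/sock"

# "listener": background job writing whatever it receives to the temp file,
# standing in for: nc ... -l -p $container_port &>"$NETAVARK_TMPDIR/nc-out" &
cat "$tmpdir/sock" >"$tmpdir/nc-out" &
listener=$!

# "client": send the test data, standing in for the second nc invocation;
# the real helper uses data=$(random_string)
data=MtMvuBHY3k
printf '%s' "$data" >"$tmpdir/sock"

# helpers.bash reads nc-out at this point WITHOUT waiting for the background
# job; this wait is what makes the read below deterministic
wait "$listener"

got=$(cat "$tmpdir/nc-out")
if [ "$got" = "$data" ]; then echo match; else echo "FAIL: got '$got'"; fi
rm -rf "$tmpdir"
```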

In all failing cases, we observe that the assertion fails because the random string (generated by `data=$(random_string)`) does not match the actual output, which is the empty string. Here is an example output:

# #/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
# #|     FAIL: ncat received data
# #| expected: 'MtMvuBHY3k'
# #|   actual: ''
# #\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The assertion does not look at the output directly; that is hard to do in bash. Instead, it reads from a temporary file (cf. `got=$(cat "$NETAVARK_TMPDIR/nc-out")`). Evidently, this file exists but is empty (otherwise we would expect a different output above).

So how can this be? Evidently the file got created when the first nc process, the one that receives the data, was started:

netavark/test/helpers.bash, lines 647 to 648 in 4d896f6:

nsenter -n -t "${CONTAINER_NS_PIDS[$container_ns]}" timeout --foreground -v --kill=10 5 \
    nc $nc_common_args -l -p $container_port &>"$NETAVARK_TMPDIR/nc-out" <&$stdin &

This bash invocation starts the process and runs it in the background using bash's "job control" feature, which allows the two nc processes to execute concurrently.

My theory/thought is this: the listening nc process may have received the string, may even have written it to the file descriptor associated with the temporary file, and may have issued a close() system call to have the associated buffers flushed; however, this flushing of the I/O buffer and closing of the file descriptor did not complete before the file is read in line 660.

If this theory is correct, then the issue can be solved by ensuring that the "job" started in line 647 terminates before the temporary file is read in line 660. Maybe moving up the closing of the file descriptor,

netavark/test/helpers.bash, lines 666 to 667 in 4d896f6:

# close the fd
exec {stdin}>&-

would help? -- not sure. Ideally, we could signal the nc invocation that listens on the port to exit and make sure it gets the chance to write out and close the temporary file properly before the test reads it in line 663.

This theory could explain why this happens on test VMs more often than on developer laptops: those worker VMs tend to have a lot of load on their I/O systems, so buffers are flushed more slowly than on unloaded developer systems. Moreover, I suspect that those test VMs don't run on local SSDs but on virtualized storage that may not even be on the local machine: more opportunity for the I/O buffer to see delays before the data becomes visible to other processes.

cc: @Luap99 @flouthoc

Luap99 commented Sep 8, 2024

Yeah, all the ncat usage in bash is really a PITA to get synced up; the port-listening part alone is already complex, and then we kept getting broken ncat updates that broke our tests.

First, is /tmp a tmpfs? We use a tmpfs and do not see such flakes here, so if I/O is the problem then I think that should fix it. I doubt you actually need to close/sync the file on a tmpfs.

> My theory/thought is this: the listening nc process may have received the string, may even have written it to the file descriptor associated with the temporary file, and may have issued a close() system call to have the associated buffers flushed; however, this flushing of the I/O buffer and closing of the file descriptor did not complete before the file is read in line 660.

I think the race might be even worse: just because the client writes to the socket doesn't mean the server side has read it yet. At least in the UDP case, the client exits before getting any confirmation from the server; since this is not a stream protocol, we just send one packet of data and then exit. That means that by the time cat reads the file, the server may not yet have read the data or written it to the file, so yes, I think you are right.

> If this theory is correct, then the issue can be solved by ensuring that the "job" started in line 647 terminates before the temporary file is read in line 660. Maybe moving up the closing of the file descriptor would help?

Closing alone would not help; we would close and then wait for the background job to finish.
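To make the ordering concrete, here is a hypothetical, self-contained reproduction of the problem, with no ncat involved; the sleep stands in for slow I/O on a loaded debci worker:

```shell
#!/usr/bin/env bash
# Hypothetical sketch: a background job that is slow to produce its output,
# read too early (the flake) vs. after wait (the fix).
set -eu
tmp=$(mktemp -d)

: >"$tmp/nc-out"             # the file exists but is empty, as in the failing runs

# background "listener" job that takes a while to write its output
( sleep 0.2; printf 'MtMvuBHY3k' >"$tmp/nc-out" ) &
job=$!

# racy=$(cat "$tmp/nc-out")  # reading here can still see '' -- the flake

wait "$job"                  # the fix: let the background job finish first
got=$(cat "$tmp/nc-out")     # now deterministic
echo "$got"
rm -rf "$tmp"
```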

Another way is to make the server bidirectional and send the message back, so we can compare the stdout of the ncat client. The nice thing about that is we actually verify that bidirectional traffic works.
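A hypothetical sketch of that idea, with two fifos standing in for the two directions of the TCP connection (with ncat this would be the -l side echoing its input back over the socket); since the client compares its own received data, no shared temp file is needed:

```shell
#!/usr/bin/env bash
# Hypothetical echo-server sketch: the "server" sends the payload back, and
# the client checks what it got on its own side of the connection.
set -eu
tmp=$(mktemp -d)
mkfifo "$tmp/c2s" "$tmp/s2c"   # client->server and server->client directions

# "server": read one message and send it straight back
( IFS= read -r msg <"$tmp/c2s"; printf '%s' "$msg" >"$tmp/s2c" ) &
server=$!

data=MtMvuBHY3k
printf '%s\n' "$data" >"$tmp/c2s"   # client sends ...
got=$(cat "$tmp/s2c")               # ... and reads the echo back

wait "$server"
if [ "$got" = "$data" ]; then echo match; fi
rm -rf "$tmp"
```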

Long term, I was thinking of rewriting the tests in some higher-level tool or language (e.g. Python or Go), because really we just need to set up a listener in a namespace, and the whole background check for whether ncat has bound the port is just fragile. The JSON output checking isn't nice to use either.
But I guess, given other priorities, this is not going to happen anytime soon, if at all.

siretart commented Sep 8, 2024

> Long term, I was thinking of rewriting the tests in some higher-level tool or language (e.g. Python or Go), because really we just need to set up a listener in a namespace, and the whole background check for whether ncat has bound the port is just fragile. The JSON output checking isn't nice to use either.

Just thinking out loud here: instead of completely rewriting the tests in Python or Go, maybe a more specialized tool that can run the sender and receiver in two different network namespaces given on the command line could be plugged into the run_nc_test() helper, minimizing the changes needed to the tests?

Luap99 commented Sep 20, 2024

> Just thinking out loud here: instead of completely rewriting the tests in Python or Go, maybe a more specialized tool that can run the sender and receiver in two different network namespaces given on the command line could be plugged into the run_nc_test() helper, minimizing the changes needed to the tests?

Sure, but this is the least of my concerns. The JSON string matching we do in bash is just not very sane, and most of what we do would be much better in a sane language where you can actually manipulate JSON in a decent way.

Luap99 added a commit to Luap99/netavark that referenced this issue Sep 20, 2024
The background process may not yet have written the output to the file,
so we need to make sure it does first.

Fixes containers#1076

Signed-off-by: Paul Holzinger <[email protected]>