Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unified packet capture interface, take 2 #166

Merged
merged 472 commits into from
Aug 19, 2024
Merged

Unified packet capture interface, take 2 #166

merged 472 commits into from
Aug 19, 2024

Conversation

jaycedowell
Copy link
Member

@jaycedowell jaycedowell commented Mar 22, 2022

This PR provides a unified packet capture interface that makes it easier to add new packet formats to Bifrost and to read packet formats from the network (unicast or multicast) or from disk. This also includes a generalization of the packet writer interface. This PR changes the Bifrost API in a couple of ways:

bifrost.udp_capture.UDPCapture is now bifrost.packet_capture.UDPCapture
there is a new PacketCaptureCallback class that wraps the callbacks
bifrost.udp_transmit is now bifrost.packet_writer

This should also address the segfault I was running into in #137.

@codecov-commenter
Copy link

codecov-commenter commented Mar 23, 2022

Codecov Report

Attention: Patch coverage is 67.88618% with 395 lines in your changes are missing coverage. Please review.

Project coverage is 67.77%. Comparing base (4fadfa5) to head (0c8f19b).
Report is 849 commits behind head on master.

❗ Current head 0c8f19b differs from pull request most recent head a360f0b. Consider uploading reports for the commit a360f0b to get more accurate results

Files Patch % Lines
python/bifrost/rdma.py 0.00% 122 Missing ⚠️
python/bifrost/telemetry/__init__.py 34.11% 112 Missing ⚠️
python/bifrost/version/__main__.py 0.00% 28 Missing ⚠️
python/bifrost/telemetry/__main__.py 0.00% 15 Missing ⚠️
python/bifrost/portaudio.py 20.00% 12 Missing ⚠️
python/bifrost/psrdada.py 0.00% 12 Missing ⚠️
python/bifrost/libbifrost.py 67.64% 11 Missing ⚠️
python/bifrost/pipeline.py 82.45% 10 Missing ⚠️
python/bifrost/DataType.py 76.31% 9 Missing ⚠️
python/bifrost/sigproc2.py 70.96% 9 Missing ⚠️
... and 27 more
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #166      +/-   ##
==========================================
+ Coverage   61.62%   67.77%   +6.15%     
==========================================
  Files          64       66       +2     
  Lines        5300     5741     +441     
==========================================
+ Hits         3266     3891     +625     
+ Misses       2034     1850     -184     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@jaycedowell
Copy link
Member Author

Working on the mac support reminded me of something. When I first started working on this new interface I had a "sniff" mode which used linux's raw packet interface with a promiscuous NIC to capture packets destined for a different host. I originally did this because that is how we were running one of our data modes at LWA1. I've since moved on to multicast and I now wonder if "sniff" is still a feature worth having.

@jaycedowell
Copy link
Member Author

There is only one failure in test_udp_io for MacOS now but there are intermittent failures for Ubuntu. That's strange. I'm not sure if this is really better or not.

@jaycedowell
Copy link
Member Author

MacOS is better now but there still seems to be intermittent problems. They sometimes happen when I test locally but it isn't clear why.

@league
Copy link
Collaborator

league commented Jul 20, 2022

Just a note that I'm able to reproduce this error on my little MacBook Air:

FAIL: test_read_drx_single (test_udp_io.UDPIOTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/league/p/bifrost/test/test_udp_io.py", line 408, in test_read_drx_single
    np.testing.assert_equal(final[i,...], data[i,...])
  File "/nix/store/ycp9cnz64wf6xm0f5i1giwdacj6v89n5-python3.8-numpy-1.21.2/lib/python3.8/site-packages/numpy/testing/_private/utils.py", line 345, in assert_equal
    return assert_array_equal(actual, desired, err_msg, verbose)
  File "/nix/store/ycp9cnz64wf6xm0f5i1giwdacj6v89n5-python3.8-numpy-1.21.2/lib/python3.8/site-packages/numpy/testing/_private/utils.py", line 934, in assert_array_equal
    assert_array_compare(operator.__eq__, x, y, err_msg=err_msg,
  File "/nix/store/ycp9cnz64wf6xm0f5i1giwdacj6v89n5-python3.8-numpy-1.21.2/lib/python3.8/site-packages/numpy/testing/_private/utils.py", line 844, in assert_array_compare
    raise AssertionError(msg)
AssertionError: 
Arrays are not equal

Mismatched elements: 3871 / 8192 (47.3%)
 x: ndarray([[( 65,), (-46,), (-18,), ..., (-49,), ( 63,), ( 18,)],
         [( 65,), (-47,), ( -2,), ..., (-34,), ( 79,), (  2,)]],
        dtype=[('re_im', 'i1')])
 y: ndarray([[( 64,), (-30,), (-34,), ..., (-49,), ( 62,), ( 34,)],
         [( 49,), (-63,), ( -2,), ..., (-34,), ( 79,), (  2,)]],
        dtype=[('re_im', 'i1')])

Is that the only Mac error you're seeing on this branch now?

I think this faithfully brings the changes from `socket-deinline` and
`socket-reference` into `ibverb-support`. Test results seem to be
unchanged.

One change from `socket-reference` that may have been lost is that
we deleted the `pid_t` member from `BFudpcapture_impl` and
`BFudptransmit_impl` but it's still present in the corresponding
`packet_capture` and `packet_writer` objects.
@jaycedowell
Copy link
Member Author

jaycedowell commented Aug 4, 2022

That's interesting. I just pulled down the recent changes and gave it a shot on my laptop again. The first time I ended up with no failures but this warning:

test_read_multicast (test_udp_io.UDPIOTest) ... Exception in thread Thread-6:
Traceback (most recent call last):
  File "/opt/homebrew/Cellar/[email protected]/3.9.13_1/Frameworks/Python.framework/Versions/3.9/lib/python3.9/threading.py", line 980, in _bootstrap_inner
    self.run()
  File "/opt/homebrew/Cellar/[email protected]/3.9.13_1/Frameworks/Python.framework/Versions/3.9/lib/python3.9/threading.py", line 917, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/jaycedowell/CodeSafe/bifrost/test/test_udp_io.py", line 179, in main
    for iseq in self.ring.read(guarantee=True):
  File "/Users/jaycedowell/CodeSafe/bifrost/python/bifrost/ring.py", line 116, in read
    with ReadSequence(self, which=whence, guarantee=guarantee) as cur_seq:
  File "/Users/jaycedowell/CodeSafe/bifrost/python/bifrost/ring.py", line 255, in __init__
    _check(_bf.bfRingSequenceOpenEarliest(self.obj, ring.obj, guarantee))
  File "/Users/jaycedowell/CodeSafe/bifrost/python/bifrost/libbifrost.py", line 117, in _check
    raise EndOfDataStop('BF_STATUS_END_OF_DATA')
bifrost.libbifrost.EndOfDataStop: BF_STATUS_END_OF_DATA
ok

Second run did not show that warnings and there we still no failures. Third run failed with:

======================================================================
FAIL: test_read_multicast (test_udp_io.UDPIOTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/jaycedowell/CodeSafe/bifrost/test/test_udp_io.py", line 545, in test_read_multicast
    np.testing.assert_equal(final[i,...], data[i,...])
  File "/opt/homebrew/lib/python3.9/site-packages/numpy/testing/_private/utils.py", line 345, in assert_equal
    return assert_array_equal(actual, desired, err_msg, verbose)
  File "/opt/homebrew/lib/python3.9/site-packages/numpy/testing/_private/utils.py", line 934, in assert_array_equal
    assert_array_compare(operator.__eq__, x, y, err_msg=err_msg,
  File "/opt/homebrew/lib/python3.9/site-packages/numpy/testing/_private/utils.py", line 844, in assert_array_compare
    raise AssertionError(msg)
AssertionError: 
Arrays are not equal

Mismatched elements: 15872 / 16384 (96.9%)
 x: ndarray([[( 47, -10), (-23,  27), (-13, -29), ..., (  7,  30),
          (-38, -19), ( 50,  -2)],
         [(  7,   0), (  7,   0), (  7,   0), ..., (  7,   0),...
 y: ndarray([[( 47, -10), (-23,  27), (-13, -29), ..., (  7,  30),
          (-38, -19), ( 50,  -2)],
         [( 50,   4), (-41,  17), ( 10, -29), ..., (-16,  28),...

----------------------------------------------------------------------
Ran 10 tests in 14.530s

FAILED (failures=1)

Fourth run also failed as the same place. Fifth , sixth, and seventh runs were all ok. Eighth failed. I don't really get it.

@league
Copy link
Collaborator

league commented Aug 4, 2022

Oof. I've seen the EndOfDataStop('BF_STATUS_END_OF_DATA') intermittently too, on Linux, although the test apparently passes. On Mac, it seems that I reliably fail test_read_drx_single. Earlier today, I think I saw that one fail on Linux too, but it was only once. test_read_multicast hasn't failed for me before.

I guess I've been gravitating toward a timing/race condition type of hypothesis. I notice in the send loops in the tests, there's a time.sleep. Is it just acting as a yield to give the other thread(s) a chance to run? But the buffer should be blocking… right? Both on write-when-full or read-when-empty? Or are we non-blocking and assuming the size of the buffer and thread fairness will make it mostly okay. I was going to play with the amounts of data (and/or size of buffer) to see if I can get it to fail more or less often.

@jaycedowell
Copy link
Member Author

I usually use a time.sleep in these contexts to get a more realistic packet/data rate coming out but these look like they might be being used as a crude rate limiter.

As for blocking/non-blocking the only thing that should block would be the ring that connects something like DRXReader with AccumulateOp. Each packet sending call should be returning as soon as the packets are handed off to the kernel. Your race condition idea might be on track. If you look at my last failure the first few elements in the comparison seem to be ok and then the end is all (7, 0).

@jaycedowell jaycedowell merged commit 970dd02 into master Aug 19, 2024
4 of 12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants