
add Device tests for close and threading crashes #809

Open · diablodale wants to merge 9 commits into develop from device-tests-close-threads

Conversation

@diablodale (Contributor) commented Apr 13, 2023

Not ready to merge; it is currently Windows-only while I refine the new tests.

The test "Multiple devices created and destroyed in parallel" will usually crash when 3 USB devices are attached. If it doesn't, run the test again until it crashes.

The test "Device APIs after Device::close()" will always crash, segfault, etc.
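
For readers of this thread, a minimal sketch (not the actual test code in this PR) of what the second test exercises, assuming Catch2 v3 and one attached OAK device; the specific APIs called after close() are illustrative only:

#include <catch2/catch_test_macros.hpp>
#include <depthai/depthai.hpp>

TEST_CASE("Device APIs after Device::close()") {
    dai::Device device;                    // connects to any attached OAK device
    device.close();                        // explicit close
    REQUIRE(device.isClosed());

    // Every public API used after close() should throw a well-defined
    // "already closed" error instead of crashing or segfaulting.
    REQUIRE_THROWS(device.getConnectedCameras());
    REQUIRE_THROWS(device.getDdrMemoryUsage());
}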

@themarpe (Collaborator) left a review comment

Thanks!

Just wondering on rpcClient - when is that now destructed, such that streams and other resources are destructed as well (and that we still mitigate the initial issue that kept the device in an odd state)?

Otherwise 👍 on this, targeting to bring it in for the next release.

@diablodale (Contributor, Author) commented Apr 24, 2023

Just wondering on rpcClient - when is that now destructed, such that streams and other resources are destructed as well (and that we still mitigate the initial issue that kept the device in an odd state)?

My current approach is that pimpl->rpcClient is never explicitly destructed. Instead, it is destructed when pimpl is destructed as part of Device destruction.

  • depthai-core rules/contract require that a callable rpcClient live past close()
  • rpcClient has a member shared_ptr<stream> which contains a member shared_ptr<xlink connection>
  • the xlink connection is closed with connection->close(); at the start of the Device close process. Assuming XLink is thread-safe and XLinkConnection::close() is well written, the close of the XLink connection can happen at any time and API calls by anyone will cleanly fail/throw.
  • when the xlink connection is closed, xlink calls made by rpcClient (or anyone else) should fail with xlink errors or throw something similar to "Device already closed"
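
A compilable, simplified sketch of that ownership (stand-in types, not the actual depthai-core classes): rpcClient lives inside the pimpl and is destructed only when the Device is, while close() only tears down the underlying connection so later RPC calls fail cleanly instead of crashing.

#include <atomic>
#include <memory>
#include <stdexcept>

struct Connection {                          // stand-in for XLinkConnection
    std::atomic<bool> closed{false};
    void close() { closed = true; }          // the real close() does multi-second XLink work
};

struct RpcClient {                           // stand-in for pimpl->rpcClient
    std::shared_ptr<Connection> conn;        // keeps the connection (and its stream) alive
    void call(const char*) {
        if(conn->closed) throw std::runtime_error("Device already closed or disconnected");
    }
};

class Device {
    std::shared_ptr<Connection> connection;
    struct Impl { std::unique_ptr<RpcClient> rpcClient; };
    std::unique_ptr<Impl> pimpl;

public:
    Device() : connection(std::make_shared<Connection>()), pimpl(new Impl) {
        pimpl->rpcClient.reset(new RpcClient{connection});
    }
    void close() { connection->close(); }    // rpcClient is intentionally NOT reset here
    ~Device() { close(); }                   // pimpl (rpcClient, stream) and connection destruct after this body
};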

I cannot (yet) find any side effect where rpcClient would keep "the device in an odd state". @szabi-luxonis did the work on 4b3c687. I would like to understand their evidence or thinking on that. My suspicion is that it was other work (like the device firmware) that made the difference -or- there is a bug in the XLinkConnection::close() codepath causing an issue. I don't think rpcClient itself is the problem.

For example, in XLinkConnection::close() there is a code comment you wrote: // TODO(themarpe) - revisit for TCPIP protocol. That seems relevant to PoE devices. 🤔 What is that about?
XLinkConnection::close() does the following:

  1. locks a mutex
  2. calls XLink with long, multi-second work
  3. sets deviceLinkId = -1
  4. calls XLink again with long, multi-second work
  5. finally sets closed = true;
  6. releases the mutex
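
A compilable paraphrase of that sequence (stand-in members; the XLink calls are reduced to comments, not the actual source) to make the locking window explicit:

#include <mutex>

class XLinkConnectionSketch {
    std::mutex closedMtx;
    bool closed = false;
    int deviceLinkId = 1;

public:
    void close() {
        std::unique_lock<std::mutex> lock(closedMtx);  // 1. mutex held for the whole sequence
        if(closed) return;
        // 2. XLinkResetRemoteTimeout(...) -- long, multi-second XLink work
        deviceLinkId = -1;                             // 3.
        // 4. more long, multi-second XLink work (wait for device to go down / come back)
        closed = true;                                 // 5.
    }                                                  // 6. mutex released only here
};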

If anyone were to call any XLink APIs (like a Queue) or a Device API that uses rpcClient, then since XLink is thread-safe, those APIs should cleanly return an XLink error which is caught and rethrown. The customer that opened the triggering issue was using PoE. That makes me think again this is a timing/race issue because PoE has long, multi-second start/stop/reboot times.

I get suspicious of the following with a PoE device:

  1. Someone calls Device::closeImpl()
  2. connection->close() starts the blocking multi-second work above
  3. in parallel, the Device is still running watchdogThread and monitorThread. If they attempt their checks during XLinkConnection::close(), I think it is possible in certain scenarios for connection->close(); to be called by the monitor, which would then block because the mutex is still held by step 2. Perhaps this cascade of blocking causes a problem or "odd state".

Step 3 was one of my concerns in #520 when my approach of an atomic<bool> connection::closed was replaced with a blocking mutex. The scenarios are so many and varied that it is difficult to think through them all, which the atomic approach supports since it is non-blocking. This may not be the specific cause of this specific PoE problem. Yet, I think it is problematic to use a mutex and it may contribute to these types of race/timing failures. Personally, I don't like a public ::close() on any of these objects. They seem to cause more problems than their limited utility is worth.
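
For reference, a minimal sketch of the non-blocking atomic<bool> approach from #520 as I understand it (not the current code): callers observe closed without ever waiting on a mutex, so a monitor thread racing with closeImpl() cannot pile up behind a multi-second lock.

#include <atomic>

class XLinkConnectionAtomicSketch {
    std::atomic<bool> closed{false};

public:
    void close() {
        if(closed.exchange(true)) return;  // only the first caller performs the teardown
        // the multi-second XLink reset/reboot work happens here, with no lock that
        // watchdog/monitor threads could block on
    }
    bool isClosed() const { return closed; }  // non-blocking check for other threads
};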

@themarpe (Collaborator) commented

I cannot (yet) find any side effect where rpcClient would keep "the device in an odd state". @szabi-luxonis did the work on 4b3c687. I would like to understand their evidence or thinking on that. My suspicion is that it was other work (like the device firmware) that made the difference -or- there is a bug in the XLinkConnection::close() codepath causing an issue. I don't think rpcClient itself is the problem.

I think at the end of the day it boiled down to whether that rpcStream was destructed and, in turn, the underlying XLinkStream was closed.

Though I cannot confirm this to be the case. It would have to be retested, perhaps with the original post-change behavior vs. just closing the XLinkStream (instead of "nulling out" the rpcClient). That way the rpcClient could remain valid and the issues with interfacing with it post-close would not arise.

For example, in XLinkConnection::close() there is a code comment you wrote: // TODO(themarpe) - revisit for TCPIP protocol. That seems relevant to PoE devices. 🤔 What is that about?

This waits until the USB device is back online or times out - it helps with device down/up events, as otherwise issues could occur (afaik while the device was still destructing at the USB level, it would reconnect). The TODO is there because this does not happen for PoE devices (as "bootDevice" is never true in that case).

Which the atomic approach supports as it is non-blocking. This may not be the specific cause of this specific PoE problem. Yet, I think it is problematic to use a mutex and maybe it contributes to these types of race/timing fails.

I agree that with atomic this would be simpler - I am up for changing that back, but IMO the crux here still lies somewhere around whether the XLinkStream is closed or not.

@diablodale (Contributor, Author) commented

I'm not arguing 🙂 just trying to understand the thinking/logic as it isn't aligning with the code I see.
Why does it matter if the underlying XLinkStream of rpcClient was closed?

Anytime Device::close() or Device::~Device is called, the first meaningful line of code in DeviceBase::closeImpl() is XLinkConnection::close(). It blocks doing work like XLinkResetRemoteTimeout(). The monitor thread, upon failing pings, also calls XLinkConnection::close(). Both of their calls to XLinkConnection::close() call XLinkResetRemoteTimeout().

XLinkResetRemoteTimeout() takes the hardware "down", i.e. getXLinkState(link) != XLINK_UP. All XLink* API calls check whether the stream or link given as a param is XLINK_UP and return XLink errors when it is not. These XLink errors are caught in depthai-core code as return values from those XLink* APIs. This keeps the host in a well-defined state.
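
A simplified, stand-in sketch of that error path (hypothetical wrapper and type names, not the actual depthai-core or XLink code; the error text just mirrors the style of messages quoted in this thread): XLink reports the downed link via a return code and the host wrapper converts it into a typed exception.

#include <stdexcept>
#include <string>

enum class LinkStatus { Up, Down };          // stand-in for getXLinkState(link)
enum class XLinkStatus { Success, Error };   // stand-in for XLinkError_t

// Stand-in for an XLinkWriteData-style call: it refuses to work on a downed link.
XLinkStatus writeData(LinkStatus link) {
    return link == LinkStatus::Up ? XLinkStatus::Success : XLinkStatus::Error;
}

// depthai-core style wrapper: the XLink return code becomes a typed exception,
// so callers after close() get a clean throw rather than undefined behavior.
void rpcWrite(LinkStatus link, const std::string& streamName) {
    if(writeData(link) != XLinkStatus::Success) {
        throw std::runtime_error("Couldn't write data to stream: '" + streamName + "'");
    }
}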

The call to XLinkResetRemoteTimeout() also sends an XLINK_RESET_REQ message to the remote OAK hardware for it to bring itself down, and may force it with DispatcherDeviceFdDown(). Finally, it calls XLink_sem_wait(&link->dispatcherClosedSem) to wait for the dispatcher to be closed. None of that codepath needs any XLinkStream to be closed.

That's my reasoning from the code...which I can definitely be wrong about 😅. I can't align my reasoning with why you want the underlying XLinkStream of rpcClient to be closed...and why you believe it puts "the device in an odd state".

I don't have visibility into the OAK device firmware. The firmware must support any disconnect type at any time regardless of the state of XLinkStreams. Meaning the firmware should support: tcpip/ethernet failures, USB cable failures, a sudden DispatcherDeviceFdDown(), a sudden XLINK_RESET_REQ, etc. All of them should be handled cleanly and independently by the firmware itself. There is nothing for the depthai-core host code to do to be nice (e.g. always close rpcClient's XLinkStream). Why? Because there are an unbounded number of disconnect scenarios and the OAK firmware needs to handle them by itself in isolation from any host.

@diablodale (Contributor, Author) commented Apr 25, 2023

I've isolated an unexpected grouping of XLink failures with rapid restarts. When this PR's test "Multiple devices created and destroyed in parallel" is run with 2 OAK-D-Lite and 1 OAK-D attached, I get failures. You might have to run the test more than once, but it should eventually fail. On my laptop, it fails almost every run.

It fails in DeviceBase::startPipelineImpl when it makes rpc calls. The error is dai::XLinkWriteError and it occurs on pimpl->rpcClient->call("printAssets") or pimpl->rpcClient->call("startPipeline"). Those two call locations are additionally curious to me because:

  1. when both of those calls occur, there have been previous rpcClient calls that didn't throw. That suggests to me that the underlying XLink dispatcher for this connection has transitioned from a good state to a bad state that can't write -or- there is something in common with the two calls that is unique from the earlier calls.
  2. Both calls have no rpc parameters. Both calls have no .as<> for a return value. That makes these two unique in the whole function as earlier calls do have params or return values. This might be a false lead, but so far...no other rpcs are failing.

The rpc to printAssets...can we remove that? Or make it debug only?

@diablodale (Contributor, Author) commented Apr 26, 2023

Probably found a bug

95% sure I found a somewhat relevant bug here

} while(!found && steady_clock::now() - t1 < BOOTUP_SEARCH);

!found should not be there. It errantly causes early exit of the while loop when anything in any state is found.

OAK devices are not closing cleanly

For the below test run...

  1. Changed the if test in the above loop to only exit on X_LINK_BOOTLOADER
  2. Changed BOOTUP_SEARCH to 10s.
  3. added more logging throughout XLinkConnection::close()
  4. set XLINK_LEVEL=fatal

I am assuming that X_LINK_BOOTLOADER is the "at rest" state of a USB device like OAK-D and OAK-D-Lite after it is plugged into a computer's USB port and allowed to power on and load its internally-flashed bootloader. It appears in the OS as Movidius MyriadX and USB\VID_03E7&PID_2485&REV_0001.

Found device count: 3
MXID: 1844301031490A1300
MXID: 18443010C17F0C1300
[14442C10C1A1C6D200] [3.1] [1.729] [system] [critical] Error executing RPC call: [json.exception.parse_error.110] parse error at byte 2: syntax error while parsing MessagePack value: expected end of input; last byte: 0x00
[2023-04-26 12:31:26.317] [warning] RPC mxid(14442C10C1A1C6D200) error: Couldn't read data from stream: '__rpc_main' (X_LINK_ERROR)
[2023-04-26 12:31:26.320] [warning] XLinkResetRemoteTimeout(1) mxid(14442C10C1A1C6D200) returned: X_LINK_ERROR
[2023-04-26 12:31:36.181] [warning] Reboot after XLinkResetRemoteTimeout(1) mxid(18443010C17F0C1300) failed to find device in 10s
[2023-04-26 12:31:36.208] [warning] Reboot after XLinkResetRemoteTimeout(0) mxid(1844301031490A1300) failed to find device in 10s
[2023-04-26 12:31:36.356] [warning] Reboot after XLinkResetRemoteTimeout(1) mxid(14442C10C1A1C6D200) failed to find device in 10s

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
multiple_devices_test.exe is a Catch2 v3.2.1 host application.
Run with -? for options

-------------------------------------------------------------------------------
Multiple devices created and destroyed in parallel
-------------------------------------------------------------------------------
..\tests\src\multiple_devices_test.cpp(106)
...............................................................................

..\tests\src\multiple_devices_test.cpp(173): FAILED:
explicitly with message:
  Fail construct Device() with DeviceInfo(name=3.1, mxid=14442C10C1A1C6D200,
  X_LINK_UNBOOTED, X_LINK_USB_VSC, X_LINK_MYRIAD_X, X_LINK_SUCCESS) Device
  already closed or disconnected: io error

MXID: 18443010C17F0C1300
MXID: 1844301031490A1300
MXID: 14442C10C1A1C6D200
[2023-04-26 12:31:48.160] [warning] Reboot after XLinkResetRemoteTimeout(2) mxid(1844301031490A1300) failed to find device in 10s
[2023-04-26 12:31:48.352] [warning] Reboot after XLinkResetRemoteTimeout(3) mxid(14442C10C1A1C6D200) failed to find device in 10s
[2023-04-26 12:31:48.434] [warning] Reboot after XLinkResetRemoteTimeout(1) mxid(18443010C17F0C1300) failed to find device in 10s
MXID: 1844301031490A1300
MXID: 18443010C17F0C1300
MXID: 14442C10C1A1C6D200
[2023-04-26 12:32:00.295] [warning] Reboot after XLinkResetRemoteTimeout(3) mxid(1844301031490A1300) failed to find device in 10s
[2023-04-26 12:32:00.314] [warning] Reboot after XLinkResetRemoteTimeout(4) mxid(18443010C17F0C1300) failed to find device in 10s
[2023-04-26 12:32:00.788] [warning] Reboot after XLinkResetRemoteTimeout(4) mxid(14442C10C1A1C6D200) failed to find device in 10s
===============================================================================
test cases:  1 |  0 passed | 1 failed
assertions: 11 | 10 passed | 1 failed

In that single run, I see many unexpected things:

  • 9 times the devices were rebooted yet couldn't be found within 10 seconds. Why are these devices taking so long? I also set it to 15s and I still get failures to find them.
  • 1 time the OAK hardware failed to execute an RPC call. I can't find any host code that has the text Error executing RPC call, so I think the OAK device received a message on the RPC named stream, tried to unpack it, and it was invalid. This should not happen; it suggests to me a bug somewhere between nanorpc on the host and nanorpc on the device.

I have also run the test with XLINK_LEVEL=debug and often see libusb errors about permission problems, which is unexpected on my Windows setup since I remember permission issues being a udev (Linux) matter.

...
D: [global] [         0] [ThreadN] getLibusbDeviceMxId:326      libusb_open: Access denied (insufficient permissions) 
D: [global] [         0] [ThreadN] getUSBDevices:171    getLibusbDeviceMxId returned: Access denied (insufficient permissions)

...

@themarpe (Collaborator) commented May 2, 2023

Thanks a bunch for all the exploration @diablodale (sorry for the delay)

That's my reasoning from the code...which I can definitely be wrong about 😅. I can't align my reasoning with why you want the underlying XLinkStream of rpcClient to be closed...and why you believe it puts "the device in an odd state".

This seems to be what resolved the issue in the initial commit - though it might not be the actual culprit. More testing required - I'm not actually sure either.

XLink dispatcher for this connection has transitioned from a good state to a bad state that can't write

Most likely the case - the RPCs themselves likely aren't the culprit.

The rpc to printAssets...can we remove that? Or make it debug only?

Yes - we can remove this outright.

95% sure I found a somewhat relevant bug here
!found should not be there. It errantly causes early exit of the while loop when anything in any state is found.

Good catch - yes, that would in fact exit too soon. Shall we add it here: https://github.com/luxonis/depthai-core/pull/809/files#diff-17c7352c03ae1d3bb07f3e7a3c1b6f98312f549a440e186f450e3a8f563e3121L338-L340 with else found = false?
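
A hedged reconstruction of the intended loop (helper and type names are paraphrased stand-ins, not copied from XLinkConnection.cpp), with the suggested else found = false applied:

#include <chrono>
#include <functional>
#include <string>

enum class DevState { Unbooted, Bootloader, Booted, FlashBooted };  // stand-in for XLinkDeviceState_t
struct DeviceDesc { std::string mxid; DevState state; };            // stand-in for deviceDesc_t

// 'search' stands in for the XLink device-search call used by the real code.
bool waitForDeviceState(const std::string& expectedMxid,
                        DevState desiredState,
                        std::chrono::seconds BOOTUP_SEARCH,
                        const std::function<bool(const std::string&, DeviceDesc&)>& search) {
    using std::chrono::steady_clock;
    bool found = false;
    DeviceDesc desc{};
    const auto t1 = steady_clock::now();
    do {
        found = search(expectedMxid, desc);
        if(found && desc.state == desiredState) {
            break;              // device reappeared in the state we want
        } else {
            found = false;      // suggested fix: otherwise `!found` below ends the loop
        }                       // as soon as the device shows up in *any* state
    } while(!found && steady_clock::now() - t1 < BOOTUP_SEARCH);
    return found;
}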

I am assuming that X_LINK_BOOTLOADER is the "at rest" state of a USB device like OAK-D and OAK-D-Lite after it is plugged into a computer's USB port and letting the OAK device power on and load its internally-flashed bootloader. It appears in the OS as Movidius MyriadX and USB\VID_03E7&PID_2485&REV_0001

That would be the X_LINK_UNBOOTED state. X_LINK_BOOTLOADER would be for cases where the device presents as Luxonis Bootloader (the former is the ROM BL / 1st stage, the latter is the NOR-flashed custom BL / 2nd stage).
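
A small sketch of how those states can be observed from the host, assuming the depthai API of this era (dai::Device::getAllAvailableDevices() and DeviceInfo::state):

#include <iostream>
#include <depthai/depthai.hpp>

int main() {
    for(const auto& info : dai::Device::getAllAvailableDevices()) {
        // X_LINK_UNBOOTED   -> ROM 1st-stage BL ("Movidius MyriadX", 0x03E7:0x2485)
        // X_LINK_BOOTLOADER -> flashed 2nd-stage BL ("Luxonis Bootloader", 0x03E7:0xf63c)
        std::cout << info.getMxId() << " state=" << static_cast<int>(info.state) << std::endl;
    }
    return 0;
}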

I have also run the test with XLINK_LEVEL=debug and see often libusb errors about permission problems. Which on my Windows setup is unexpected as I remember permissions being related to UDEV

I think this relates to the device being in a state where comms aren't possible. Though I'm not 100% sure what the underlying cause of this libusb error message on Windows is.

@diablodale (Contributor, Author) commented

Unfortunately, I don't think I can make progress on the connection issue(s) because I don't have visibility into the OAK device and its firmware. There is not enough public knowledge about its state changes, internal logic, etc.

  • All my debugging to date points to hardware problems in state transitions: either libusb code failing to transition states in a timely manner, or the OAK device failing to transition states promptly. This is backed by my repeated test-failure evidence and logs showing that even after 15 seconds OAK devices cannot be seen by the libusb-xlink code (meaning libusb-xlink, Windows, or the OAK has failed to work as expected). If it doesn't work on USB, it is probably even worse with PoE. I recommend focusing first on USB and getting it working. My expanded test case in this PR is an easy repro to expose the bug.
  • To isolate, I attempted to use XLink from depthai-core v2.15.5 (the last XLink with winusb). I failed because the XLink interface had large changes with winusb->libusb, breaking both the XLink API and its behavior. Getting current depthai-core to work with the winusb XLink was more than a half-day of changes (I stopped there).
  • Do you have any docs or flow charts that document the state changes for OAK devices? I can find none in code or in the online depthai docs. At a minimum it would have circles for each XLinkDeviceState_t and arrows showing all possible transitions and how they occur. For example, every circle except the first would have an arrow for kill -9 on the app that goes back to the beginning. I suspect some of those arrows are not coded correctly in today's depthai+xlink+firmware. Such a state diagram would be used for all XLink coding and test case creation.

If you want, I can clean up, or close this and open a new PR, to include the one bug I found plus all the cleanup work. It doesn't fix the connection problems -- instead it is forward movement toward a cleaner build, more analyzers/warnings enabled, etc.

@diablodale force-pushed the device-tests-close-threads branch 2 times, most recently from de21468 to 6ea13e0 on May 11, 2023 18:17
@diablodale (Contributor, Author) commented May 11, 2023

@themarpe I've cleaned up the commits in this PR. It has some fixes that are outside the main topic. If wanted, I can break them out into several separate PRs.

[EDIT] The last commit will be dropped before merge. It contains additional debug/warn logging and a supporting API (which itself could be included in production code, with the API dropped, if the Logger class can always show the mxid).

Meanwhile, below is my first attempt at a mermaid diagram of the XLink states and their basic transitions. Are there any corrections that need to be made? In this 1st diagram, I haven't added reboots, failures, etc.

flowchart TD

DEVICE_TYPE{USB device?}
STANDALONE{Standalone<br/>flashed device?}

X_LINK_UNBOOTED[[X_LINK_UNBOOTED<br/>'Movidius MyriadX' 0x03E7, 0x2485]]
X_LINK_BOOTLOADER[[X_LINK_BOOTLOADER<br/>'Luxonis Bootloader' 0x03E7, 0xf63c]]
X_LINK_BOOTED[[X_LINK_BOOTED<br/>'Luxonis Device' 0x03E7, 0xf63b]]
X_LINK_FLASH_BOOTED[[X_LINK_FLASH_BOOTED<br/>0x03E7, 0xf63d]]

HOST_CREATE_DEVICE["host app: Device(pipeline)"]
HOST_BOOTLOADER["XLinkBootMemory(firmwareBin)"]
HOST_RPC_READ["rpc(readPipelineCodeFromXLink)"]
HOST_RPC_PIPELINE["rpc(build) + rpc(startPipeline)"]

STANDALONE_BOOTLOADER["device boots its bootloader from flash"]
STANDALONE_READ["device reads its pipeline from flash"]
STANDALONE_PIPELINE["device builds+starts its pipeline"]

DEVICE_TYPE -->|yes| X_LINK_UNBOOTED
X_LINK_UNBOOTED --> STANDALONE

STANDALONE -->|no| HOST_CREATE_DEVICE
STANDALONE -.->|yes| STANDALONE_BOOTLOADER

STANDALONE_BOOTLOADER -.-> X_LINK_BOOTLOADER
X_LINK_BOOTLOADER -.-> STANDALONE_READ
STANDALONE_READ -.-> STANDALONE_PIPELINE
STANDALONE_PIPELINE -.-> X_LINK_FLASH_BOOTED

HOST_CREATE_DEVICE --> HOST_BOOTLOADER
HOST_BOOTLOADER --> X_LINK_BOOTLOADER
X_LINK_BOOTLOADER -->HOST_RPC_READ
HOST_RPC_READ --> HOST_RPC_PIPELINE
HOST_RPC_PIPELINE --> X_LINK_BOOTED

    subgraph "host sdk code: Device(pipeline)"
        HOST_BOOTLOADER
        HOST_RPC_READ
        HOST_RPC_PIPELINE
    end

    subgraph "Automatic standalone flash behavior"
        STANDALONE_BOOTLOADER
        STANDALONE_READ
        STANDALONE_PIPELINE
    end


@diablodale marked this pull request as ready for review on May 11, 2023 18:25
@themarpe (Collaborator) commented

@diablodale
The following diagram is how the boot process flows. The ROM BL decides (based on HW + NOR contents) whether or not it boots the BL (if it's present; some devices don't have NOR and can only ever boot directly from UNBOOTED->BOOTED).

Each of these transitions is a USB down->up event, as re-enumeration is done.

One extra case: if you call DeviceBootloader, it boots into the X_LINK_BOOTLOADER state if it was UNBOOTED before. Otherwise it just connects to the running Bootloader.

(and neat thing, this mermaid flowchart:) )

flowchart TD

DEVICE_TYPE{USB device}

X_LINK_UNBOOTED[[X_LINK_UNBOOTED<br/>'Movidius MyriadX' 0x03E7, 0x2485]]
X_LINK_BOOTLOADER[[X_LINK_BOOTLOADER<br/>'Luxonis Bootloader' 0x03E7, 0xf63c]]
X_LINK_BOOTED[[X_LINK_BOOTED<br/>'Luxonis Device' 0x03E7, 0xf63b]]
X_LINK_FLASH_BOOTED[[X_LINK_FLASH_BOOTED<br/>0x03E7, 0xf63d]]

HOST_XLINK_BOOT["XLinkBootMemory(firmwareBin)"]
HOST_RPC_READ["rpc(readPipelineCodeFromXLink)"]
HOST_RPC_PIPELINE["rpc(build) + rpc(startPipeline)"]

HOST_BOOTLOADER_BOOT["DeviceBootloader::bootMemory(firmwareBin)"]

STANDALONE_READ["device reads its pipeline from flash"]
STANDALONE_PIPELINE["device builds+starts its pipeline"]


DEVICE_TYPE -->|no BL or BOOT btn is pressed| X_LINK_UNBOOTED
DEVICE_TYPE -->|has BL| X_LINK_BOOTLOADER

X_LINK_UNBOOTED --> HOST_XLINK_BOOT

X_LINK_BOOTLOADER -->|standalone flashed| X_LINK_FLASH_BOOTED
X_LINK_FLASH_BOOTED -.-> STANDALONE_READ
STANDALONE_READ -.-> STANDALONE_PIPELINE
X_LINK_BOOTLOADER -->|normal| HOST_BOOTLOADER_BOOT
HOST_BOOTLOADER_BOOT --> X_LINK_BOOTED

HOST_XLINK_BOOT --> X_LINK_BOOTED

X_LINK_BOOTED --> HOST_RPC_READ
HOST_RPC_READ --> HOST_RPC_PIPELINE

subgraph "host app code: Device(pipeline) [after boot part]"
    X_LINK_BOOTED
    HOST_RPC_READ
    HOST_RPC_PIPELINE
end

subgraph "Automatic standalone flash behavior"
    X_LINK_FLASH_BOOTED
    STANDALONE_READ
    STANDALONE_PIPELINE
end


@themarpe (Collaborator) commented

If you want, I can clean up, or close this and open a new PR, to include the one bug I found plus all the cleanup work. It doesn't fix the connection problems -- instead it is forward movement toward a cleaner build, more analyzers/warnings enabled, etc.

You may include it here

@themarpe (Collaborator) left a review comment

Changes LGTM at current state

src/xlink/XLinkConnection.cpp (review comment resolved)
@diablodale (Contributor, Author) commented May 16, 2023

The first CI issue is that inline is implicit with constexpr only in C++17, so the definitions I removed need to be put back. I'll fix that and push.

[EDIT] I rebased the fix into the code cleanup commit and force pushed

[EDIT2] I was unable to find any flag for MSVC that would make it fail to compile the way GNU does. I tried both /permissive- and /Zc:externConstexpr. MSVC with VS2019 continued to compile the non-compliant C++14 code. 🤷‍♀️
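
For context, an illustration of the underlying C++ rule (the member name is made up for the example, not the real BOOTUP_SEARCH): a static constexpr data member is implicitly inline only since C++17; under C++14 an ODR-used member still needs an out-of-class definition, which GCC reports as an undefined reference while MSVC (VS2019) accepted the code anyway.

#include <cstdio>

struct Config {
    static constexpr int BOOTUP_SEARCH_SEC = 5;   // illustrative member only
};

// Needed in exactly one .cpp when building as C++14 (constexpr members are not
// implicitly inline there); redundant, and deprecated, under C++17.
constexpr int Config::BOOTUP_SEARCH_SEC;

int main() {
    const int* p = &Config::BOOTUP_SEARCH_SEC;    // taking the address ODR-uses the member
    std::printf("%d\n", *p);
    return 0;
}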

@diablodale (Contributor, Author) commented

I don't see any "code" faults. Woohoo 😃

The CI thinks there are failures because:

  1. the extra debug output contains the word "warning", which is detected by the ctest regex and fails the test
  2. the multiple_devices_test case 2 REQUIREs 2+ hardware devices to be attached and fails otherwise. The only machine that ran it in CI was a Windows machine which appears to have only one device attached.

I need to know what are next steps / answers to questions.

  1. What do you want to do about only 1 OAK on the test machine? Should that case be marked/filtered using flags? Or what do you want?
  2. Is your team good with this PR's state so that I can remove the extra debug and push?

@themarpe (Collaborator) commented May 26, 2023

What do you want to do about only 1 OAK on the test machine? Should that case be marked/filtered using flags? Or what do you want?

IMO best if we tag it (like PoE/USB) and skip that tag by default for now (as we don't have HIL in this repo that handles this scenario ATM - we'll add it afterwards).

Is your team good with this PR's state so that I can remove the extra debug and push?

LGTM, thanks!

- multiple devices test case is tagged so that it will not run with the default test list, using the [.] syntax
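
A sketch of the Catch2 tagging described in that commit (the exact tag name used in the PR is an assumption): the leading [.] hides the test from the default run, so it executes only when asked for explicitly by name or tag.

#include <catch2/catch_test_macros.hpp>

// Hidden by "[.]": not part of the default test list; run it explicitly, e.g.
//   multiple_devices_test.exe "[multi-device]"
TEST_CASE("Multiple devices created and destroyed in parallel", "[.][multi-device]") {
    // ... requires 2+ attached OAK devices ...
}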
@diablodale (Contributor, Author) commented

Force-pushed. I removed the last commit that had the extra debug in it -- replaced it with a small commit that clarified some comments and debug log messages.

Major concern

The majority (50+%) of OAK USB devices do not reboot fast enough with the current depthai-core codebase on Windows.
PoE is unknown. Other OSes are unknown.

constexpr auto BOOTUP_SEARCH = seconds(5);

With this PR's new test case with multiple devices, more than half of the USB device reboots do not complete within the 5-second BOOTUP_SEARCH window. This easily cascades into failed connections when depthai-core attempts to boot and upload firmware to the device. When depthai-core attempts this after-reboot activity, some part of the USB device accepts a connection (Myriad firmware?) but that connection seems to be brain-dead or not functioning correctly. This series of states causes various levels of depthai-core errors...some of which may cause exceptions to be thrown. The app will crash if these exceptions are not caught with a try/catch around the Device() constructor.
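
A defensive application-level sketch (not part of this PR; the helper name is made up) that retries Device construction with a deadline, since a device that has not finished rebooting can cause the constructor to throw:

#include <chrono>
#include <iostream>
#include <memory>
#include <thread>
#include <depthai/depthai.hpp>

std::shared_ptr<dai::Device> openWithRetry(const dai::Pipeline& pipeline,
                                           std::chrono::seconds deadline) {
    const auto start = std::chrono::steady_clock::now();
    while(true) {
        try {
            return std::make_shared<dai::Device>(pipeline);   // may throw while device is mid-reboot
        } catch(const std::exception& e) {
            if(std::chrono::steady_clock::now() - start >= deadline) throw;  // give up at the deadline
            std::cerr << "Device() failed (" << e.what() << "), retrying..." << std::endl;
            std::this_thread::sleep_for(std::chrono::seconds(1));
        }
    }
}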

I extended BOOTUP_SEARCH to 10 seconds. Some reboots still did not complete within 10 seconds; I estimate 25-33% of them failed. Eventually, all 3 of my USB devices reboot, yet the time needed can be many seconds beyond 10.

I recommend that BOOTUP_SEARCH be extended, or new reboot logic be added to depthai-core, to accommodate the time needed to reboot. While new firmware/bootloader may help, it doesn't help all the devices which are not flashed with the new firmware.

It is not useful for depthai-core to start uploading firmware to a device when it hasn't completely rebooted, yet the existing codebase does this. The code might have been written to accommodate the scenario of a device being unplugged during/after a reboot...and not wanting that user to wait more than 5 seconds. However, this choice leads to the majority of USB reboots failing, which doesn't seem a wise trade-off given the failure scenarios I describe above.

Separately, there is the concern of a 10+ second reboot for a USB device. This is a topic to be explored by the Luxonis team, who have access to the firmware and multiple host OS+hardware combinations.

jakgra pushed a commit that referenced this pull request Mar 20, 2024