Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimistic MIN_SLEEP_DURATION causes a problem #513

Closed
byeonggiljun opened this issue Jan 28, 2025 · 5 comments · Fixed by #514 · May be fixed by #403
Closed

Optimistic MIN_SLEEP_DURATION causes a problem #513

byeonggiljun opened this issue Jan 28, 2025 · 5 comments · Fixed by #514 · May be fixed by #403

Comments

@byeonggiljun
Copy link
Collaborator

MIN_SLEEP_DURATION causes a flaky error in DecentralizedP2PUnbalancedTimeoutPhysical.lf (see this CI test result).

In the below logs, you can see that the federate doesn't wait until the physical time of 1 msec because the waiting time is less than MIN_SLEEP_DURATION (10 usec). However, this leads to advancing the current tag to 1 msec before the physical time exceeds 1 msec (994.0 usec). However, a message came through the physical connection, was scheduled at 996.6 usec, and invoked a fatal error.

DEBUG: -------- Waiting until physical time 1000000.
DEBUG: Wait time 6608 is less than MIN_SLEEP_DURATION 10000. Skipping wait.
DEBUG: Waiting is not interrupted. Current physical time is 993743.
DEBUG: Advanced (elapsed) tag to (1000000, 0) at physical time 993983.
Fed 1 (d): FATAL ERROR: DEBUG: get_next_event_tag(): Earliest event on the event queue ((996593, 0)) is earlier than the current tag ((1000000, 0)).

@edwardalee Do you think we can fix this by simply decreasing the amount of MIN_SLEEP_DURATION? Then, what should be the new value?

@erlingrj
Copy link
Collaborator

erlingrj commented Jan 28, 2025

I think we should remove MIN_SLEEP_DURATION, it was a misguided optimization.

@erlingrj
Copy link
Collaborator

Actually, it is also used in update_ports_from_staa_offsets in federate.c, I am not sure about the exact function it serves there. But the usage in wait_until is not needed

@byeonggiljun
Copy link
Collaborator Author

I agree. But I found the PR #403 related to MIN_SLEEP_DURATION. This PR will fix the issue because it adds a busy wait when the wait duration is less than MIN_SLEEP_DURATION. So, why don't we merge that PR?

@edwardalee
Copy link
Contributor

Wow, great detective work! The goal of MIN_SLEEP_DURATION and #403 was to reduce lag. I'm wondering if this bug could also be the cause for the watchdog failures.

@erlingrj
Copy link
Collaborator

I'm wondering if this bug could also be the cause for the watchdog failures.

Perhaps. We could try enabling the CI for watchdogs after merging these fixes!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants