Fix initialization of DataStorm samples after session recovery #3294

Open
wants to merge 7 commits into
base: main

Member

@pepone pepone commented Dec 19, 2024

Fix #3056

Member

@bernardnormier bernardnormier left a comment

It's not clear to me where the fix is.

sample.value,
sample.timestamp));
assert(samplesI.back()->key);
return {};
Member

In the original code, when samples.empty(), we set elementSubscriber->lastId to 0.

It's not immediately clear why we don't need that. Is this lastId already 0 for some other reason?

Member Author

I added a comment to explain the logic of lastId.

lastId is default initialized to 0 in Session.h

// The ID of the last processed sample.
std::int64_t lastId{0};

Member

Ok, and the fix in this PR is exactly that: to not set lastId to 0 when samples is empty?

Member Author

@pepone pepone Dec 20, 2024

The fix is to not reset lastId to 0 when the subscriber is initialized after a session recovery.

The subscriber receives some samples and lastId is updated accordingly.

Then the session is lost; when it reconnects, subscriberInitialized is called again.

If the new call carries no samples, because there were no new samples since the recovery, the previous code reset lastId to 0. That is the bug.

If the session is then lost again, the next recovery tells the peer that the last ID it saw is 0, and the peer resends all queued elements. That is what was happening in the test failure.
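
For readers following along, here is a minimal sketch of the sequence described above. It is not the DataStorm implementation: `Sample`, `Subscriber`, `initializeBuggy`, `initializeFixed`, and `resendCount` are hypothetical names, and the resend arithmetic assumes the peer queues samples with IDs 1 through N. Only the `lastId` member and its default value of 0 come from this thread.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a received sample; only its ID matters here.
struct Sample
{
    std::int64_t id;
};

// Hypothetical stand-in for the per-element subscriber state.
struct Subscriber
{
    // The ID of the last processed sample.
    std::int64_t lastId{0};
};

// Behavior described as the bug: an empty batch resets lastId to 0.
void initializeBuggy(Subscriber& subscriber, const std::vector<Sample>& samples)
{
    subscriber.lastId = samples.empty() ? 0 : samples.back().id;
}

// Behavior described as the fix: an empty batch leaves lastId untouched.
void initializeFixed(Subscriber& subscriber, const std::vector<Sample>& samples)
{
    if (!samples.empty())
    {
        subscriber.lastId = samples.back().id;
    }
}

// Assumption: the peer queues samples with IDs 1..queued and, on recovery,
// resends every sample whose ID is greater than the reported lastId.
std::int64_t resendCount(std::int64_t queued, std::int64_t lastId)
{
    return queued - lastId;
}
```

Under these assumptions, a recovery that delivers no new samples leaves the fixed subscriber at its previous lastId, so a second recovery asks for nothing it has already seen, while the buggy variant reports 0 and gets the whole queue again.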

Member Author

Pushed an additional test that allows reproducing the initial issue.

cpp/test/DataStorm/reliability/Reader.cpp
while (!connection)
{
this_thread::sleep_for(chrono::milliseconds(10));
connection = node.getSessionConnection(session);
Member

This function sometimes returns nullptr?

Member Author

Yes, it returns nullptr when the session is disconnected.

Member

Why is the while loop required?

Member Author

I think we might be able to remove it. The idea was that the session might still be recovering from a previously closed connection, but here it seems there is always a connection.

cpp/test/DataStorm/reliability/Reader.cpp (outdated)

// Session was reestablished, close it again
connection = node.getSessionConnection(session);
while (!connection)
Member

Same question here. Shouldn't the connection exist?

Member Author

Yes, fixed.

cpp/test/DataStorm/reliability/Reader.cpp (outdated)
cpp/test/DataStorm/reliability/Writer.cpp (outdated)
@bernardnormier bernardnormier self-requested a review December 20, 2024 15:45
@pepone pepone requested a review from externl December 20, 2024 15:47
Successfully merging this pull request may close these issues.

DataStorm/reliability hang (macos, debug)
3 participants