Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HPCC-30993 Fix file access for index read activity #18144

Merged
merged 1 commit into from
Feb 2, 2024

Conversation

shamser
Copy link
Contributor

@shamser shamser commented Dec 14, 2023

Type of change:

  • This change is a bug fix (non-breaking change which fixes an issue).
  • This change is a new feature (non-breaking change which adds functionality).
  • This change improves the code (refactor or other change that does not change the functionality)
  • This change fixes warnings (the fix does not alter the functionality or the generated code)
  • This change is a breaking change (fix or feature that will cause existing behavior to change).
  • This change alters the query API (existing queries will have to be recompiled)

Checklist:

  • My code follows the code style of this project.
    • My code does not create any new warnings from compiler, build system, or lint.
  • The commit message is properly formatted and free of typos.
    • The commit message title makes sense in a changelog, by itself.
    • The commit is signed.
  • My change requires a change to the documentation.
    • I have updated the documentation accordingly, or...
    • I have created a JIRA ticket to update the documentation.
    • Any new interfaces or exported functions are appropriately commented.
  • I have read the CONTRIBUTORS document.
  • The change has been fully tested:
    • I have added tests to cover my changes.
    • All new and existing tests passed.
    • I have checked that this change does not introduce memory leaks.
    • I have used Valgrind or similar tools to check for potential issues.
  • I have given due consideration to all of the following potential concerns:
    • Scalability
    • Performance
    • Security
    • Thread-safety
    • Cloud-compatibility
    • Premature optimization
    • Existing deployed queries will not be broken
    • This change fixes the problem, not just the symptom
    • The target branch of this pull request is appropriate for such a change.
  • There are no similar instances of the same problem that should be addressed
    • I have addressed them here
    • I have raised JIRA issues to address them separately
  • This is a user interface / front-end modification
    • I have tested my changes in multiple modern browsers
    • The component(s) render as expected

Smoketest:

  • Send notifications about my Pull Request position in Smoketest queue.
  • Test my draft Pull Request.

Testing:

@@ -98,7 +98,7 @@ class CIndexReadSlaveBase : public CSlaveActivity
if (!keyManager)
throw MakeActivityException(&activity, 0, "Callback attempting to read blob with no key manager - index being read remotely?");
needsBlobCleaning = true;
return (byte *) keyManager->loadBlob(id, dummy, &activity.contextLogger);
return (byte *) keyManager->loadBlob(id, dummy, nullptr);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blob stats will be address in this ticket: https://track.hpccsystems.com/browse/HPCC-31002

{
IPartDescriptor &part = partDescs.item(p++);
IPartDescriptor &part = partDescs.item(p);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes in lines 169-171 is so that the correct contextLogger can be referenced with [p]. If p++ remains here then contextLoggers[p] cannot be used in line 306.

@@ -273,7 +272,7 @@ class CIndexReadSlaveBase : public CSlaveActivity
continue; // try next copy and ultimately failover to local when no more copies
}
ActPrintLog("[part=%d]: reading remote dafilesrv index '%s' (logical file = %s)", partNum, path.str(), logicalFilename.get());
partNum = p;
partNum = (p+1);
Copy link
Contributor Author

@shamser shamser Dec 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change relates to: 55ef2b9#r1426670630

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sensible change. Not new, but might be good to add a comment to clarify this is returning next part #, e.g.:

partNum = p+1; // NB: returning next part number to process.

keyManager = klManager;
partNum = p;
partNum = (p+1);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change relates to: 55ef2b9#r1426670630

return createIndexLookup(keyManager);
}
}
keyMergerManager.setown(createKeyMerger(helper->queryDiskRecordSize()->queryRecordAccessor(true), keyIndexSet, seekGEOffset, &contextLogger, helper->hasNewSegmentMonitors(), false));
//Not tracking jhtree cache stats in KeyMerger at the moment. Future: something to consider
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Jira has been created to address KeyMerger stats: https://track.hpccsystems.com/browse/HPCC-30999

updateStats();
lazyIFileIO.clear();
}
lazyIFileIO.clear();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The stats gathering has been moved to gatherActiveStats so that the stats are updated periodically whilst the activity is running.

@shamser shamser requested a review from jakesmith December 14, 2023 12:47
@shamser shamser marked this pull request as ready for review December 14, 2023 12:47
Copy link
Member

@jakesmith jakesmith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shamser - there are some merge clashes, looks like it needs rebasing.

Could you also update commit and PR message and JIRA description with some comments describing what the issue this is fixing and how. Thanks.

@shamser shamser force-pushed the issue30993 branch 4 times, most recently from 961044a to 650c837 Compare December 21, 2023 14:35
@shamser shamser requested a review from jakesmith December 21, 2023 14:36
Copy link
Member

@jakesmith jakesmith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shamser - there are some merge clashes, looks like it needs rebasing.

Could you also update commit and PR message and JIRA description with some comments describing what the issue this is fixing and how. Thanks.

@shamser - also, the PR is failing (crashing by the looks of it) some regression tests.

Copy link
Member

@jakesmith jakesmith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you also update commit and PR message and JIRA description with some comments describing what the issue this is fixing and how. Thanks.

am not seeing an updated commit + PR message, and JIRA description ?

@shamser
Copy link
Contributor Author

shamser commented Dec 22, 2023

Serialize jhtree and file io stats to thor manager.
Record jhtree and file io stats for Index Read Activity (to activity scope)

(Correct cost calculation for index reads require jhtree stats, so with
this change the file access cost for index read should be correct.)

Note: this change is dependent on HPCC-31035 Move ctx stats merging to caller from CKeyLevelManager mergeStats

@shamser shamser requested a review from jakesmith December 22, 2023 15:55
Copy link
Member

@jakesmith jakesmith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shamser - looks good, some minor comment/questions.

@@ -719,10 +710,66 @@ class CIndexReadSlaveBase : public CSlaveActivity
virtual void gatherActiveStats(CRuntimeStatisticCollection &activeStats) const
{
PARENT::gatherActiveStats(activeStats);
// *) jhtree stats will have been tracked as follows
// - For superfiles, there will be 1 IContextLogger in contextLoggers[] for each subfile
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lines 671-673 is creating 1 IContextLogger per physical part not per subfile,
which gatherActiveStats is depending on as it iterates through parts and does:

                        fileStatItem->merge(contextLoggers[partNum]->queryStats());

You could create only 1 IContextLogger per subfile, and then use mapSubPart to map the part# -> subfile# to reference contextLoggers[subfile]

Not sure worth it, but comment should be changed if not.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could there be a very large number of parts? If that is the case, do you think the performance would be impacted enough to make the change worthwhile?

I'd like leave as it is unless it is really worthwhile. Although change looks trivial, it will require significant time consuming re-testing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could be quite large, e.g. a superfile of 50+ files of 400x files, but each workers only gets a slice of those, and these CStatsContextLogger objects aren't that big.
I would merge as is, and perhaps open a low-prio. JIRA to revisit it on a rainy day.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've created a jira to revisit this: https://track.hpccsystems.com/browse/HPCC-31103

@@ -273,7 +272,7 @@ class CIndexReadSlaveBase : public CSlaveActivity
continue; // try next copy and ultimately failover to local when no more copies
}
ActPrintLog("[part=%d]: reading remote dafilesrv index '%s' (logical file = %s)", partNum, path.str(), logicalFilename.get());
partNum = p;
partNum = (p+1);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sensible change. Not new, but might be good to add a comment to clarify this is returning next part #, e.g.:

partNum = p+1; // NB: returning next part number to process.

IKeyManager * keyManager;
{
CriticalBlock b(keyManagersCS);
if (!keyManagers.isItem(partNum))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should there be a test (earlier) to exclude (and later handle differently) the merger case?
The test here may mean it passed for part 1, and bails out for part 2+ in the merge case, which will only have 1 IKeyManager in keyManagers.

Copy link
Contributor Author

@shamser shamser Jan 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would merging stats even when merger is used be a problem? The merger was going to handled separately with https://track.hpccsystems.com/browse/HPCC-30999

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it will cause exceptions, and it looks like stats will be correct in the non-super case.
But not clear to the reader what will happen in localMerge=true case at the moment.
I think worth a comment block that explains current caveat and that it needs refactoring, in if (localMerge) case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment added line 728-730.

@shamser shamser requested a review from jakesmith January 10, 2024 13:32
@shamser
Copy link
Contributor Author

shamser commented Jan 10, 2024

@jakesmith I've push changes with comment changes. I may need to make further changes in response to my clarification request on your comments.

Copy link
Member

@jakesmith jakesmith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shamser - looks good, if you could add a comment in gatherActiveStats re. localMerge caveat, and then go ahead and squash.

jakesmith
jakesmith previously approved these changes Jan 12, 2024
Copy link
Member

@jakesmith jakesmith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shamser - looks good, please squash

@jakesmith jakesmith self-requested a review January 12, 2024 11:12
Copy link
Member

@jakesmith jakesmith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shamser - re-tagged myself for review (although approved), so that #18162 can be merged 1st, then this rebased.
Please tag me when that is merged and this PR is rebased.

@jakesmith jakesmith dismissed their stale review January 12, 2024 11:14

@shamser - tagging myself for review (although approved), so that #18162 can be merged 1st, then this rebased. Please tag me when that is merged and this PR is rebased.

Serialize jhtree and file io stats to thor manager.
Record jhtree and file io stats for Index Read Activity (to activity scope)

(Correct cost calculation for index reads require jhtree stats, so with
this change the file access cost for index read should be correct.)

Signed-off-by: Shamser Ahmed <[email protected]>
@shamser
Copy link
Contributor Author

shamser commented Jan 16, 2024

@jakesmith

Copy link
Member

@jakesmith jakesmith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shamser - looks good.

Copy link
Member

@ghalliday ghalliday left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shamser the changes look good - I didn't follow all the details.
A couple of questions which would help clarify the code, but happy to merge as-is.

{
CriticalBlock b(keyManagersCS);
if (!keyManagers.isItem(partNum))
continue;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shamser could this break rather than continuing? partNum will only get larger, and the test above will always fail. (If something else is adding them at the same time I guess it could succeed, but will be blank.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a critical section to protect against changes to KeyManagers.

// - the io stats will be tracked in keyManagers[]
// - (There will be exactly 1 keyManager per part)
// Note: the KeyManagers[], contextLoggers[] and fileStats[] are not necessarilly the same size
// - KeyManagers[] will always be the same size as partDescs[]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment doesn't seem to be quite true, since line 742 is testing it is true. Worth adding a comment where when it might not be true.

@ghalliday ghalliday merged commit 9391425 into hpcc-systems:candidate-9.4.x Feb 2, 2024
36 of 38 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants