Skip to content

HIVE-29012: NPE in ExplainTask when protologging posthook is enabled #5872

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

Aggarwal-Raghav
Copy link
Contributor

What changes were proposed in this pull request?

Check HIVE-29012 for the stacktrace.

Why are the changes needed?

When protologging hook is enabled, it stores the query plan for a query as well. The conf object in ExplainTask is null for the "Hive Hook Proto Log Writer" thread.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Will see CI output + local single node setup/cluster testing

@Aggarwal-Raghav
Copy link
Contributor Author

Based on the debugging, the Task#initialize() method creates conf object in HiveServer2-Handler-Pool thread and when the post hook for protologging is triggered, it created a new thread (Hive Hook Proto Log Writer) which get the query plan and at that time, ExplainTask#outputPlan() has conf as null which causes NPE.

final int limit = conf.getIntVar(ConfVars.HIVE_EXPLAIN_NODE_VISIT_LIMIT);

@Aggarwal-Raghav
Copy link
Contributor Author

@zabetak, can you please help with the review as the changes are around HIVE-22173's code.

Copy link

@Aggarwal-Raghav Aggarwal-Raghav changed the title [WIP] HIVE-29012: NPE in ExplainTash when protologging posthook is enabled HIVE-29012: NPE in ExplainTash when protologging posthook is enabled Jun 16, 2025
Copy link
Member

@zabetak zabetak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR @Aggarwal-Raghav ! I left some questions/suggestions for the proposed fix.

For completeness, it would be nice to also add a unit test possibly somewhere in TestHiveProtoLoggingHook or TestHiveHookEventProtoPartialBuilder. Some fixes are trivial but writting also tests about them increases test coverage and guards against future regressions.

Comment on lines +1006 to +1008
if (conf == null && this.work != null && this.work.getFetchTask() != null) {
conf = this.work.getFetchTask().getConf();
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The initial NPE may be avoided but it is not guaranteed that this.work.getFetchTask().getConf() will be not null so technically the code could still hit NPE. Furthermore, I see at least 9 references to conf inside this class that do not account for the nullability of the field so fixing one of those seems somewhat incomplete.

Moreover, this kind of fallback strategy to obtain a conf object from the internal fetch task else does not seem to be a widespread pattern so not sure what's the reasoning behind it.

Since the problem seems to be triggered mainly by HiveProtoLoggingHook I was wondering if we should rather ensure that the respective class passes a non-null configuration object even if that is the default one (new HiveConf()).

Another alternative would be to do something like the following:

final int limit = conf == null ? ConfVars.HIVE_EXPLAIN_NODE_VISIT_LIMIT.defaultIntVal
          : conf.getIntVar(ConfVars.HIVE_EXPLAIN_NODE_VISIT_LIMIT);

that definitely prevents the NPE no matter what, but it is still a localized fix.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review and suggestions @zabetak and will try to come up with a UT as well.

My understanding:

  1. I also though of passing the non-null config from HiveProtoLoggingHook class but it would require some method parameter/argument changes across classes and might need additional check if prologging is disabled. So didn't gave a try 🙁.
  2. The reason behind using this.work.getFetchTask().getConf() is that I wanted to use the conf which was present in ExplainTask without propogating it from other classes.
  3. As you said that there are 9 references of conf but while debugging only ExplainTask#outputPlan() having conf.getIntVar(ConfVars.HIVE_EXPLAIN_NODE_VISIT_LIMIT) is getting triggered. So I think handling, the localized place should be enough.
  4. Using ConfVars.HIVE_EXPLAIN_NODE_VISIT_LIMIT.defaultIntVal, as you suggested, can work as using the default value shouldn't have any impact on OtherInfoType.CONF. Will test and confirm.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding 1., I was thinking that changing just a bit HiveHookEventProtoPartialBuilder should work to avoid this NPE:

private JSONObject getExplainJSON(ExplainWork explainWork) throws Exception {
    ExplainTask explain = (ExplainTask) TaskFactory.get(explainWork, new HiveConf());
    return explain.getJSONPlan(null, explainWork, stageIdRearrange);
}

This approach seems better than 4, although it doesn't eliminate the potential NPE from other code paths.

The limit could also become part of the ExplainConfiguration but I guess this change would have bigger impact.

Anyways, there are many more options, so I will leave the final decision up to you.

@Aggarwal-Raghav Aggarwal-Raghav changed the title HIVE-29012: NPE in ExplainTash when protologging posthook is enabled HIVE-29012: NPE in ExplainTask when protologging posthook is enabled Jun 18, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants