Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

job-archive.err[0]: job_info_lookup_continuation: parse job #4403

Closed
garlick opened this issue Jul 12, 2022 · 2 comments · Fixed by #4422
Closed

job-archive.err[0]: job_info_lookup_continuation: parse job #4403

garlick opened this issue Jul 12, 2022 · 2 comments · Fixed by #4422
Assignees

Comments

@garlick
Copy link
Member

garlick commented Jul 12, 2022

25 of these errors appeared in my test system instance logs last friday, when it was rebooted after I updated flux to include an early version of #4398. Prior to that the system was running 0.39.0. job-manager.inactive-age-limit was set to "4d".

At minimum the error handling could be improved (what was the parse error? what job ID)?

It looks like this is an error parsing a job entry returned from flux_job_list_inactive() RPC?

Here are the errors in context

Jul 08 12:18:54 picl0 flux[599369]: job-manager.info[0]: restart: 3 jobs
Jul 08 12:18:54 picl0 flux[599369]: job-manager.info[0]: restart: 0 running jobs
Jul 08 12:18:54 picl0 flux[599369]: broker.info[0]: rc1.0: running /usr/local/etc/flux/rc1.d/02-cron
Jul 08 12:18:54 picl0 flux[599369]: broker.info[0]: rc1.0: /usr/local/etc/flux/rc1 Exited (rc=0) 1.0s
Jul 08 12:18:54 picl0 flux[599369]: broker.info[0]: rc1-success: init->quorum 0.970209s
Jul 08 12:18:54 picl0 flux[599369]: broker.info[0]: online: picl0 (ranks 0)
Jul 08 12:18:54 picl0 flux[599369]: broker.info[0]: quorum-full: quorum->run 0.101253s
Jul 08 12:19:54 picl0 flux[599369]: job-archive.err[0]: job_info_lookup_continuation: parse job
Jul 08 12:20:54 picl0 flux[599369]: job-archive.err[0]: job_info_lookup_continuation: parse job
Jul 08 12:21:54 picl0 flux[599369]: job-archive.err[0]: job_info_lookup_continuation: parse job
Jul 08 12:22:54 picl0 flux[599369]: job-archive.err[0]: job_info_lookup_continuation: parse job
Jul 08 12:23:54 picl0 flux[599369]: job-archive.err[0]: job_info_lookup_continuation: parse job
Jul 08 12:24:54 picl0 flux[599369]: job-archive.err[0]: job_info_lookup_continuation: parse job
Jul 08 12:25:54 picl0 flux[599369]: job-archive.err[0]: job_info_lookup_continuation: parse job
Jul 08 12:26:54 picl0 flux[599369]: job-archive.err[0]: job_info_lookup_continuation: parse job
Jul 08 12:27:54 picl0 flux[599369]: job-archive.err[0]: job_info_lookup_continuation: parse job
Jul 08 12:28:54 picl0 flux[599369]: job-archive.err[0]: job_info_lookup_continuation: parse job
Jul 08 12:29:54 picl0 flux[599369]: job-archive.err[0]: job_info_lookup_continuation: parse job
Jul 08 12:30:54 picl0 flux[599369]: job-archive.err[0]: job_info_lookup_continuation: parse job
Jul 08 12:31:54 picl0 flux[599369]: job-archive.err[0]: job_info_lookup_continuation: parse job
Jul 08 12:32:54 picl0 flux[599369]: job-archive.err[0]: job_info_lookup_continuation: parse job
Jul 08 12:33:54 picl0 flux[599369]: job-archive.err[0]: job_info_lookup_continuation: parse job
Jul 08 12:34:54 picl0 flux[599369]: job-archive.err[0]: job_info_lookup_continuation: parse job
Jul 08 12:35:54 picl0 flux[599369]: job-archive.err[0]: job_info_lookup_continuation: parse job
Jul 08 12:36:54 picl0 flux[599369]: job-archive.err[0]: job_info_lookup_continuation: parse job
Jul 08 12:37:54 picl0 flux[599369]: job-archive.err[0]: job_info_lookup_continuation: parse job
Jul 08 12:38:54 picl0 flux[599369]: job-archive.err[0]: job_info_lookup_continuation: parse job
Jul 08 12:39:54 picl0 flux[599369]: job-archive.err[0]: job_info_lookup_continuation: parse job
Jul 08 12:40:54 picl0 flux[599369]: job-archive.err[0]: job_info_lookup_continuation: parse job
Jul 08 12:41:54 picl0 flux[599369]: job-archive.err[0]: job_info_lookup_continuation: parse job
Jul 08 12:42:27 picl0 flux[599369]: broker.info[0]: accepting connection from picl6 (rank 6) uuid 534cb01b-11a6-4f5a-a836-7045a9ea6c53 status full
Jul 08 12:42:27 picl0 flux[599369]: broker.info[0]: accepting connection from picl4 (rank 4) uuid 8b97ca3d-17af-45ef-9fb2-bdb24c8960fa status full
Jul 08 12:42:28 picl0 flux[599369]: broker.info[0]: online: picl[0,4,6] (ranks 0,4,6)
Jul 08 12:42:28 picl0 flux[599369]: broker.info[0]: accepting connection from picl3 (rank 3) uuid e3e7b553-10b8-4233-a594-70001b5235d0 status full
Jul 08 12:42:29 picl0 flux[599369]: broker.info[0]: accepting connection from picl2 (rank 2) uuid a72fb232-3b33-4a62-a02c-c807eb398966 status full
Jul 08 12:42:29 picl0 flux[599369]: broker.info[0]: online: picl[0,3-4,6] (ranks 0,3-4,6)
Jul 08 12:42:30 picl0 flux[599369]: broker.info[0]: online: picl[0,2-4,6] (ranks 0,2-4,6)
Jul 08 12:42:54 picl0 flux[599369]: job-archive.err[0]: job_info_lookup_continuation: parse job
Jul 08 12:43:54 picl0 flux[599369]: job-archive.err[0]: job_info_lookup_continuation: parse job
@chu11
Copy link
Member

chu11 commented Jul 18, 2022

ahhh, i suspect this is a fallout from #4327, ranks is something that is optionally returned now from job-list, it is no longer set to 0 by default.

Edit: oh wait, I am parsing it correctly, ensuring ranks is optionally returned .. hmmmm

@chu11
Copy link
Member

chu11 commented Jul 18, 2022

So this error message is actually a tad bothersome, it comes from what should be a "impossible" error.

    if (json_unpack (job, "{s:I s:i s?:s s:f s?:f s?:f s:f}",
                     "id", &id,
                     "userid", &userid,
                     "ranks", &ranks,
                     "t_submit", &t_submit,
                     "t_run", &t_run,
                     "t_cleanup", &t_cleanup,
                     "t_inactive", &t_inactive) < 0) {
        flux_log (ctx->h, LOG_ERR, "%s: parse job", __FUNCTION__);
        goto out;
    }

The suggestion is a job returned from list-inactive does not have atleast one of id, userid, t_submit or t_inactive set (note the above ranks, t_run, and t_cleanup are optional).

So either the list-inactive code is doing something really wrong, or maybe the KVS was corrupted? or something else? hard to know.

For now, lets stick in a little extra debugging like the job id and use json_unpack_ex() to get some extra info.

note, also possible this becomes moot if we begin to use #4336

@chu11 chu11 self-assigned this Jul 18, 2022
chu11 added a commit to chu11/flux-core that referenced this issue Jul 23, 2022
Problem: The error

job-archive.err[0]: job_info_lookup_continuation: parse job

was recently observed, which did not provide insight into the deeper
problem.

Solution: Improve the error message by using json_unpack_ex() to
get parsing error information.  Also add the job ID to the error
output.

Fixes flux-framework#4403
@mergify mergify bot closed this as completed in #4422 Jul 25, 2022
chu11 added a commit to chu11/flux-core that referenced this issue Aug 8, 2022
Problem: The error

job-archive.err[0]: job_info_lookup_continuation: parse job

was recently observed, which did not provide insight into the deeper
problem.

Solution: Improve the error message by using json_unpack_ex() to
get parsing error information.  Also add the job ID to the error
output.

Fixes flux-framework#4403
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants