Catalog error when search-index rebuild a harvest source #2866

Closed
FuhuXia opened this issue Feb 22, 2021 · 1 comment
Labels
component/catalog Related to catalog component playbooks/roles

FuhuXia commented Feb 22, 2021

We can run the following command to reindex a package:
ckan search-index rebuild [dataset_name]
When the command is run on a Catalog harvest source, it raises a SearchIndexError, and the harvest source then disappears from the UI.

How to reproduce

  1. Create a data.json harvest source (named, e.g., test-json-source) in a test organization on Catalog (replicated on sandbox and staging)
  2. Run a harvest job
  3. SSH into the harvester instance and run the command ckan search-index rebuild test-json-source

Expected behavior

Command runs successfully.

Actual behavior

It generates a SearchIndexError.
Harvest source test-json-source is removed from its organization.

Solr responded with an error (HTTP 400): [Reason: Error parsing JSON field value.
Unexpected OBJECT_START at [3098], field=status]

@FuhuXia FuhuXia added the component/catalog Related to catalog component playbooks/roles label Feb 22, 2021

FuhuXia commented Feb 25, 2021

The cause of this issue is a bug in pysolr. It only affects harvest sources that have at least one harvest job run, which makes the pkg_dict sent to Solr a multi-level nested JSON object:

{
	'owner_org': u'068066e8-4d6a-427d-b8b7-d4c63bdc733f',
	'maintainer': None, 
	...
	'status': {
		'job_count': 1,
		'total_datasets': 3L,
		'last_job': {
			'status': u'Finished', 
			'finished': '2021-02-23 19:52:38.606360',
			...
			'stats': {
				'deleted': 0,
				'updated': 0,
				...
			}
		}
	}
}

For a fresh harvest source with no harvest jobs, last_job is None and the issue does not occur. The issue only affects the catalog.data.gov fcs branch, where we upgraded pysolr from the CKAN default of 3.6.0 to 3.9.0 for performance reasons.
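
To make the difference between the two cases concrete, here is a minimal Python sketch that measures how deeply dicts are nested under each top-level key of a pkg_dict. The helper names (nesting_depth, deeply_nested_fields), the max_depth threshold of 1, and the job_count and total_datasets values of the fresh source are assumptions for illustration only; they are not part of CKAN or pysolr.

```python
def nesting_depth(value):
    """Return how many levels of dicts are nested inside `value` (0 for non-dicts)."""
    if not isinstance(value, dict):
        return 0
    return 1 + max((nesting_depth(v) for v in value.values()), default=0)


def deeply_nested_fields(pkg_dict, max_depth=1):
    """List top-level keys whose value nests dicts deeper than `max_depth`.

    The threshold is an assumption: the issue only says the fresh source
    (one level under 'status') is fine and the harvested one is not.
    """
    return sorted(k for k, v in pkg_dict.items() if nesting_depth(v) > max_depth)


# Fresh harvest source: no job has run, so last_job is None.
# The zero counts are assumed values for illustration.
fresh_source = {
    "owner_org": "068066e8-4d6a-427d-b8b7-d4c63bdc733f",
    "maintainer": None,
    "status": {"job_count": 0, "total_datasets": 0, "last_job": None},
}

# Harvest source after one job run, shaped like the dump in this comment
# (elided fields are simply left out).
harvested_source = {
    "owner_org": "068066e8-4d6a-427d-b8b7-d4c63bdc733f",
    "maintainer": None,
    "status": {
        "job_count": 1,
        "total_datasets": 3,
        "last_job": {
            "status": "Finished",
            "finished": "2021-02-23 19:52:38.606360",
            "stats": {"deleted": 0, "updated": 0},
        },
    },
}

print(deeply_nested_fields(fresh_source))
print(deeply_nested_fields(harvested_source))
```

Only the harvested source reports the status field, because status contains last_job, which in turn contains stats.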

Because the rebuild command actually deletes the record and adds it again, the record is deleted from the Solr index but never added back, so the harvest source disappears from the UI.
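
The delete-then-add behavior can be sketched with an in-memory stand-in for the index. FakeIndex, FakeSearchIndexError, the rebuild function, and the "rejects" flag are all hypothetical names made up for this illustration; this is not the CKAN or pysolr code, only a model of the sequence described here.

```python
class FakeSearchIndexError(Exception):
    """Stand-in for the SearchIndexError reported in this issue."""


class FakeIndex:
    """In-memory stand-in for a search index; not the CKAN or pysolr API."""

    def __init__(self):
        self.docs = {}

    def delete(self, name):
        # Removing a missing record is treated as a no-op.
        self.docs.pop(name, None)

    def add(self, name, doc):
        # Simulate the reported failure: the backend rejects some documents.
        if doc.get("rejects"):
            raise FakeSearchIndexError("backend rejected document %r" % name)
        self.docs[name] = doc


def rebuild(index, name, doc):
    """Delete the record, then add it again, as the rebuild is described here."""
    index.delete(name)
    index.add(name, doc)


index = FakeIndex()
index.docs["test-json-source"] = {"title": "old copy"}

try:
    rebuild(index, "test-json-source", {"title": "new copy", "rejects": True})
except FakeSearchIndexError as exc:
    print("rebuild failed:", exc)

# The old copy was deleted before the add failed, so nothing is left.
print("test-json-source" in index.docs)
```

Because the delete succeeds before the add fails, the old copy is gone and no new copy replaces it, which matches the harvest source vanishing from the UI.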

Tested on pysolr 3.6.0, 3.7.0, and 3.8.0: no such issue.
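
As a small aid for checking an environment against these findings, the following Python sketch classifies a pysolr version string using only the versions named in this issue. The function name pysolr_status and the three labels are assumptions for illustration; versions not mentioned here are reported as untested rather than guessed.

```python
# Versions named in this issue; anything else is outside what was tested.
UNAFFECTED_VERSIONS = {"3.6.0", "3.7.0", "3.8.0"}
AFFECTED_VERSIONS = {"3.9.0"}


def pysolr_status(version):
    """Classify a pysolr version string based on the results reported here."""
    if version in AFFECTED_VERSIONS:
        return "affected"
    if version in UNAFFECTED_VERSIONS:
        return "not affected"
    return "untested"


for candidate in ("3.6.0", "3.9.0", "3.9.1"):
    print(candidate, pysolr_status(candidate))
```

On Python 3.8 or later, the installed version string could be obtained with importlib.metadata.version("pysolr") and passed to this function.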
