Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ames/harvesters/caltechdata.py fails at line 35 #14

Open
rsdoiel opened this issue Oct 27, 2022 · 7 comments
Open

ames/harvesters/caltechdata.py fails at line 35 #14

rsdoiel opened this issue Oct 27, 2022 · 7 comments
Assignees
Labels

Comments

@rsdoiel
Copy link
Member

rsdoiel commented Oct 27, 2022

This came up after updating Feeds v1 to use v0.9.0 release of ames. Harvesting from CaltechDATA is aborting. The full error message is

raceback (most recent call last):
  File "/Sites/feeds.library.caltech.edu/harvest.py", line 265, in <module>
    harvest_caltechdata('etc/CaltechDATA.json')
  File "/Sites/feeds.library.caltech.edu/feeds/caltechdata.py", line 33, in harvest_caltechdata
    get_caltechdata(collection_name, production=True)
  File "/Users/datawork/Library/Python/3.10/lib/python/site-packages/ames/harvesters/caltechdata.py", line 35, in get_caltechdata
    hits += response["hits"]["hits"]
KeyError: 'hits'

I believe the issue is that line 35 uses response["hits"]["hits"] before one or other levels of the array has been initialized. If this is correct then the hits += should fall back to zero is my guess. I will test this and then do a pull request when I have a fix ready.

@rsdoiel rsdoiel self-assigned this Oct 27, 2022
@rsdoiel rsdoiel added the bug label Oct 27, 2022
@tmorrell
Copy link
Member

tmorrell commented Oct 27, 2022 via email

@rsdoiel
Copy link
Member Author

rsdoiel commented Oct 27, 2022

I've deployed the branch issue-14 in AMES and that seems to work, still tracking down other bugs. YMMV so review my changes before accepting the pull request.

@rsdoiel
Copy link
Member Author

rsdoiel commented Oct 27, 2022

Does hits come from https://data.caltech.edu/api/records? If so I think the code is ask for https://data.caltech.edu/api/records/ (note the trailing slash) and is because Invenio-RDM is expecting a id to follow the slash so throughs an error. If so the changes I suggested previously are masking the problem not solving it.

@rsdoiel
Copy link
Member Author

rsdoiel commented Oct 27, 2022

After looking closer at how things work I think the problem may be in caltechdata_api. In terms of processing you should be able to retrieve an setup of ids. I think this is done with get_metadata() but if no ids are specific request the URL that gets used ends in the '/' as mentioned previously. This is an erorr for our Invenio-RDM deployment. What needs to be requested is probably the clean URL without trailing '/' but with the paging parameters to get the chunks.

This could be done by tweaking get_metadata() in the caltechdata_api to handle that case or it could be done as a specific function call. Not sure correct approach.

@tmorrell
Copy link
Member

I've taken a look at the code and it all should work. get_metadata just returns metadata from one record at a time, while get_caltechdata collects all the record ids and them loops over them. We could make this much more efficient by using an updated data parameter, but this should work for the moment. Both caltechdata_api and ames will have to be installed from the latest main branch to work. I didn't want to mess with datawork since it seems feeds at least partially ran this morning.

@rsdoiel
Copy link
Member Author

rsdoiel commented Oct 28, 2022 via email

@rsdoiel
Copy link
Member Author

rsdoiel commented Nov 10, 2022

What dates value should I used as "updated"? The data I get via AME's doesn't seem to include one when I use get_caltechdata to populate CaltechDATA.ds in feeds.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants