Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SVCS-353] Look for Dataverse renamed files on upload #300

Open
wants to merge 3 commits into
base: develop
Choose a base branch
from

Conversation

AddisonSchiller
Copy link
Contributor

Dataverse 'ingests' certain file types. These file types get renamed.
In upload when Waterbutler tries to find the correct metadata to return,
it will 500 since it was not looking for the renamed file.

Ticket

https://openscience.atlassian.net/browse/SVCS-353

Purpose

Uploading some file types to dataverse would cause a 500 error to be thrown in waterbutler. The file would still upload, but would have a new extension.

Changes

Allow waterbutler to return the correct metadata based on the current file name, as well as a new property called file.original_name. This new property will parse DV raw metadata to figure out what the original file type was. This way WB can return the correct metadata and won't throw a 500.
added tests for new functions.

Side effects

Currently DV just deletes old files if you try to upload a conflicting file name. (unsure if this is correct functionality or not)
Because DV changes the file names when uploading certain file types (csv, xlsx, and others), it wont be able to delete the original because it will look like it does not exist. Instead DV uses their own renaming scheme , <conflicting-file-2< ...
Because of this, when a file that DV renames gets uploaded, if it got renamed with a -1, or a -2 etc, WB won't look for that. instead it will return the metadata of the non-renamed version.
This shouldn't be much of a problem, as it is really only used for updating fangorn. A page refresh negates this error

I had thought of a few ways to negate this, but decided against it as it would add quite a few more DV api calls to the process. If it is determined in CR that we want to do that, then it can be implemented

QA Notes

The effect file types are as follows: .csv, .RData, .sav, .dta, .por, .xlsx
If you upload one of these files types to DV, it will be converted into a .tab file via their ingestion process.

Steps:

  • there are some sample file types on the JIRA ticket to use.
  • do not batch upload all the files (DV will return a 400 since it will get stuck on one of the ingestion operations)
  • One by one, upload each file
  • it should complete successfully and show up in fangorn as <name.tab>.
  • clicking on it should go to the right file.

Steps for 'side-effects' notes:

  • upload a file for the second time
  • it will look like 2 of the same file are on the page
  • refresh the page and one of them should show up with a -1

Deployment Notes

@AddisonSchiller AddisonSchiller force-pushed the feature/SVCS-353-davaverse-csv-bug branch from cf0afbc to 3ada43b Compare November 13, 2017 16:44
@coveralls
Copy link

coveralls commented Nov 13, 2017

Coverage Status

Coverage increased (+0.03%) to 89.136% when pulling 3ada43b on AddisonSchiller:feature/SVCS-353-davaverse-csv-bug into 473191c on CenterForOpenScience:develop.

Copy link
Member

@TomBaxter TomBaxter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found one issue, annotated over three comments.

After a little testing and reviewing of the Dataverse API I think we need to go a little further here. As you already noted, Dataverse is saving certain file types as tabular files. As this PR stands we present the '.tab" file extension and return the tabular format when downloading.

We can't allow our users data to be abused this way. Fitz has pinged product about updating the terms of service for the Dataverse provider.

Though we may have to let this aggression stand to some extent, I think we might be able to mask our users from it.

Since we have a means of identifying the original file format we can hopefully present that file extension to them. Also there is a download parameter "format=original" that we should be able to use to get the original file back. http://guides.dataverse.org/en/latest/api/dataaccess.html?highlight=download#id4

I think it is worth giving it a try. To be honest, if we can't guarantee the data integrity, I'd almost prefer to just not allow those file types to be uploaded. So horrible.


def test_original_ext_from_raw_metadata(self, format_dict):
for key in format_dict:
assert key == dv_utils.original_ext_from_raw_metadata(format_dict[key])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

waterbutler.providers.dataverse.utils.ORIGINAL_FORMATS is a dictionary, which are unordered. Therefore this test will fail roughly two out of three runs. There are three keys which will match in dv_utils.original_ext_from_raw_metadata only one of which is 'RData'.

name of the file.
"""

ext = dv_utils.original_ext_from_raw_metadata(self.raw)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to previous dv_utils.original_ext_from_raw_metadata comment. Will return more or less at random 'RData', 'rdata', or 'Rdata' for self.raw with 'originalFormatLabel': 'R Data'


},
# Rdata can come in a few different forms, so just list all of them here
'RData': {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See two previous comments regarding multiple 'original_label': 'R Data'

@@ -178,7 +178,8 @@ def can_duplicate_names(self):
# Find appropriate version of file
metadata = await self._get_data('latest')
files = metadata if isinstance(metadata, list) else []
file_metadata = next(file for file in files if file.name == path.name)
file_metadata = next(file for file in files if file.name == path.name or
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this need and extra set of prenths?

Dataverse 'ingests' certain file types. These file types get renamed.
In upload when Waterbutler tries to find the correct metadata to return,
it will 500 since it  was not looking for the renamed file.
@AddisonSchiller AddisonSchiller force-pushed the feature/SVCS-353-davaverse-csv-bug branch from 725e635 to c803684 Compare November 21, 2017 18:46
@AddisonSchiller
Copy link
Contributor Author

Changes on last commit:

original_name -> original_names and is now a list
This gets rid of the need for "RData", "rdata" and "Rdata" etc. It now returns a list of possible names to look for metadata.

If you upload a large RData file, it takes awhile for dataverse to convert it. Any uploads during that time will give a 400 back saying ingestion is currently clogged.
the provider will now look for these messages and display a message to the user that the upload process is clogged and to wait a few seconds and try again.

To test the above upload an RData file. Once its load bar finishes, immediately upload another ingested file type. It should throw the error.

New tests for this new exception handling etc.

@coveralls
Copy link

coveralls commented Nov 21, 2017

Coverage Status

Coverage increased (+0.05%) to 89.992% when pulling c803684 on AddisonSchiller:feature/SVCS-353-davaverse-csv-bug into 26bf209 on CenterForOpenScience:develop.

@AddisonSchiller AddisonSchiller changed the title [SVCS-353]Look for Dataverse renamed files on upload [SVCS-353] Look for Dataverse renamed files on upload Nov 27, 2017
if not label or not file_format or not content_type:
return None

for key in ORIGINAL_FORMATS:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably better as:

for key, value in ORIGINAL_FORMATS.items():
if (label == value['original_label'] and ...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can apply the above to TestUtils class as well.

await resp.release()

# Find appropriate version of file
metadata = await self._get_data('latest')
files = metadata if isinstance(metadata, list) else []
file_metadata = next(file for file in files if file.name == path.name)
file_metadata = next(file for file in files if (file.name == path.name or
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we not just use a list comprehension here?

@coveralls
Copy link

coveralls commented Nov 30, 2017

Coverage Status

Coverage increased (+0.05%) to 89.992% when pulling 9f0cdbb on AddisonSchiller:feature/SVCS-353-davaverse-csv-bug into 26bf209 on CenterForOpenScience:develop.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants