-
Notifications
You must be signed in to change notification settings - Fork 76
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SVCS-353] Look for Dataverse renamed files on upload #300
base: develop
Are you sure you want to change the base?
[SVCS-353] Look for Dataverse renamed files on upload #300
Conversation
cf0afbc
to
3ada43b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Found one issue, annotated over three comments.
After a little testing and reviewing of the Dataverse API I think we need to go a little further here. As you already noted, Dataverse is saving certain file types as tabular files. As this PR stands we present the '.tab" file extension and return the tabular format when downloading.
We can't allow our users data to be abused this way. Fitz has pinged product about updating the terms of service for the Dataverse provider.
Though we may have to let this aggression stand to some extent, I think we might be able to mask our users from it.
Since we have a means of identifying the original file format we can hopefully present that file extension to them. Also there is a download parameter "format=original" that we should be able to use to get the original file back. http://guides.dataverse.org/en/latest/api/dataaccess.html?highlight=download#id4
I think it is worth giving it a try. To be honest, if we can't guarantee the data integrity, I'd almost prefer to just not allow those file types to be uploaded. So horrible.
|
||
def test_original_ext_from_raw_metadata(self, format_dict): | ||
for key in format_dict: | ||
assert key == dv_utils.original_ext_from_raw_metadata(format_dict[key]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
waterbutler.providers.dataverse.utils.ORIGINAL_FORMATS is a dictionary, which are unordered. Therefore this test will fail roughly two out of three runs. There are three keys which will match in dv_utils.original_ext_from_raw_metadata only one of which is 'RData'.
name of the file. | ||
""" | ||
|
||
ext = dv_utils.original_ext_from_raw_metadata(self.raw) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar to previous dv_utils.original_ext_from_raw_metadata comment. Will return more or less at random 'RData', 'rdata', or 'Rdata' for self.raw with 'originalFormatLabel': 'R Data'
|
||
}, | ||
# Rdata can come in a few different forms, so just list all of them here | ||
'RData': { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See two previous comments regarding multiple 'original_label': 'R Data'
@@ -178,7 +178,8 @@ def can_duplicate_names(self): | |||
# Find appropriate version of file | |||
metadata = await self._get_data('latest') | |||
files = metadata if isinstance(metadata, list) else [] | |||
file_metadata = next(file for file in files if file.name == path.name) | |||
file_metadata = next(file for file in files if file.name == path.name or |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this need and extra set of prenths?
Dataverse 'ingests' certain file types. These file types get renamed. In upload when Waterbutler tries to find the correct metadata to return, it will 500 since it was not looking for the renamed file.
725e635
to
c803684
Compare
Changes on last commit:
If you upload a large RData file, it takes awhile for dataverse to convert it. Any uploads during that time will give a 400 back saying ingestion is currently clogged. To test the above upload an RData file. Once its load bar finishes, immediately upload another ingested file type. It should throw the error. New tests for this new exception handling etc. |
if not label or not file_format or not content_type: | ||
return None | ||
|
||
for key in ORIGINAL_FORMATS: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
probably better as:
for key, value in ORIGINAL_FORMATS.items():
if (label == value['original_label'] and ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can apply the above to TestUtils class as well.
await resp.release() | ||
|
||
# Find appropriate version of file | ||
metadata = await self._get_data('latest') | ||
files = metadata if isinstance(metadata, list) else [] | ||
file_metadata = next(file for file in files if file.name == path.name) | ||
file_metadata = next(file for file in files if (file.name == path.name or |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we not just use a list comprehension here?
Dataverse 'ingests' certain file types. These file types get renamed.
In upload when Waterbutler tries to find the correct metadata to return,
it will 500 since it was not looking for the renamed file.
Ticket
https://openscience.atlassian.net/browse/SVCS-353
Purpose
Uploading some file types to dataverse would cause a 500 error to be thrown in waterbutler. The file would still upload, but would have a new extension.
Changes
Allow waterbutler to return the correct metadata based on the current file name, as well as a new property called
file.original_name
. This new property will parse DV raw metadata to figure out what the original file type was. This way WB can return the correct metadata and won't throw a 500.added tests for new functions.
Side effects
Currently DV just deletes old files if you try to upload a conflicting file name. (unsure if this is correct functionality or not)
Because DV changes the file names when uploading certain file types (csv, xlsx, and others), it wont be able to delete the original because it will look like it does not exist. Instead DV uses their own renaming scheme , <conflicting-file-2< ...
Because of this, when a file that DV renames gets uploaded, if it got renamed with a -1, or a -2 etc, WB won't look for that. instead it will return the metadata of the non-renamed version.
This shouldn't be much of a problem, as it is really only used for updating fangorn. A page refresh negates this error
I had thought of a few ways to negate this, but decided against it as it would add quite a few more DV api calls to the process. If it is determined in CR that we want to do that, then it can be implemented
QA Notes
The effect file types are as follows: .csv, .RData, .sav, .dta, .por, .xlsx
If you upload one of these files types to DV, it will be converted into a
.tab
file via theiringestion
process.Steps:
Steps for 'side-effects' notes:
Deployment Notes