Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SVCS-552] folder_file_op cut down revalidate calls. #311

Open
wants to merge 5 commits into
base: develop
Choose a base branch
from

Conversation

Johnetordoff
Copy link
Contributor

@Johnetordoff Johnetordoff commented Dec 20, 2017

Ticket

https://openscience.atlassian.net/browse/SVCS-552
This PR replaces #287

Purpose

folder_file_op makes roughly a trillion metadata calls every time it moves a folder's children. This fix makes it easy to get a path from it's metadata object by giving the metadata an id which corresponds with the path identifier. Since we can now convert a metadata object into a path easily we no longer need to re-validate each time.

Changes

  • Add id property to every provider's metadata.
  • Changes from_metadata_to_path to uses metadata.id property
  • Changes folder_file op to use from_metadata_to_path instead of revalidate_path
  • Adds tests id property
  • Adds tests for folder_file_op

Side effects

None that I know of.

QA Notes

Copy/moves between providers should be unchanged or slightly faster.

Deployment Notes

None.

@coveralls
Copy link

coveralls commented Dec 20, 2017

Coverage Status

Coverage increased (+0.4%) to 90.34% when pulling 6732325 on Johnetordoff:folder_file_op_fix into 1b60f40 on CenterForOpenScience:develop.

@coveralls
Copy link

coveralls commented Jan 2, 2018

Coverage Status

Coverage increased (+0.4%) to 90.197% when pulling b61b2ec on Johnetordoff:folder_file_op_fix into 2f45e0b on CenterForOpenScience:develop.

Copy link
Contributor

@cslzchen cslzchen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good and I will take over the ticket/PR. Two minor issues:

  • Need to update Google Cloud as well which has been added recently.
  • This change affects all path-based provider and will cause conflicts for sure. Proceed with caution 🚔 .

@@ -133,6 +133,16 @@ def kind(self) -> str:
""" `file` or `folder` """
raise NotImplementedError

@property
@abc.abstractmethod
def id(self) -> str:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please update the DocStr to mention what this id is for both types.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -334,7 +334,7 @@ def request(self, *args, **kwargs):

folder = await dest_provider.create_folder(dest_path, folder_precheck=False)

dest_path = await dest_provider.revalidate_path(dest_path.parent, dest_path.name, folder=dest_path.is_dir)
folder_path = dest_provider.path_from_metadata(dest_path.parent, folder)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rename folder to folder_metadata (including the assign statement two lines above).

@@ -24,6 +24,10 @@ def __init__(self, raw, path_obj, owner=None, repo=None):
self.owner = owner
self.repo = repo

@property
def id(self):
return self.build_path()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since id is exactly the same, why not using return self.path? This is what other providers do.

- This id consolidates both path-based ID and id-
  based ID under a single property.
- It is identical to the path object identifier
  which is used to revalidate paths.
- In addititon, remove superfluous condition
  statement `if not futures: continue`
- `self.id` is identical to `self.path` since GCS
  is a path-based provider
- Intermediate commit. Please rebase or ammend.
- Discuss the todos and comments, fix issues if
  any. Update docstring, comments and remove
  todos.
- Need travis ci.
Copy link
Contributor

@cslzchen cslzchen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Johnetordoff and CC @felliott

I rebased and updated the code with extra comments for each provider affected. Need some further discussion with you before I can continue. Thanks.

@@ -27,6 +27,10 @@ class BaseGoogleCloudMetadata(metadata.BaseMetadata, metaclass=abc.ABCMeta):
def provider(self) -> str:
return 'googlecloud'

@property
def id(self) -> str:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add GoogleCloud to pass tests.

@@ -133,6 +133,16 @@ def kind(self) -> str:
""" `file` or `folder` """
raise NotImplementedError

@property
@abc.abstractmethod
def id(self) -> str:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -27,8 +27,11 @@ class BaseGoogleCloudMetadata(metadata.BaseMetadata, metaclass=abc.ABCMeta):
def provider(self) -> str:
return 'googlecloud'

# GoogleCloud is a path-based provider. However, it never calls ``revalidate_path()`` since it
# is the backend storage provider for OSFStorage.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to CloudFiles (when it was the storage provider), GoogleCloud never uses revalidate_path(). We should set the id to be None or empty so we know it is never used accidentally or cause any confusion in the further.

@property
def provider(self):
return 'github'

# GitHub is a path-based provider. However, it has its own ``revalidate_path()``.
# TODO: Investigate if this is an issue.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GitHub may not work because your fix calls (functionally, not literally) the default revalidate_path() instead of the customized one.

@property
def provider(self):
return 'dropbox'

# TODO: Is Dropbox path-based instead of id-based?
# TODO: Dropbox does not have a ``revalidate_path()``, should we use path instead of id?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no revalidate_path() for dropbox which makes guess that it is a path based (or at lease both)?

@property
def provider(self):
return 'dataverse'

# TODO: Should this be in ``DataverseFileMetadata``?
# TODO: Does ``self.raw['id']`` exists for ``DataverseDatasetMetadata``?
# TODO: DataVerse's ``revalidate_path()`` looks very different. Is this an issue?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to GitHub, your fix calls (functionally, not literally) revalidate_path() differently than the original way. This might be an issue.

@property
def provider(self):
return 'cloudfiles'

# CloudFiles is a path-based provider. However, it never called ``revalidate_path()`` even when
# it was the backend storage provider for OSFStorage. The provider is no longer active until it
# is upgraded to an addon provider for the users.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See GoogleCloud.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants