Ft/dropbox chunked uploads #11

Open

Wants to merge 175 commits into dropbox-chunked-uploads from ft/dropbox-chunked-uploads
Conversation

NyanHelsing

No description provided.

felliott and others added 30 commits March 7, 2018 14:27
 * According to the RFC[1], a server may ignore a Range header and
   should ignore a Range header containing units it doesn't
   understand.  WB was erroring under these conditions, but it seems
   more appropriate to ignore the field.  Testing against external
   providers showed no consistent practice to follow.  The
   parse_request_range docs have been updated to reflect the new
   behavior.  Big thanks to @birdbrained for doing the legwork on
   researching this!

 [1] https://tools.ietf.org/html/rfc7233#section-3.1
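The lenient behavior can be sketched roughly as follows; this is a simplified stand-in for WB's `parse_request_range`, not its actual signature:

```python
import re
from typing import Optional, Tuple

RANGE_RE = re.compile(r'bytes=(\d+)-(\d*)$')

def parse_request_range(range_header: str) -> Optional[Tuple[int, Optional[int]]]:
    """Parse a Range header, returning None (i.e. ignore the header) instead
    of erroring when the unit is unrecognized or the value is malformed."""
    match = RANGE_RE.match(range_header)
    if match is None:
        return None  # unknown unit or malformed range: ignore per RFC 7233
    start, end = match.groups()
    return int(start), (int(end) if end else None)
```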
This is a rebased and squashed commit of a fully working
implementation for the provider based on Google Cloud
Storage's JSON API and OAuth 2.0 protocol.

For detailed commits and messages, please see this PR:
CenterForOpenScience#317
- Removed functionalities that are not used by
  OSFStorage and that do not have documented
  support
- Removed JSON API related docstrings/comments
- Added TODOs on what needs to be refactored
- Update tests
- updated settings to use XML API
- updated metadata.py to parse response headers
  and init metadata object
- added a helper function in utils.py to convert
  GCS's Base64-encoded hash to a hex digest (see
  the sketch after this list)
- refactored the structure of fixtures and updated
  them with real responses from Postman tests
- Fully refactored all provider actions to use
  XML API and signed request and implemented a
  minimal version:
    - Upload
    - Download
    - Metadata for file
    - Delete file
    - Intra-copy file
- Added TODO comments for Phase 1, 1.5 and 2:
    - Create folder
    - Metadata for folder
    - Delete folder
    - Intra-copy folder
- Replaced the provider's `.build_url` with
  `.build_and_sign_req_url`, which takes care of
  URL building and request signing together
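
A rough sketch of the hash-conversion helper mentioned in the list above; the function name and example value are illustrative, not the actual utils.py code:

```python
import base64

def base64_to_hex(encoded_hash: str) -> str:
    """Convert a Base64-encoded digest from GCS (e.g. the md5 value in the
    x-goog-hash header) into a lowercase hex digest."""
    return base64.b64decode(encoded_hash).hex()

# Example: the Base64 MD5 of an empty object becomes the familiar hex form.
# base64_to_hex('1B2M2Y8AsgTpgAmY7PhCfg==') == 'd41d8cd98f00b204e9800998ecf8427e'
```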
Discovered and fixed a few issues during OSF integration
testing and updated comments and docstrings.

- Main issue: aiohttp parses the `x-goog-hash` header
  correctly and returns an `aiohttp.MultiDict` dictionary
  that contains two entries with the same key, one for
  crc32c and one for md5. Fix: updated headers parsing
  and tests (see the sketch after this list)
- Minor fixes/updates
  - Updated HTTP method for delete and intra-copy
  - Added the bucket to "x-goog-copy-source" and no longer
    convert the value to lower case for request signing
  - Strip '"' from "ETag" for verifying upload checksum
  - Removed "Content-length" and only use "x-goog-stored-
    content-length" for file size
  - Prefixed `build_and_sign_req_url()` with `_`
Note: environment variables are passed in as strings
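
A rough sketch of reading both hash entries out of the multi-valued header; the helper name is illustrative, and the headers are assumed to arrive as the multidict types that aiohttp and aiohttpretty produce:

```python
from multidict import CIMultiDict, CIMultiDictProxy

def parse_goog_hashes(headers) -> dict:
    """Collect the crc32c and md5 entries that GCS sends as two
    separate x-goog-hash header values."""
    hashes = {}
    for value in headers.getall('x-goog-hash', []):
        name, _, encoded = value.partition('=')
        hashes[name] = encoded
    return hashes

headers = CIMultiDictProxy(CIMultiDict([
    ('x-goog-hash', 'crc32c=AAAAAA=='),
    ('x-goog-hash', 'md5=1B2M2Y8AsgTpgAmY7PhCfg=='),
]))
# {'crc32c': 'AAAAAA==', 'md5': '1B2M2Y8AsgTpgAmY7PhCfg=='}
print(parse_goog_hashes(headers))
```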
 Some tests and refactors are still to come, but the provider is ready for testing.

 [SVCS-617]
 Closes: CenterForOpenScience#322
- WB only needs to know the bucket name; OSF
  handles the region and selects the bucket with
  the expected region for WB.
- Removed deprecated `fixture.py`, which is now
  replaced by `providers.py`, `folders.py` and
  `files.py` in the `fixtures/` directory.
- Updated fixtures for CRUD operations and added
  back CRUD tests that were accidentally removed
- Fixed the expiration check in utility tests,
  which now uses `settings.SIGNATURE_EXPIRATION`
  to calculate the expected expiration time.
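
A rough sketch of how the expected expiration might be computed from the setting (the value and variable names are illustrative):

```python
import time

SIGNATURE_EXPIRATION = 60  # stand-in for settings.SIGNATURE_EXPIRATION, in seconds

# Derive the expected expiration from the configured lifetime instead of a
# hard-coded timestamp, so changing the setting cannot break the test.
expected_expiration = int(time.time()) + SIGNATURE_EXPIRATION
```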

Remove redundant type casting for expiration

- Type casting is now done in settings.py. I
  assume this piece of code was accidentally
  left here during the last merge.
- Fix import order; `typing` is a standard lib
- Use `+=` for request segment concatenation
- Replace `return True if <condition> else False`
  with `return bool(<condition>)` where the
  condition is `None` or `not None`
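
For example (the function name and header are illustrative):

```python
# Before: verbose conditional that only ever produces True or False
def has_md5_hash(headers):
    return True if headers.get('x-goog-hash') is not None else False

# After: bool() covers the None / not-None cases directly
def has_md5_hash(headers):
    return bool(headers.get('x-goog-hash'))
```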
- Update both the metadata method and the utility
  function to use the strict regex matching based
  on the RFC specification for Base 64 encoded
  crc32c and md5 hashes
- Google Cloud uses the standard alphabets for
  Base 64 encoding: [A-Za-z0-9+/=]
- RFC reference:
  http://www.rfc-editor.org/rfc/rfc4648.txt
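
A rough sketch of strict matching against the RFC 4648 standard alphabet; the exact patterns and digest lengths here are an assumption, not the provider's actual regexes:

```python
import re

# A Base64-encoded MD5 digest (16 bytes) is always 24 characters ending in
# '=='; a Base64-encoded CRC32C digest (4 bytes) is 8 characters ending in '=='.
BASE64_MD5_RE = re.compile(r'^[A-Za-z0-9+/]{22}==$')
BASE64_CRC32C_RE = re.compile(r'^[A-Za-z0-9+/]{6}==$')

def is_base64_md5(value: str) -> bool:
    return BASE64_MD5_RE.match(value) is not None

def is_base64_crc32c(value: str) -> bool:
    return BASE64_CRC32C_RE.match(value) is not None
```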
- Added an alternative constructor for the base
  metadata which takes either a standard Python
  dict or a pair of object name and multi-value
  dict during initialization
- Updated its usage in the provider
- Changed @staticmethod to @classmethod for
  `get_metadata_from_resp_headers()`
- Added metadata tests for both successful and
  failed initialization.
- Added `get_multi_dict_from_json()` (and a test
  for it) to utils so that all tests now use this
  helper method to build resp headers (see the
  sketch after this list)
- Refactored the metadata structure to test three
  classes separately
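
A rough sketch of what such a test helper could look like, assuming it accepts a JSON string of headers; the actual signature in utils may differ:

```python
import json

from multidict import CIMultiDict, CIMultiDictProxy

def get_multi_dict_from_json(raw: str) -> CIMultiDictProxy:
    """Build an immutable, case-insensitive multi-dict of response headers
    for tests, mirroring what aiohttp returns at runtime."""
    return CIMultiDictProxy(CIMultiDict(json.loads(raw).items()))
```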
- Removed import aliases such as `core_exception`,
  `pd_settings` and `pd_utils` since there are no
  shadowing issues any more
- Both GC file and folder metadata now use this
  dedicated method to initialize with the object
  name and aiohttp's "MultiDict" response headers
- Removed the alternative constructor for GC base
  metadata and updated related code and tests
- Moved quirks in comments into DocStr so that
  they are available in WB Docs
- Use double backticks (``) for code in DocStr
- Fixed Sphinx warnings
Side effect: modified the function signature for
get_multi_dict_from_python_dict() to expect a dict
instead of JSON; updated all related tests.
- GC's private members are now available in the
  WB Docs by adding :private-members:
- Remove :inherited-members: and :undoc-members:
- Updated the .rst files to include GC, metadata
  and utils
- mypy doesn't handle class inheritance well
- mypy has a problem with **{} arguments
- mypy doesn't handle multiple possible return
  types well
- Upload fails because CIMultiDictProxy inherits
  from MultiDictProxy but not from MultiDict.
  However, aiohttpretty returns CIMultiDict while
  aiohttp returns CIMultiDictProxy, and the type
  check is strict and only recognizes CIMultiDict
- Updated the check statement to accept either
  MultiDict or MultiDictProxy (see the sketch
  after this list):
  - WB code uses CIMultiDictProxy (a subclass of
  MultiDictProxy) since aiohttp parses the hash
  headers already.
  - WB test code uses MultiDict to modify the
  dictionary in get_multi_dict_from_python_dict(),
  which returns MultiDictProxy.
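
A rough sketch of the relaxed check (illustrative):

```python
from multidict import MultiDict, MultiDictProxy

def check_resp_headers_type(resp_headers) -> None:
    # Accept both: aiohttp hands back CIMultiDictProxy (a MultiDictProxy
    # subclass) at runtime, while test fixtures build CIMultiDict.
    assert isinstance(resp_headers, (MultiDict, MultiDictProxy))
```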
 * Minor updates to formatting of method signatures, import order, and
   judicious application of DeMorgan's Law to make conditionals more
   readable.

 * Update formatting of docstrings to make Sphinx docs more
   cross-linked and browsable.
 * Minor style fixes.

[SVCS-NO-TICKET] Improve The Command `invoke test`
Closes: CenterForOpenScience#353
@cslzchen cslzchen force-pushed the ft/dropbox-chunked-uploads branch 10 times, most recently from e2718b1 to 4913bc4 Compare July 19, 2018 16:44
felliott and others added 18 commits July 25, 2018 10:35
 * Analytics fields will be left in and hardcoded to `None` to avoid
   changing the schema.
 - For chunked upload, add `upload_part()` to handle uploading a single chunk;
   the chunked-upload path now calls `upload_part()` once per chunk.

 - For normal upload, move the code into `contiguous_upload()`
- According to Dropbox API docs, files larger than 150 MB must
  use chunked upload. Chunks can be any size up to 150 MB; a
  typical size is 4 MB, which is what WB uses. The maximum file
  size that can be uploaded is 350 GB.
 * The dropbox provider should error if it can't get a session
   identifier.  A `KeyError` should suffice until we see what an
   actual error condition looks like.
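
A rough sketch of the dispatch and chunking described in the two commits above; the provider method names, stream interface, and session-start response key are assumptions rather than the actual Dropbox provider code:

```python
CHUNK_SIZE = 4 * 1024 * 1024          # 4 MB parts; Dropbox allows parts up to 150 MB
CONTIGUOUS_LIMIT = 150 * 1024 * 1024  # larger files must go through an upload session

async def upload(provider, stream, path):
    """Dispatch to the chunked path for large files, contiguous otherwise."""
    if stream.size > CONTIGUOUS_LIMIT:
        return await chunked_upload(provider, stream, path)
    return await provider.contiguous_upload(stream, path)

async def chunked_upload(provider, stream, path):
    # Fail fast with a plain KeyError if the session-start response
    # doesn't contain a session identifier.
    session_id = (await provider.create_upload_session())['session_id']
    offset = 0
    while offset < stream.size:
        chunk = await stream.read(CHUNK_SIZE)
        await provider.upload_part(session_id, offset, chunk)  # one part per call
        offset += len(chunk)
    return await provider.complete_upload_session(session_id, offset, path)
```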