Ft/dropbox chunked uploads #11

Open

Wants to merge 175 commits into dropbox-chunked-uploads from ft/dropbox-chunked-uploads
Conversation

NyanHelsing

No description provided.

felliott and others added 30 commits March 7, 2018 14:27
 * According to the RFC[1], a server may ignore a Range header and
   should ignore a Range header containing units it doesn't
   understand.  WB was erroring under these conditions, but it seems
   more appropriate to ignore the field.  Testing against external
   providers showed no consistent practice to follow.  The
   parse_request_range docs have been updated to reflect the new
   behavior.  Big thanks to @birdbrained for doing the legwork on
   researching this!

 [1] https://tools.ietf.org/html/rfc7233#section-3.1
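The lenient behavior can be sketched roughly as follows; this is a simplified stand-in for WB's `parse_request_range`, not its actual signature:

```python
import re
from typing import Optional, Tuple

RANGE_RE = re.compile(r'bytes=(\d+)-(\d*)$')

def parse_request_range(range_header: str) -> Optional[Tuple[int, Optional[int]]]:
    """Parse a Range header, returning None (i.e. ignore the header) instead
    of erroring when the unit is unrecognized or the value is malformed."""
    match = RANGE_RE.match(range_header)
    if match is None:
        return None  # unknown unit or malformed range: ignore per RFC 7233
    start, end = match.groups()
    return int(start), (int(end) if end else None)
```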
This is a rebased and squashed commit of a fully working
implementation for the provider based on Google Cloud
Storage's JSON API and OAuth 2.0 protocol.

For detailed commits and messages, please see this PR:
CenterForOpenScience#317
- Removed functionalities that are not used by
  OSFStorage and that do not have documented
  support
- Removed JSON API related docstrings/comments
- Added TODOs on what needs to be refactored
- Update tests
- updated settings to use XML API
- updated metadata.py to parse response headers
  and init metadata object
- added a helper function in utils.py to convert
  GCS's Base64-encoded hash to a hex digest (see
  the sketch after this list)
- refactored the structure of fixtures and updated
  them with real responses from Postman tests
- Fully refactored all provider actions to use
  XML API and signed request and implemented a
  minimal version:
    - Upload
    - Download
    - Metadata for file
    - Delete file
    - Intra-copy file
- Added TODO comments for Phase 1, 1.5 and 2:
    - Create folder
    - Metadata for folder
    - Delete folder
    - Intra-copy folder
- Replaced the provider's `.build_url` with
  `.build_and_sign_req_url`, which takes care of
  URL building and request signing together
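
A rough sketch of the hash-conversion helper mentioned in the list above; the function name and example value are illustrative, not the actual utils.py code:

```python
import base64

def base64_to_hex(encoded_hash: str) -> str:
    """Convert a Base64-encoded digest from GCS (e.g. the md5 value in the
    x-goog-hash header) into a lowercase hex digest."""
    return base64.b64decode(encoded_hash).hex()

# Example: the Base64 MD5 of an empty object becomes the familiar hex form.
# base64_to_hex('1B2M2Y8AsgTpgAmY7PhCfg==') == 'd41d8cd98f00b204e9800998ecf8427e'
```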
Discovered and fixed a few issues during OSF integration
testing and updated comments and docstrings.

- Main issue: aiohttp parses the `x-goog-hash` header
  correctly and returns an `aiohttp.MultiDict` dictionary
  that contains two entries with the same key, one for
  crc32c and one for md5. Fix: updated headers parsing
  and tests (see the sketch after this list)
- Minor fixes/updates
  - Updated HTTP method for delete and intra-copy
  - Added the bucket to "x-goog-copy-source" and no longer
    convert the value to lower case for request signing
  - Strip '"' from "ETag" for verifying upload checksum
  - Removed "Content-length" and only use "x-goog-stored-
    content-length" for file size
  - Prefixed `build_and_sign_req_url()` with `_`
Note: environment variables are passed in as strings
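
A rough sketch of reading both hash entries out of the multi-valued header; the helper name is illustrative, and the headers are assumed to arrive as the multidict types that aiohttp and aiohttpretty produce:

```python
from multidict import CIMultiDict, CIMultiDictProxy

def parse_goog_hashes(headers) -> dict:
    """Collect the crc32c and md5 entries that GCS sends as two
    separate x-goog-hash header values."""
    hashes = {}
    for value in headers.getall('x-goog-hash', []):
        name, _, encoded = value.partition('=')
        hashes[name] = encoded
    return hashes

headers = CIMultiDictProxy(CIMultiDict([
    ('x-goog-hash', 'crc32c=AAAAAA=='),
    ('x-goog-hash', 'md5=1B2M2Y8AsgTpgAmY7PhCfg=='),
]))
# {'crc32c': 'AAAAAA==', 'md5': '1B2M2Y8AsgTpgAmY7PhCfg=='}
print(parse_goog_hashes(headers))
```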
 Some tests and refactors are still to come, but the provider is ready for testing.

 [SVCS-617]
 Closes: CenterForOpenScience#322
- WB only needs to know the bucket name; OSF
  handles the region and selects the bucket with
  the expected region for WB.
- Removed deprecated `fixture.py`, which is now
  replaced by `providers.py`, `folders.py` and
  `files.py` in the `fixtures/` directory.
- Updated fixtures for CRUD operations and added
  back CRUD tests that were accidentally removed
- Fixed the expiration check in utility tests,
  which now uses `settings.SIGNATURE_EXPIRATION`
  to calculate the expected expiration time.
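
A rough sketch of how the expected expiration might be computed from the setting (the value and variable names are illustrative):

```python
import time

SIGNATURE_EXPIRATION = 60  # stand-in for settings.SIGNATURE_EXPIRATION, in seconds

# Derive the expected expiration from the configured lifetime instead of a
# hard-coded timestamp, so changing the setting cannot break the test.
expected_expiration = int(time.time()) + SIGNATURE_EXPIRATION
```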

Remove redundant type casting for expiration

- Type casting is now done in settings.py. I
  assume this piece of code was accidentally
  left here during the last merge.
- Fix import order; `typing` is a standard lib
- Use `+=` for request segment concatenation
- Replace `return True if <condition> else False`
  with `return bool(<condition>)` where the
  condition is `None` or `not None`
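
For example (the function name and header are illustrative):

```python
# Before: verbose conditional that only ever produces True or False
def has_md5_hash(headers):
    return True if headers.get('x-goog-hash') is not None else False

# After: bool() covers the None / not-None cases directly
def has_md5_hash(headers):
    return bool(headers.get('x-goog-hash'))
```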
- Update both the metadata method and the utility
  function to use the strict regex matching based
  on the RFC specification for Base 64 encoded
  crc32c and md5 hashes
- Google Cloud uses the standard alphabets for
  Base 64 encoding: [A-Za-z0-9+/=]
- RFC reference:
  http://www.rfc-editor.org/rfc/rfc4648.txt
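
A rough sketch of strict matching against the RFC 4648 standard alphabet; the exact patterns and digest lengths here are an assumption, not the provider's actual regexes:

```python
import re

# A Base64-encoded MD5 digest (16 bytes) is always 24 characters ending in
# '=='; a Base64-encoded CRC32C digest (4 bytes) is 8 characters ending in '=='.
BASE64_MD5_RE = re.compile(r'^[A-Za-z0-9+/]{22}==$')
BASE64_CRC32C_RE = re.compile(r'^[A-Za-z0-9+/]{6}==$')

def is_base64_md5(value: str) -> bool:
    return BASE64_MD5_RE.match(value) is not None

def is_base64_crc32c(value: str) -> bool:
    return BASE64_CRC32C_RE.match(value) is not None
```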
- Added an alternative constructor for the base
  metadata which takes either a standard Python
  dict or a pair of object name and multi-value
  dict during initialization
- Updated its usage in the provider
- Changed @staticmethod to @classmethod for
  `get_metadata_from_resp_headers()`
- Added metadata tests for both successful and
  failed initialization.
- Added `get_multi_dict_from_json()` (and a test
  for it) to utils so that all tests now use this
  helper method to build resp headers (see the
  sketch after this list)
- Refactored the metadata structure to test three
  classes separately
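
A rough sketch of what such a test helper could look like, assuming it accepts a JSON string of headers; the actual signature in utils may differ:

```python
import json

from multidict import CIMultiDict, CIMultiDictProxy

def get_multi_dict_from_json(raw: str) -> CIMultiDictProxy:
    """Build an immutable, case-insensitive multi-dict of response headers
    for tests, mirroring what aiohttp returns at runtime."""
    return CIMultiDictProxy(CIMultiDict(json.loads(raw).items()))
```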
- Removed import aliases such as `core_exception`,
  `pd_settings` and `pd_utils` since there are no
  shadowing issues any more
- Both GC file and folder metadata now use this
  dedicated method to initialize with the object
  name and aiohttp's "MultiDict" response headers
- Removed the alternative constructor for GC base
  metadata and updated related code and tests
- Moved quirks in comments into DocStr so that
  they are available in WB Docs
- Use double backticks (``) for code in DocStr
- Fixed Sphinx warnings
Side effect: modified the function signature for
get_multi_dict_from_python_dict() to expect a dict
instead of JSON; updated all related tests.
- GC's private members are now available in the
  WB Docs by adding :private-members:
- Remove :inherited-members: and :undoc-members:
- Updated the .rst files to include GC, metadata
  and utils
- mypy doesn't handle class inheritance well
- mypy has a problem with **{} arguments
- mypy doesn't handle multiple possible return
  types well
- Upload fails because CIMultiDictProxy inherits
  from MultiDictProxy but not from MultiDict.
  However, aiohttpretty returns CIMultiDict while
  aiohttp returns CIMultiDictProxy, and the type
  check is strict and only recognizes CIMultiDict
- Updated the check statement to accept either
  MultiDict or MultiDictProxy (see the sketch
  after this list):
  - WB code uses CIMultiDictProxy (a subclass of
  MultiDictProxy) since aiohttp parses the hash
  headers already.
  - WB test code uses MultiDict to modify the
  dictionary in get_multi_dict_from_python_dict(),
  which returns MultiDictProxy.
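
A rough sketch of the relaxed check (illustrative):

```python
from multidict import MultiDict, MultiDictProxy

def check_resp_headers_type(resp_headers) -> None:
    # Accept both: aiohttp hands back CIMultiDictProxy (a MultiDictProxy
    # subclass) at runtime, while test fixtures build CIMultiDict.
    assert isinstance(resp_headers, (MultiDict, MultiDictProxy))
```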
 * Minor updates to formatting of method signatures, import order, and
   judicious application of DeMorgan's Law to make conditionals more
   readable.

 * Update formatting of docstrings to make Sphinx docs more
   cross-linked and browsable.
 * Minor style fixes.

[SVCS-NO-TICKET] Improve The Command `invoke test`
Closes: CenterForOpenScience#353
@cslzchen cslzchen force-pushed the ft/dropbox-chunked-uploads branch 10 times, most recently from e2718b1 to 4913bc4 Compare July 19, 2018 16:44
felliott and others added 18 commits July 25, 2018 10:35
 * Analytics fields will be left in and hardcoded to `None` to avoid
   changing the schema.
 - For chunked upload, add `upload_part()` to handle uploading a single chunk;
   the chunked-upload path now calls `upload_part()` once per chunk.

 - For normal upload, move the code into `contiguous_upload()`
- According to Dropbox API docs, files larger than 150 MB must
  use chunked upload. Chunks can be any size up to 150 MB; a
  typical size is 4 MB, which is what WB uses. The maximum file
  size that can be uploaded is 350 GB.
 * The dropbox provider should error if it can't get a session
   identifier.  A `KeyError` should suffice until we see what an
   actual error condition looks like.
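
A rough sketch of the dispatch and chunking described in the two commits above; the provider method names, stream interface, and session-start response key are assumptions rather than the actual Dropbox provider code:

```python
CHUNK_SIZE = 4 * 1024 * 1024          # 4 MB parts; Dropbox allows parts up to 150 MB
CONTIGUOUS_LIMIT = 150 * 1024 * 1024  # larger files must go through an upload session

async def upload(provider, stream, path):
    """Dispatch to the chunked path for large files, contiguous otherwise."""
    if stream.size > CONTIGUOUS_LIMIT:
        return await chunked_upload(provider, stream, path)
    return await provider.contiguous_upload(stream, path)

async def chunked_upload(provider, stream, path):
    # Fail fast with a plain KeyError if the session-start response
    # doesn't contain a session identifier.
    session_id = (await provider.create_upload_session())['session_id']
    offset = 0
    while offset < stream.size:
        chunk = await stream.read(CHUNK_SIZE)
        await provider.upload_part(session_id, offset, chunk)  # one part per call
        offset += len(chunk)
    return await provider.complete_upload_session(session_id, offset, path)
```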