-
Notifications
You must be signed in to change notification settings - Fork 300
[WIP] feat: add support for chunked uploads #851
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for working on this @hugomrdias!
My understanding is that
- Upload payload is split into small parts (
chunkSize = 256000
)
(we probably want to make it a parameter) - Each part is sent as a a sequence of HTTP POST requests that have
- a unique identifier for entire upload session (uuid? – see below)
- a sequential counter within upload session (a chunk index)
- API backend needs to support additional HTTP headers to perform re-assembly of entire payload from chunks and passing it to the regular
ipfs.files.add
call in a transparent manner- PR for js-ipfs: [WIP] feat: support chunked add requests ipfs/js-ipfs#1540
- PR for go-ipfs: (TODO)
- The goal is to hide chunking behind ordinary
ipfs.files.add
but for now/api/v0/add-chunked
is used as the PoC endpoint in js-ipfs
Please let me know if I missed anything or you have a different vision for it.
Some early feedback from my end
- 👍 for moving forward: I am very hopeful this will address various bugs and limitations coming from buffering entire thing in memory in web browser contexts.
(we really need to solve big uploads from the browser.. without crashing it 🙃) - for progress reporting we could so various things, from the top of my head options are:
- A) add another endpoint or a header and periodically send asynchronous request to fetch upload progress updates for specific "group identifier" (unique identifier of ongoing upload session)
- B) do the same thing
files.add
in go-ipfs does right now (streaming status information while processing large requests), but make it aware of total upload size - C) do nothing extra, and report progress based on how many of chunks were uploaded (add ability to control chunk size via param to control progress reporting resolution vs performance)
- On backends:
- js-ipfs: will comment in [WIP] feat: support chunked add requests ipfs/js-ipfs#1540
- go-ipfs: as soon we have a working PoC against js-ipfs we should write a small spec and include go-ipfs in the loop to plan adding support there as well
- Note to self: increased number of HTTP calls might produce a bigger surface for
http: invalid Read on closed Body
to occur when used with go-ipfs.
- Note to self: increased number of HTTP calls might produce a bigger surface for
- API endpoint: Is the plan to use a separate endpoint or eventually merge support into
/api/v0/add
? (I assume the latter) - Added some inline comments with usual bikeshed :))
src/add2/add2.js
Outdated
.then(res => res.json()) | ||
} | ||
|
||
function createName () { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ipfs-Chunk-Name
I may be missing something here, but is there a reason why we can't use UUID v5 here? JavaScript's random numbers are usually weak, so v5 sounds like a safer option:
RFC 4122 advises that "distributed applications generating UUIDs at a variety of hosts must be willing to rely on the random number source at all hosts. If this is not feasible, the namespace variant should be used."
Fast generation of RFC4122 UUIDs: require('uuid/v5')
.
If use of UUID here is fine, we may consider renaming the field to -Uuid
or even -Chunk-Group-Uuid
to remove ambiguity.
src/add2/add2.js
Outdated
'Content-Range': `bytes ${start}-${end}/${size}`, | ||
'Ipfs-Chunk-Name': name, | ||
'Ipfs-Chunk-Id': id, | ||
'Ipfs-Chunk-Boundary': boundary |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we use X-
prefix for all custom headers?
We already have X-Ipfs-Path
and I wonder if we should follow that convention.
src/add2/add2.js
Outdated
'Content-Type': 'application/octet-stream', | ||
'Content-Range': `bytes ${start}-${end}/${size}`, | ||
'Ipfs-Chunk-Name': name, | ||
'Ipfs-Chunk-Id': id, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Id
probably should be renamed to Index
(its index everywhere else)
@lidel your understanding is correct :), updated the PR with some of your feedback regarding the uuid i had looked into it, for now i want to keep the poor man's version should be safe for now, it goes over math.random a couple of times. (i have a note to go back to this) final integration will use the normal add api, only with one change a new option called chunkSize, if this option is set to a number we go through the chunked codepath. about progress im still trying to add directly without files if i succeed this should work the same as right now, if not, one solution i though was adding a new handler the current |
@hugomrdias thanks! My view is that we should do our best to to make it work without changing current What if we detect presence of For upload split into N chunks:
The end result would be a best-effort progress reporting that works with existing API and that is not stuck at 0% until the last chunk and behaves in expected manner (% always grows). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🚲 🏡
Some feedback on how to make this feature more dev friendly:
Reducing number of headers
I just noticed we already have kinda unrelated header with chunk
in it: X-Chunked-Output
.
We probably should follow that naming convention somehow, or avoid use of chunk
in names, maybe rename it with upload
. Which got me thinking about the number of new headers this PR introduces and how we can make it simpler.
What if we replace three custom headers with only one + repurpose Content-Range?
I think you already started refactoring in that direction, but just for the record:
X-Chunked-Input: <upload-group-uuid>
Content-Range: <unit> <range-start>-<range-end>/<total-size>
X-Chunked-Input
+ Content-Range
(or X-
version with the same semantics) seem to pass all the info we need for re-assembly chunks in js-ipfs.
Chunked upload should also work without js-ipfs-api
A good indicator of being client-agnostic will be a demo of this type of upload with curl
.
Assuming we have chunks on disk, we should be able to do something like below:
$ curl -X POST http://127.0.0.1:5001/api/v0/add-chunked \
-H 'Content-Type: application/octet-stream' \
-H 'Content-Disposition: "file; filename="BIG-FILE.avi"' \
-H 'X-Chunked-Input: 87d0d13b-07de-4df1-b274-9b26a07a6f2a' \
-H 'Content-Range: bytes 0-1048576/2097152' \
--data-binary @/tmp/chunk1
$ curl -X POST http://127.0.0.1:5001/api/v0/add-chunked \
-H 'Content-Type: application/octet-stream' \
-H 'Content-Disposition: "file; filename="BIG-FILE.avi"' \
-H 'X-Chunked-Input: 87d0d13b-07de-4df1-b274-9b26a07a6f2a' \
-H 'Content-Range: bytes 1048577-2097152/2097152' \
--data-binary @/tmp/chunk2
..and get CID of BIG-FILE.avi
in response for the last request.
How retry/resume will look like?
What happens when n-th chunk fails due to temporary network problem?
js-ipfs-api
: Fail entire upload? Retry failed chunk a few times before throwing an error with the offset of the end of last successful chunk?
Perhaps js-ipfs-api could internally retry chunk three times before returning an error.
js-ipfs
: should there be a way for resuming failed upload from last successful chunk?
@lidel the first two topics should be adressed in the last commit about the resumeable stuff, its mostly:
so, lets leave the resume feature to a follow up PR |
the jsdoc should create some nice docs with documentation.js
also should give code completion to anyone using editors with jsdoc support |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All in all, it is looking good to me. Added some minor notes
src/files/add-experimental.js
Outdated
* @typedef {Object} AddResult | ||
* @property {string} path | ||
* @property {string} hash | ||
* @property {number} size |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add descriptions for the properties here as well
this.index = 0 | ||
this.rangeStart = 0 | ||
this.rangeEnd = 0 | ||
this.rangeTotal = 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about using an object to handle all the range related properties?
// end runs | ||
}) | ||
|
||
it.skip('files.add pins by default', (done) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is blocking this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't run connected to a js daemon, not related to this PR
}) | ||
}) | ||
|
||
it.skip('files.add with pin=false', (done) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is blocking this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't run connected to a js daemon, not related to this PR
@Stebalien could we get your thoughts on adding this to go-ipfs? This PR is adding a feature to the HTTP @lidel kindly put together a good summary of the proposed process:
Reasons for doing this:
|
@hugomrdias @lidel I think that regardless of what happens with this PR we need to switch to using the streaming fetch API. Firefox is notably the only browser that hasn't shipped the streams API yet but it sounds like this might happen soon. I think we can conditionally opt out of it for Firefox for the time being. Switching to using the streaming fetch API will solve the buffering issue without any changes to the HTTP API and depending on priorities for go-ipfs we might be able to ship this before chunked uploads. It's also worth noting that, streaming fetch will be way more efficient then multiple HTTP requests for chunked uploading. |
That's only for response bodies not request bodies, this is the only way currently available to us. I didn't find any indication that request bodies will get streams soon on any browser |
You're absolutely right - my bad. Thanks for clarifying! |
Sounds like worth mentioning here, in case concepts from this PR are revisited in the future:
|
yep i based the impl in tus |
This is still a work in progress to add support for chunked uploads (
ipfs.add
) and fix multiple issues related to adding big files.Tests are filtered here https://github.com/ipfs/js-ipfs-api/blob/90c40363fbcd55d29307e51f4feabb8be867ded8/test/add-experimental.spec.js#L38-L46 to make review easy, just run ipfs daemon with ipfs/js-ipfs#1540
features/fixes in this PR together with ipfs/js-ipfs#1540:
File
's directly from the inputNotes:
Needs:
Todo:
concurrent upload chunksnew PR for thisRelated: