Fix and rework GPT-TF.js #807
base: develop
We also test with "Xenova/gpt2", but our code should work with any tokenizer; can you switch this one to another tokenizer?
Better to check the whole dataset content with `Array.fromAsync` (as done locally in discojs/src/dataset/dataset.spec.ts); that's more readable and ensures that it doesn't generate more values than expected.
Yeah, better to test for the actual token values. It is not really an issue, but it assumes that `models.tokenize` will always work as expected. Also, I've built test codebases with a bunch of logic testing like this, and while it looks correct, it breaks quite easily.
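For instance, hard-coding the expected token ids keeps the test independent of `models.tokenize` (the ids below are made up purely for illustration):

```typescript
// Hypothetical expected ids for some input text, written out by hand
// so the assertion doesn't depend on models.tokenize behaving correctly.
const expectedTokens = [15496, 995];

function sameTokens(actual: number[]): boolean {
  return (
    actual.length === expectedTokens.length &&
    actual.every((token, i) => token === expectedTokens[i])
  );
}
```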
This tests more than advertised; you can simplify it by only checking the size of each block.
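A minimal sketch of that simplification, assuming blocks of tokens and a hypothetical `blockSize` value:

```typescript
const blockSize = 4; // hypothetical value for illustration

// Only assert the shape: every block has exactly blockSize elements,
// without checking the blocks' actual content.
function allBlocksSized(blocks: number[][]): boolean {
  return blocks.every((block) => block.length === blockSize);
}
```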
Consider moving the definitions and the actions that don't need the file outside of `withFile` (also for the previous ones). You can actually return a value out of `withFile` if wanted.
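To illustrate, here is a sketch using a minimal stand-in for tmp-promise's `withFile` (not the real implementation): definitions that don't need the file stay outside the callback, and the callback's return value becomes the result of `withFile`.

```typescript
import * as fs from "node:fs/promises";
import * as os from "node:os";
import * as path from "node:path";

// Minimal stand-in for tmp-promise's withFile: creates a temp file, runs
// the callback, cleans up, and resolves to the callback's return value.
async function withFile<T>(fn: (file: { path: string }) => Promise<T>): Promise<T> {
  const dir = await fs.mkdtemp(path.join(os.tmpdir(), "demo-"));
  const p = path.join(dir, "f.txt");
  try {
    return await fn({ path: p });
  } finally {
    await fs.rm(dir, { recursive: true, force: true });
  }
}

// Constants that don't depend on the file live outside the callback.
const expected = "hello";

async function readBack(): Promise<string> {
  // The value returned from the callback is returned by withFile itself.
  return withFile(async ({ path: p }) => {
    await fs.writeFile(p, expected);
    return fs.readFile(p, "utf8");
  });
}
```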
Please split it into multiple lines using `[…].join(" ")`; that's more readable.
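For example (the fragments below are placeholders, not the actual string from the test):

```typescript
// One fragment per line instead of a single long literal,
// joined back into one string with single spaces.
const message = [
  "first part of a long sentence,",
  "second part,",
  "and the final part.",
].join(" ");
```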
Duplicated from the previous test; consider a common function.
Seems to be somewhat duplicated from `blockSize`.
Duplicated in discojs-web/loaders/text. That hints to me that it shouldn't happen in the loader but be applied afterwards.

The issue at hand is that lines were outputted by the previous version. I think we can change it to output characters (single-letter `string`s) instead. That would also drop the `blockSize`, `batchSize` & `minChunkSize` arguments, which aren't really relevant for reading text (separation of concerns and all that).

With the newly merged processing PR (#781), it is much simpler to combine such transformations; I think we could do something along those lines, with `tokenize` updated to accept `List<string>` instead, and maybe drop the padding (but what would be the behavior at the end of the file?).
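A hedged sketch of a loader that yields single characters and leaves all chunking to later processing steps (names are illustrative, not the actual discojs API):

```typescript
// Yield one character at a time from an async stream of text chunks;
// blockSize/batchSize/minChunkSize become a later pipeline stage's concern.
async function* characters(
  chunks: AsyncIterable<string>,
): AsyncGenerator<string> {
  for await (const chunk of chunks) {
    yield* chunk; // a string iterates code point by code point
  }
}

// Tiny in-memory stream standing in for a real file reader.
async function* demoChunks(): AsyncGenerator<string> {
  yield "ab";
  yield "c";
}

async function collect(): Promise<string[]> {
  const out: string[] = [];
  for await (const c of characters(demoChunks())) out.push(c);
  return out;
}
```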