Fix and rework GPT-TF.js #807
base: develop
Conversation
Commits (force-pushed from a0804f9 to 1d88d35, then to 03e5c7d):
- …odeling head and attention bias
- … implement topk sampling
- …ers following GPT2 convention, use LogLayer
- … and language modeling head
- …Loaders now yield token sequences of length blockSize
superb work, thanks for clearing the GPT's mud, every comment makes it more understandable!

yeah, sadly, as I forgot to merge the processing PR (#781) before you branched off, the whole processing pipeline changed a lot. sorry for the toe-stepping (hopefully, it will simplify this PR).

btw, it seems that @xenova/transformers has recently been renamed to @huggingface/transformers. did you try it out? maybe it'll help with the tokenizer usage (doesn't look much changed to me, but you know best)
```ts
const tokens = models.tokenize(tokenizer, endOfPreviousChunk + chunk, {
  padding: false,
  truncation: false,
  return_tensor: false,
})
if (tokens.length < blockSize + 1) {
  // throw if it happens on the 1st iteration
  if (iteration === 0)
    throw new Error(
      `the chunk (${tokens.length} tokens) is too small ` +
      `to get a sequence of length blockSize (${blockSize + 1} tokens). ` +
      `Either the text file or the chunk size (${chunkBitSize} bits) is too small.`,
    );
  // if this isn't the first iteration we simply skip,
  // as we expect the last chunk to be potentially smaller than the block size
  debug("chunk smaller than block size, loading next chunk")
  continue
}
debug("batch per chunk: %o", tokens.length / (batchSize * blockSize))
let currentPosition = 0;
// yield one block of tokens at a time
while (currentPosition + blockSize + 1 <= tokens.length) {
  yield tokens.slice(currentPosition, currentPosition + blockSize + 1);
  currentPosition += blockSize; // don't add + 1 here
}
// keep the last tokens for the next chunk;
// if this was the last one the remaining tokens are discarded
if (currentPosition < tokens.length) {
  // We actually need to decode the tokens to get the leftover text
  // instead of simply keeping the remaining tokens.
  // This is because the tokens may be different once prepended to the next chunk,
  // e.g. if the remaining text is ". A" and the next chunk starts with "nother",
  // the tokenization will differ from simply concatenating the remaining tokens.
  endOfPreviousChunk = tokenizer.decode(
    tokens.slice(currentPosition),
    { skip_special_tokens: true },
  )
  debug("End of chunk, remaining text: '%s'", endOfPreviousChunk)
} else {
  // Note that the difference between tokenizing and then concatenating
  // vs concatenating and then tokenizing can happen if there is no
  // remaining text. We consider this difference negligible.
  endOfPreviousChunk = "";
}
```
duplicated in discojs-web/loaders/text. that hints to me that it shouldn't happen in the loader but be applied after.

the issue at hand is that lines were output by the previous version. I think we can change it to output characters (single-letter strings) instead. that would also drop the `blockSize`, `batchSize` & `minChunkSize` arguments, which aren't really relevant for reading text (separation of concerns and all that).

in the newly merged processing PR (#781), it is much simpler to combine such transformations. I think smth like `loadText($path).batch($blockSize).map((block) => tokenize(block, $tokenizer))`, with `tokenize` updated to accept a block/`List<string>` instead, and maybe drop the padding (but what would be the behavior at the end of the file?)
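a rough sketch of that shape (assuming the post-#781 `Dataset` exposes chainable `batch`/`map`; the names here are illustrative, not the final API):

```ts
// Loader stays tokenizer-agnostic: it yields raw text items, and
// tokenization becomes an ordinary transformation applied afterwards.
const dataset = loadText(path)
  .batch(blockSize)                            // group raw items into blocks
  .map((block) => tokenize(block, tokenizer)); // tokenize a whole block at once
```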
```ts
const expectedTokens = models.tokenize(tokenizer, text, {
  padding: false,
  truncation: false,
  return_tensor: false,
})
```
yeah, better to test for the actual token values. it is not really an issue, but it assumes that `models.tokenize` will always work as expected. also, I built test codebases with a bunch of logic testing like this, and while it looks correct, it breaks quite easily.
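i.e. something along these lines, with the expected ids pinned in the fixture (the values below are placeholders, not real GPT-2 ids):

```ts
// Hypothetical: hard-code the ids the tokenizer should produce for the
// fixture text, instead of re-deriving them via models.tokenize at test time.
const expectedTokens = [464, 2068, 7586, 21831]; // placeholder ids
expect(tokens).to.deep.equal(expectedTokens);
```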
```yaml
# Runs tests in parallel with matrix strategy https://docs.cypress.io/guides/guides/parallelization
# https://docs.github.com/en/actions/using-jobs/using-a-matrix-for-your-jobs
# Also see warning here https://github.com/cypress-io/github-action#parallel
strategy:
  fail-fast: false # https://github.com/cypress-io/github-action/issues/48
  matrix:
    containers: [1, 2] # Uses 2 parallel instances
```
why is parallelism needed?
```ts
log: (message) => {
  console.log(message)
  return null
},
```
debug?
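presumably pointing at the `debug` package; a minimal sketch of what that swap might look like (the namespace is illustrative):

```ts
import createDebug from "debug";

const debug = createDebug("discojs:worker"); // hypothetical namespace

// inside the same options object as in the diff:
log: (message) => {
  debug(message); // silent unless DEBUG=discojs:* is set
  return null;
},
```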
```ts
import { loadText } from "@epfml/discojs-web";
import DatasetInput from "./DatasetInput.vue";
import FileSelection from "./FileSelection.vue";

const { task } = defineProps<{ task: Task }>()
```
destructuring props only works in Vue >= 3.5, but we are using Vue 3.4 here
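a sketch of the 3.4-compatible form, which also keeps reactivity (reactive props destructure only stabilized in Vue 3.5):

```ts
// Vue 3.4: keep the props object intact and read off it, since
// destructuring defineProps loses reactivity before 3.5.
const props = defineProps<{ task: Task }>();
// then reference props.task instead of a bare task
```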
```ts
import { loadText } from "@epfml/discojs-web";
import DatasetInput from "./DatasetInput.vue";
import FileSelection from "./FileSelection.vue";

const { task } = defineProps<{ task: Task }>()
```
I tried pretty hard to avoid referencing `Task` (the Ugly Global, as I call it in my sleep); hopefully it won't be needed with the latest changes to `loadText`
```diff
@@ -39,12 +40,12 @@
     against a chosen dataset of yours. Below, once you assessed the model,
     you can compare the ground truth and the predicted values
     <div class="flex justify-center mt-4">
-      <CustomButton @click="startTest()"> test </CustomButton>
+      <CustomButton @click="startTest()" data-cy="start-test"> test </CustomButton>
```
yeah, I saw that a few times in Cypress' examples but I don't like this pattern. it mixes test-specific code into the rest of the codebase. especially here, it is possible to find the button by its "test" label.

same idea below.
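for reference, the label-based lookup would look roughly like this (assuming CustomButton renders a native button element):

```ts
// Find the button by its visible "test" label instead of a data-cy hook:
cy.contains("button", "test").click();
```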
Addresses #654