
Fix and rework GPT-TF.js #807

Open · JulienVig wants to merge 11 commits from 654-improve-gpt-julien into develop
Conversation

JulienVig (Collaborator) commented:
Addresses #654

  • Fix weight initialization from zero to random uniform
  • Implement weight sharing between the token embeddings and the language modeling head
  • Improve generation with a top-k sampling option (see the sketch after this list)
  • Add a seed for deterministic runs
  • Implement text loaders by byte chunk rather than by line, which avoids padding each line to the context length
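To make the top-k sampling and seeding points concrete, here is a minimal tf.js sketch of how such sampling can work; `sampleTopK` is a hypothetical helper for illustration, not the PR's actual implementation:

```ts
import * as tf from "@tensorflow/tfjs";

// Hypothetical helper, for illustration only: sample the next token id from the
// k most likely entries of a 1D logits tensor. A fixed seed makes the draw deterministic.
function sampleTopK(logits: tf.Tensor1D, k: number, seed?: number): number {
  return tf.tidy(() => {
    const { values, indices } = tf.topk(logits, k);
    // tf.multinomial interprets its input as unnormalized log-probabilities,
    // so the top-k logits can be passed in directly.
    const pos = tf.multinomial(values as tf.Tensor1D, 1, seed).dataSync()[0];
    // map the position within the top-k set back to a vocabulary id
    return indices.dataSync()[pos];
  });
}

// e.g. const nextToken = sampleTopK(logits, 40, 42);
```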

JulienVig self-assigned this on Oct 16, 2024
JulienVig force-pushed the 654-improve-gpt-julien branch 19 times, most recently from a0804f9 to 1d88d35 on October 23, 2024 11:36
…Loaders now yield token sequences of length blockSize
JulienVig marked this pull request as ready for review on October 24, 2024 15:58
tharvik (Collaborator) left a comment:
Superb work, thanks for clearing the GPT's mud, every comment makes it more understandable!
Yeah, sadly, as I forgot to merge the processing PR (#781) before you branched off, the whole processing pipeline changed a lot. Sorry for stepping on your toes (hopefully, it will simplify this PR).

By the way, it seems that @xenova/transformers has recently been renamed to @huggingface/transformers. Did you try it out? Maybe it'll help with the tokenizer usage (it doesn't look much changed to me, but you know best).
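If the only change really is the package rename, switching might look like this (a sketch under that assumption; the tokenizer API is taken to be unchanged):

```ts
// Sketch assuming only the package name changed:
import { AutoTokenizer } from "@huggingface/transformers"; // previously "@xenova/transformers"

const tokenizer = await AutoTokenizer.from_pretrained("Xenova/gpt2");
const ids = tokenizer.encode("hello world");
console.log(tokenizer.decode(ids, { skip_special_tokens: true }));
```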

Comment on lines +53 to +94
const tokens = models.tokenize(tokenizer, endOfPreviousChunk + chunk, {
  padding: false,
  truncation: false,
  return_tensor: false,
})
if (tokens.length < blockSize + 1) {
  // throw if it happens on the 1st iteration
  if (iteration === 0)
    throw new Error(`the chunk (${tokens.length} tokens) is too small ` +
      `to get a sequence of length blockSize (${blockSize + 1} tokens). ` +
      `Either the text file or the chunk size (${chunkBitSize} bits) is too small.`);
  // if this isn't the first iteration we simply skip,
  // as we expect the last chunk to be potentially smaller than the block size
  debug("chunk smaller than block size, loading next chunk")
  continue
}
debug("batch per chunk: %o", tokens.length / (batchSize * blockSize))
let currentPosition = 0;
// yield one block of tokens at a time
while (currentPosition + blockSize + 1 <= tokens.length) {
  yield tokens.slice(currentPosition, currentPosition + blockSize + 1);
  currentPosition += blockSize; // don't add + 1 here
}
// keep the last tokens for the next chunk
// if this was the last one the remaining tokens are discarded
if (currentPosition < tokens.length) {
  // We actually need to decode the tokens to get the leftover text
  // instead of simply keeping the remaining tokens.
  // This is because the tokens may be different once prepended to the next chunk,
  // e.g. if the remaining text is ". A" and the next chunk starts with "nother",
  // the tokenization will be different than if we simply concatenate the remaining tokens
  endOfPreviousChunk = tokenizer.decode(
    tokens.slice(currentPosition),
    { skip_special_tokens: true }
  )
  debug("End of chunk, remaining text: '%s'", endOfPreviousChunk)
} else {
  // Note that the difference between tokenizing and then concatenating
  // vs concatenating and then tokenizing can happen if there is no
  // remaining text. We consider this difference negligible
  endOfPreviousChunk = "";
}
tharvik (Collaborator):
This is duplicated in discojs-web/loaders/text, which hints to me that it shouldn't happen in the loader but be applied afterwards.
The issue at hand is that the previous version output lines. I think we can change it to output characters (single-letter strings) instead. That would also drop the blockSize, batchSize & minChunkSize arguments, which aren't really relevant for reading text (separation of concerns and all that).

In the newly merged processing PR (#781), it is much simpler to combine such transformations; I think something like

loadText($path).batch($blockSize).map((block) => tokenize(block, $tokenizer))

with tokenize updated to accept a block/List<string> instead, and maybe dropping the padding (but what would the behavior be at the end of the file?).
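A rough sketch of what this separation could look like, using plain async generators rather than the actual discojs Dataset API (all names here are illustrative assumptions):

```ts
// Illustrative only: the loader yields single characters, batching groups them
// into fixed-size blocks, and tokenization is mapped over the blocks afterwards.
async function* loadChars(text: string): AsyncGenerator<string> {
  for (const char of text) yield char;
}

async function* batch<T>(source: AsyncIterable<T>, size: number): AsyncGenerator<T[]> {
  let block: T[] = [];
  for await (const item of source) {
    block.push(item);
    if (block.length === size) {
      yield block;
      block = [];
    }
  }
  if (block.length > 0) yield block; // the last block may be shorter (end-of-file behavior)
}

// e.g.: for await (const block of batch(loadChars(text), blockSize))
//         yield tokenize(block.join(""), tokenizer)  // hypothetical tokenize over a block
```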

discojs-node/src/loaders/text.ts (resolved)
discojs-node/src/loaders/text.ts (resolved)
cli/src/benchmark_gpt.ts (resolved)
Comment on lines +68 to +72
const expectedTokens = models.tokenize(tokenizer, text, {
  padding: false,
  truncation: false,
  return_tensor: false
})
tharvik (Collaborator):
Yeah, better to test for the actual token values. It is not really an issue, but it assumes that models.tokenize will always work as expected. Also, I have built test codebases with a bunch of logic like this in the tests, and while it looks correct, it breaks quite easily.

Comment on lines +130 to +136
# Runs tests in parallel with matrix strategy https://docs.cypress.io/guides/guides/parallelization
# https://docs.github.com/en/actions/using-jobs/using-a-matrix-for-your-jobs
# Also see warning here https://github.com/cypress-io/github-action#parallel
strategy:
  fail-fast: false # https://github.com/cypress-io/github-action/issues/48
  matrix:
    containers: [1, 2] # Uses 2 parallel instances
tharvik (Collaborator):
Why is parallelism needed?

Comment on lines +12 to +15
log: (message) => {
  console.log(message)
  return null
},
tharvik (Collaborator):
debug?
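One way to read this suggestion (an assumption, not confirmed in the thread): route the Cypress task's logging through the debug package instead of console.log, for example:

```ts
import createDebug from "debug";

// assumed namespace, for illustration only
const debug = createDebug("webapp:cypress");

const tasks = {
  log: (message: string) => {
    debug(message);
    return null; // Cypress tasks must return a value (or null), not undefined
  },
};
```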

import { loadText } from "@epfml/discojs-web";
import DatasetInput from "./DatasetInput.vue";
import FileSelection from "./FileSelection.vue";
const { task } = defineProps<{ task: Task }>()
tharvik (Collaborator):
Destructuring props only works in Vue >= 3.5, but we are using Vue 3.4 here.
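A Vue 3.4-compatible alternative (a sketch; the import path for Task is assumed) is to keep the props object and read the field from it:

```ts
import type { Task } from "@epfml/discojs"; // assumed import path

// keep the props object instead of destructuring, so it works on Vue < 3.5
const props = defineProps<{ task: Task }>();
// use props.task wherever `task` was used
```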

import { loadText } from "@epfml/discojs-web";
import DatasetInput from "./DatasetInput.vue";
import FileSelection from "./FileSelection.vue";
const { task } = defineProps<{ task: Task }>()
tharvik (Collaborator):
I tried pretty hard to avoid referencing Task (the Ugly Global, as I call it in my sleep); hopefully, it won't be needed with the latest changes to loadText.

@@ -39,12 +40,12 @@
     against a chosen dataset of yours. Below, once you assessed the model,
     you can compare the ground truth and the predicted values
     <div class="flex justify-center mt-4">
-      <CustomButton @click="startTest()"> test </CustomButton>
+      <CustomButton @click="startTest()" data-cy="start-test"> test </CustomButton>
tharvik (Collaborator):
Yeah, I saw that a few times in Cypress' examples, but I don't like this pattern: it mixes test-specific code into the rest of the codebase. Especially here, it is possible to find the button by its "test" label instead.
Same idea below.
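For instance, Cypress can select the button by its visible label rather than a test-only attribute (a sketch using standard Cypress commands):

```ts
// select by the visible "test" label instead of a data-cy attribute
cy.contains("button", "test").click();

// instead of the attribute-based selector introduced in this PR:
// cy.get('[data-cy="start-test"]').click();
```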
