
Fix and rework GPT-TF.js #807

Open · JulienVig wants to merge 11 commits from 654-improve-gpt-julien into develop
Conversation

JulienVig (Collaborator) commented:
Addresses #654

  • Fix weight initialization from zero to random uniform
  • Implement weight sharing between the token embeddings and the language modeling head
  • Improve generation with a top-k sampling option (see the sketch after this list)
  • Add a seed for deterministic runs
  • Implement text loaders by byte chunk rather than by line, which avoids padding each line to the context length
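To make the top-k sampling and seeding points concrete, here is a minimal tf.js sketch of how such sampling can work; `sampleTopK` is a hypothetical helper for illustration, not the PR's actual implementation:

```ts
import * as tf from "@tensorflow/tfjs";

// Hypothetical helper, for illustration only: sample the next token id from the
// k most likely entries of a 1D logits tensor. A fixed seed makes the draw deterministic.
function sampleTopK(logits: tf.Tensor1D, k: number, seed?: number): number {
  return tf.tidy(() => {
    const { values, indices } = tf.topk(logits, k);
    // tf.multinomial interprets its input as unnormalized log-probabilities,
    // so the top-k logits can be passed in directly.
    const pos = tf.multinomial(values as tf.Tensor1D, 1, seed).dataSync()[0];
    // map the position within the top-k set back to a vocabulary id
    return indices.dataSync()[pos];
  });
}

// e.g. const nextToken = sampleTopK(logits, 40, 42);
```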

JulienVig self-assigned this on Oct 16, 2024
JulienVig force-pushed the 654-improve-gpt-julien branch 19 times, most recently from a0804f9 to 1d88d35 on October 23, 2024 11:36
…Loaders now yield token sequences of length blockSize
JulienVig marked this pull request as ready for review on October 24, 2024 15:58
tharvik (Collaborator) left a comment:
Superb work, thanks for clearing the GPT's mud, every comment makes it more understandable!
Yeah, sadly, as I forgot to merge the processing PR (#781) before you branched off, the whole processing pipeline changed a lot. Sorry for stepping on your toes (hopefully, it will simplify this PR).

By the way, it seems that @xenova/transformers has recently been renamed to @huggingface/transformers. Did you try it out? Maybe it'll help with the tokenizer usage (it doesn't look much changed to me, but you know best).
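If the only change really is the package rename, switching might look like this (a sketch under that assumption; the tokenizer API is taken to be unchanged):

```ts
// Sketch assuming only the package name changed:
import { AutoTokenizer } from "@huggingface/transformers"; // previously "@xenova/transformers"

const tokenizer = await AutoTokenizer.from_pretrained("Xenova/gpt2");
const ids = tokenizer.encode("hello world");
console.log(tokenizer.decode(ids, { skip_special_tokens: true }));
```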

Comment on lines +53 to +94
const tokens = models.tokenize(tokenizer, endOfPreviousChunk + chunk, {
  padding: false,
  truncation: false,
  return_tensor: false,
})
if (tokens.length < blockSize + 1) {
  // throw if it happens on the 1st iteration
  if (iteration === 0)
    throw new Error(`the chunk (${tokens.length} tokens) is too small ` +
      `to get a sequence of length blockSize (${blockSize + 1} tokens). ` +
      `Either the text file or the chunk size (${chunkBitSize} bits) is too small.`);
  // if this isn't the first iteration we simply skip,
  // as we expect the last chunk to be potentially smaller than the block size
  debug("chunk smaller than block size, loading next chunk")
  continue
}
debug("batch per chunk: %o", tokens.length / (batchSize * blockSize))
let currentPosition = 0;
// yield one block of tokens at a time
while (currentPosition + blockSize + 1 <= tokens.length) {
  yield tokens.slice(currentPosition, currentPosition + blockSize + 1);
  currentPosition += blockSize; // don't add + 1 here
}
// keep the last tokens for the next chunk
// if this was the last one the remaining tokens are discarded
if (currentPosition < tokens.length) {
  // We actually need to decode the tokens to get the leftover text
  // instead of simply keeping the remaining tokens.
  // This is because the tokens may be different once prepended to the next chunk,
  // e.g. if the remaining text is ". A" and the next chunk starts with "nother",
  // the tokenization will be different than if we simply concatenate the remaining tokens
  endOfPreviousChunk = tokenizer.decode(
    tokens.slice(currentPosition),
    { skip_special_tokens: true }
  )
  debug("End of chunk, remaining text: '%s'", endOfPreviousChunk)
} else {
  // Note that the difference between tokenizing and then concatenating
  // vs concatenating and then tokenizing can happen if there is no
  // remaining text. We consider this difference negligible
  endOfPreviousChunk = "";
}
tharvik (Collaborator):
This is duplicated in discojs-web/loaders/text, which hints to me that it shouldn't happen in the loader but be applied afterwards.
The issue at hand is that the previous version output lines. I think we can change it to output characters (single-letter strings) instead. That would also drop the blockSize, batchSize & minChunkSize arguments, which aren't really relevant for reading text (separation of concerns and all that).

In the newly merged processing PR (#781), it is much simpler to combine such transformations; I think something like

loadText($path).batch($blockSize).map((block) => tokenize(block, $tokenizer))

with tokenize updated to accept a block/List<string> instead, and maybe dropping the padding (but what would the behavior be at the end of the file?).
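A rough sketch of what this separation could look like, using plain async generators rather than the actual discojs Dataset API (all names here are illustrative assumptions):

```ts
// Illustrative only: the loader yields single characters, batching groups them
// into fixed-size blocks, and tokenization is mapped over the blocks afterwards.
async function* loadChars(text: string): AsyncGenerator<string> {
  for (const char of text) yield char;
}

async function* batch<T>(source: AsyncIterable<T>, size: number): AsyncGenerator<T[]> {
  let block: T[] = [];
  for await (const item of source) {
    block.push(item);
    if (block.length === size) {
      yield block;
      block = [];
    }
  }
  if (block.length > 0) yield block; // the last block may be shorter (end-of-file behavior)
}

// e.g.: for await (const block of batch(loadChars(text), blockSize))
//         yield tokenize(block.join(""), tokenizer)  // hypothetical tokenize over a block
```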

discojs-node/src/loaders/text.ts (resolved)
discojs-node/src/loaders/text.ts (resolved)
cli/src/benchmark_gpt.ts (resolved)
Comment on lines +68 to +72
const expectedTokens = models.tokenize(tokenizer, text, {
  padding: false,
  truncation: false,
  return_tensor: false
})
tharvik (Collaborator):
Yeah, better to test for the actual token values. It is not really an issue, but it assumes that models.tokenize will always work as expected. Also, I have built test codebases with a bunch of logic like this in the tests, and while it looks correct, it breaks quite easily.

Comment on lines +130 to +136
# Runs tests in parallel with matrix strategy https://docs.cypress.io/guides/guides/parallelization
# https://docs.github.com/en/actions/using-jobs/using-a-matrix-for-your-jobs
# Also see warning here https://github.com/cypress-io/github-action#parallel
strategy:
  fail-fast: false # https://github.com/cypress-io/github-action/issues/48
  matrix:
    containers: [1, 2] # Uses 2 parallel instances
tharvik (Collaborator):
Why is parallelism needed?

Comment on lines +12 to +15
log: (message) => {
  console.log(message)
  return null
},
tharvik (Collaborator):
debug?
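One way to read this suggestion (an assumption, not confirmed in the thread): route the Cypress task's logging through the debug package instead of console.log, for example:

```ts
import createDebug from "debug";

// assumed namespace, for illustration only
const debug = createDebug("webapp:cypress");

const tasks = {
  log: (message: string) => {
    debug(message);
    return null; // Cypress tasks must return a value (or null), not undefined
  },
};
```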

import { loadText } from "@epfml/discojs-web";
import DatasetInput from "./DatasetInput.vue";
import FileSelection from "./FileSelection.vue";
const { task } = defineProps<{ task: Task }>()
tharvik (Collaborator):
Destructuring props only works in Vue >= 3.5, but we are using Vue 3.4 here.
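A Vue 3.4-compatible alternative (a sketch; the import path for Task is assumed) is to keep the props object and read the field from it:

```ts
import type { Task } from "@epfml/discojs"; // assumed import path

// keep the props object instead of destructuring, so it works on Vue < 3.5
const props = defineProps<{ task: Task }>();
// use props.task wherever `task` was used
```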

import { loadText } from "@epfml/discojs-web";
import DatasetInput from "./DatasetInput.vue";
import FileSelection from "./FileSelection.vue";
const { task } = defineProps<{ task: Task }>()
tharvik (Collaborator):
I tried pretty hard to avoid referencing Task (the Ugly Global, as I call it in my sleep); hopefully, it won't be needed with the latest changes to loadText.

@@ -39,12 +40,12 @@
     against a chosen dataset of yours. Below, once you assessed the model,
     you can compare the ground truth and the predicted values
     <div class="flex justify-center mt-4">
-      <CustomButton @click="startTest()"> test </CustomButton>
+      <CustomButton @click="startTest()" data-cy="start-test"> test </CustomButton>
tharvik (Collaborator):
Yeah, I saw that a few times in Cypress' examples, but I don't like this pattern: it mixes test-specific code into the rest of the codebase. Especially here, it is possible to find the button by its "test" label instead.
Same idea below.
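For instance, Cypress can select the button by its visible label rather than a test-only attribute (a sketch using standard Cypress commands):

```ts
// select by the visible "test" label instead of a data-cy attribute
cy.contains("button", "test").click();

// instead of the attribute-based selector introduced in this PR:
// cy.get('[data-cy="start-test"]').click();
```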
