Support for split UTF-8 sequences #5

srogmann · 2024-07-07T22:00:39Z

I like your Llama3 implementation using the Vector API.

Here is a pull request to handle split UTF-8 sequences.

An example is the prompt "How to write 'three little cats' in chinese? Add an emoji.".
In this example the UTF-8 bytes of the cat emoji U+1F638 may be split by Llama-3 into 240, 159, 152 in the first event and the missing 184 in the next event.
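
To illustrate the split described above, here is a minimal sketch that decodes the four bytes quoted in the example, once in full and once cut off after the third byte. The class name `SplitEmojiDemo` and the helper `decodePrefix` are hypothetical and are not part of the PR; the sketch only relies on the standard `String(byte[], int, int, Charset)` constructor, which substitutes U+FFFD for malformed input.

```java
import java.nio.charset.StandardCharsets;

final class SplitEmojiDemo {
    // UTF-8 encoding of U+1F638, using the byte values quoted in the example.
    static final byte[] CAT = {(byte) 240, (byte) 159, (byte) 152, (byte) 184};

    /** Decodes only the first len bytes of CAT, as a naive per-event decoder would. */
    static String decodePrefix(int len) {
        return new String(CAT, 0, len, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // All four bytes together decode to the single code point U+1F638.
        System.out.println(decodePrefix(4).codePointAt(0) == 0x1F638);
        // The first three bytes alone are malformed and produce a replacement character.
        System.out.println(decodePrefix(3).contains("\uFFFD"));
    }
}
```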

mukel · 2024-07-07T22:16:56Z

Thanks for the PR!
I was looking for a general fix that also works for streaming; I think this one only works when decoding full token sequences.
When streaming tokens, it's possible to get a partial codepoint. I think the fix should be something similar: hold the partial codepoint until it is complete and can be printed.
Also, the UTF-8 bytes cannot be trusted to be valid.
Will take a closer look tomorrow.

srogmann · 2024-07-08T11:29:30Z

> When streaming tokens, it's possible to get a partial codepoint

The byte-array in the fix is used to collect a partial codepoint to support streaming.

> Also, the UTF-8 bytes cannot be trusted to be valid.

I didn't encounter invalid UTF-8 bytes in my examples, so there is no check that bytes 2, 3, and 4 match the continuation bit pattern 0b10xxxxxx.
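
For reference, the missing continuation-byte check could look roughly like the sketch below. The class `Utf8Checks` and both method names are hypothetical and do not come from the PR. Note that this only tests the 0b10xxxxxx pattern; it does not reject overlong encodings or surrogate code points.

```java
final class Utf8Checks {
    /** True if b has the form 0b10xxxxxx, i.e. it is a UTF-8 continuation byte. */
    static boolean isContinuation(byte b) {
        return (b & 0b1100_0000) == 0b1000_0000;
    }

    /** True if every byte after the lead byte in buf[0..len) is a continuation byte. */
    static boolean tailIsValid(byte[] buf, int len) {
        for (int i = 1; i < len; i++) {
            if (!isContinuation(buf[i])) {
                return false;
            }
        }
        return true;
    }
}
```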

srogmann · 2024-07-10T21:24:30Z

I was wondering if using a record array could be an alternative to the if-chain:

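// A lead byte b starts a sequence of len bytes if (b & mask) == pattern.
record Utf8Mask(int mask, int pattern, int len) {
    static final Utf8Mask[] MASKS = {
            new Utf8Mask(0b11100000, 0b11000000, 2), // 110xxxxx: 2-byte sequence
            new Utf8Mask(0b11110000, 0b11100000, 3), // 1110xxxx: 3-byte sequence
            new Utf8Mask(0b11111000, 0b11110000, 4)  // 11110xxx: 4-byte sequence
    };
}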

[...]

                for (Utf8Mask utf8Mask : Utf8Mask.MASKS) {
                    if ((b & utf8Mask.mask()) == utf8Mask.pattern()) {
                        currUtf8Mask = utf8Mask;
                        bufUtf8[currUtf8Index++] = b;
                        continue loopDecoded;
                    }
                }

patch_record_Utf8Mask.txt
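
A self-contained variant of the record above, for trying the lookup in isolation. The static helper `forLeadByte` is a hypothetical addition for illustration and is not part of the attached patch; the mask table is copied from the snippet.

```java
record Utf8Mask(int mask, int pattern, int len) {
    static final Utf8Mask[] MASKS = {
            new Utf8Mask(0b11100000, 0b11000000, 2),
            new Utf8Mask(0b11110000, 0b11100000, 3),
            new Utf8Mask(0b11111000, 0b11110000, 4)
    };

    /**
     * Hypothetical helper: returns the entry whose pattern matches the lead byte b,
     * or null if b is ASCII, a continuation byte, or otherwise not a multi-byte lead.
     */
    static Utf8Mask forLeadByte(byte b) {
        for (Utf8Mask m : MASKS) {
            if ((b & m.mask()) == m.pattern()) {
                return m;
            }
        }
        return null;
    }
}
```

With this, `Utf8Mask.forLeadByte((byte) 240).len()` gives the number of bytes to collect before the cat emoji from the example can be printed.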

mukel · 2024-11-12T16:06:21Z

I looked at this and I think it is better to handle it externally, e.g. in the consumer of the tokens.
The idea is: instead of writing tokens one by one during streaming, use a stateful TokenDecoder into which tokens are pushed one by one and out of which comes a String of fully "completed" characters (possibly empty, if the current sequence is not yet complete). This will also handle malformed UTF-8 sequences. I already have a rough prototype.
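
One possible shape for such a stateful decoder, sketched on top of the standard `java.nio.charset.CharsetDecoder`. This is not the prototype mentioned above: the class name `StreamingUtf8Decoder` and the methods `push` and `finish` are assumptions made for illustration, and the sketch takes the raw bytes of each token as input rather than token ids.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class StreamingUtf8Decoder {
    // Malformed bytes are replaced with U+FFFD instead of raising an error.
    private final CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPLACE)
            .onUnmappableCharacter(CodingErrorAction.REPLACE);

    // Bytes of an incomplete trailing sequence, carried over to the next push.
    private byte[] leftover = new byte[0];

    /** Feeds the bytes of one token; returns only the characters completed so far. */
    String push(byte[] tokenBytes) {
        ByteBuffer in = ByteBuffer.allocate(leftover.length + tokenBytes.length);
        in.put(leftover).put(tokenBytes).flip();
        // UTF-8 never yields more chars than input bytes, so this cannot overflow.
        CharBuffer out = CharBuffer.allocate(in.remaining());
        decoder.decode(in, out, false); // false: more input may follow
        leftover = new byte[in.remaining()];
        in.get(leftover);
        out.flip();
        return out.toString();
    }

    /** Ends the stream; a dangling partial sequence becomes a replacement character. */
    String finish() {
        ByteBuffer in = ByteBuffer.wrap(leftover);
        CharBuffer out = CharBuffer.allocate(leftover.length + 1);
        decoder.decode(in, out, true); // true: no more input
        decoder.flush(out);
        decoder.reset();
        leftover = new byte[0];
        out.flip();
        return out.toString();
    }
}
```

With the bytes from the original example, `push` returns an empty string for 240, 159, 152 and the complete emoji once 184 arrives. The consumer prints whatever `push` returns and calls `finish` once generation ends.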

Support for split UTF-8 sequences.

05bbe1f

mukel self-assigned this Jul 7, 2024

mukel closed this Nov 12, 2024
