Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for split UTF-8 sequences #5

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

srogmann
Copy link

@srogmann srogmann commented Jul 7, 2024

Hi @mukel,

I like your Llama3 implementation using the Vector API.

Here is a pull request to handle split UTF-8 sequences.

An example is the prompt "How to write 'three little cats' in chinese? Add an emoji.".
In this example the UTF-8 bytes of the cat emoji U+1F638 may be split by Llama-3 into 240, 159, 152 in the first event and the missing 184 in the next event.

@mukel mukel self-assigned this Jul 7, 2024
@mukel
Copy link
Owner

mukel commented Jul 7, 2024

Thanks for the PR!
I was looking for a general fix that worked also for streaming; I think this only works for decoding of full token sequences.
When streaming tokens, it's possible to get a partial codepoint, I think the fix should be something similar, hold the partial codepoint until it is complete and can be printed.
Also, the UTF-8 bytes cannot be trusted to be valid.
Will take a closer look tomorrow.

@srogmann
Copy link
Author

srogmann commented Jul 8, 2024

When streaming tokens, it's possible to get a partial codepoint

The byte-array in the fix is used to collect a partial codepoint to support streaming.

Also, the UTF-8 bytes cannot be trusted to be valid.

I hadn't wrong UTF-8 bytes in my examples, so there is no check for bit-mask 0b10...... in bytes 2, 3, 4.

@srogmann
Copy link
Author

I was wondering if using a record array could be an alternative to the if-chain:

record Utf8Mask(int mask, int pattern, int len) {
    static final Utf8Mask[] MASKS = {
            new Utf8Mask(0b11100000, 0b11000000, 2),
            new Utf8Mask(0b11110000, 0b11100000, 3),
            new Utf8Mask(0b11111000, 0b11110000, 4)
    };
}

[...]

                for (Utf8Mask utf8Mask : Utf8Mask.MASKS) {
                    if ((b & utf8Mask.mask()) == utf8Mask.pattern()) {
                        currUtf8Mask = utf8Mask;
                        bufUtf8[currUtf8Index++] = b;
                        continue loopDecoded;
                    }
                }

patch_record_Utf8Mask.txt

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants