
Integrate tokenization utilities #18

Open
benlebrun opened this issue Jun 27, 2024 · 7 comments

@benlebrun benlebrun self-assigned this Jun 27, 2024
@timvieira
Collaborator

Syncode's strategy looks pretty haphazard.

def get_vocab_from_tokenizer(tokenizer):

@benlebrun
Collaborator Author

benlebrun commented Jul 11, 2024

Just pushed a first draft for this in f79dd57 (I referenced the wrong issue number in the commit message). This code is based on https://github.com/hudson-ai/guidance/blob/main/guidance/models/transformers/_transformers.py.

@benlebrun
Collaborator Author

The key idea in the guidance code is to convert the strings returned by tokenizer.convert_ids_to_tokens into byte strings using a byte_decoder. This handles all the weird conventions tokenizers use (e.g., Ġ to encode ' ') by mapping the placeholder characters to the correct byte values, e.g., bytes([byte_decoder[c] for c in 'Ġcan']).decode() == ' can'. This covers a lot of special cases which we hadn't hard-coded ourselves in our current implementation.
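As a rough sketch of that conversion (assuming the slow GPT-2 tokenizer from transformers, which exposes byte_decoder; the helper name token_id_to_bytes is mine):

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
byte_decoder = tokenizer.byte_decoder  # maps the vocab's placeholder characters back to byte values

def token_id_to_bytes(token_id):
    # Look up the vocab string for the id, then map each character to its raw byte.
    return bytes([byte_decoder[c] for c in tokenizer.convert_ids_to_tokens(token_id)])

# The Ġ convention encodes a leading space, so the token for ' can' round-trips correctly.
assert token_id_to_bytes(tokenizer.encode(' can')[0]).decode() == ' can'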

However, the byte decoding idea leads to two issues:

  1. bytes([byte_decoder[c] for c in tokenizer.convert_ids_to_tokens(token_id)]).decode('utf-8') raises UnicodeDecodeError for some token_ids.
  2. Tokenizers contain many duplicate tokens (two ids with the same surface form), and we need a principled way to handle them. Our previous method was kind of hacky, and I am not sure it can be reused in this case.

@timvieira
Collaborator

timvieira commented Jul 11, 2024

Do the surface form collisions affect guidance? Maybe they resolve them somewhere else in the code? That codebase is so giant I don't really know where to start with it...

@benlebrun
Collaborator Author

guidance tracks the duplicates (see here). They also move all probability from the duplicate positions to their "primary index" (see here). The duplicates are only a problem for encoding, and guidance seems to encode their byte strings using the tokenizer's encode method (see here).
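For reference, a rough sketch of that probability-moving step (not guidance's actual code; collapse_duplicates and the duplicates dict, mapping duplicate id to primary id, are hypothetical names):

import numpy as np

def collapse_duplicates(probs, duplicates):
    # probs: next-token probabilities, shape (vocab_size,)
    # duplicates: dict mapping each duplicate token id to its primary id
    probs = probs.copy()
    for dup_id, primary_id in duplicates.items():
        probs[primary_id] += probs[dup_id]  # move the duplicate's mass onto the primary id
        probs[dup_id] = 0.0
    return probs

# Toy example: mass at id 3 moves onto id 1.
print(collapse_duplicates(np.array([0.1, 0.2, 0.3, 0.4]), {3: 1}))  # -> [0.1, 0.6, 0.3, 0.0]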

@benlebrun
Collaborator Author

By "primary index" for a surface form, guidance seems to mean the smallest token_id which is decoded to that surface form. I am not sure this is correct---the primary index should be the token_id which tokenizer.encode maps the string token to (the canonical token_id), but that is not always the smallest token_id. For example, vocab[165] == vocab[2634] == 'é', but gpt2_tokenizer.encode('é') == [2634].

@benlebrun
Collaborator Author

benlebrun commented Jul 11, 2024

Related to issue 1: it seems that not all byte tokens are immediately decodable as UTF-8, since some need additional context, either before or after. The following is an example of a string which (a) has tokens that cannot be immediately decoded and (b) has prefixes that cannot be immediately decoded, but which is decodable as a whole after a round-trip:

s = "’•¶∂ƒ˙∆£Ħ爨ൠᅘ∰፨"
byte_decoder = gpt2_tokenizer.byte_decoder
reconstructed = b""
for i in gpt2_tokenizer.encode(s):
    token_bytes = bytes([byte_decoder[c] for c in gpt2_tokenizer.convert_ids_to_tokens(i)])
    reconstructed += token_bytes
    
    try:
        print(f'TOKEN {token_bytes.decode()}')
    except UnicodeDecodeError:
        print(f'Failed {token_bytes}')
    
    try:
        print(f'PREFIX {reconstructed.decode()}')
    except UnicodeDecodeError:
        print(f'Failed {reconstructed}')

assert reconstructed.decode() == s # PASSES

Prints:

Failed b'\xe2\x80'
Failed b'\xe2\x80'
Failed b'\x99'
PREFIX ’
TOKEN •
PREFIX ’•
TOKEN ¶
PREFIX ’•¶
Failed b'\xe2\x88'
Failed b'\xe2\x80\x99\xe2\x80\xa2\xc2\xb6\xe2\x88'
Failed b'\x82'
PREFIX ’•¶∂
Failed b'\xc6'
Failed b'\xe2\x80\x99\xe2\x80\xa2\xc2\xb6\xe2\x88\x82\xc6'
Failed b'\x92'
PREFIX ’•¶∂ƒ
Failed b'\xcb'
Failed b'\xe2\x80\x99\xe2\x80\xa2\xc2\xb6\xe2\x88\x82\xc6\x92\xcb'
Failed b'\x99'
PREFIX ’•¶∂ƒ˙
Failed b'\xe2\x88'
Failed b'\xe2\x80\x99\xe2\x80\xa2\xc2\xb6\xe2\x88\x82\xc6\x92\xcb\x99\xe2\x88'
Failed b'\x86'
PREFIX ’•¶∂ƒ˙∆
TOKEN £
PREFIX ’•¶∂ƒ˙∆£
Failed b'\xc4'
Failed b'\xe2\x80\x99\xe2\x80\xa2\xc2\xb6\xe2\x88\x82\xc6\x92\xcb\x99\xe2\x88\x86\xc2\xa3\xc4'
Failed b'\xa6'
PREFIX ’•¶∂ƒ˙∆£Ħ
Failed b'\xe7'
Failed b'\xe2\x80\x99\xe2\x80\xa2\xc2\xb6\xe2\x88\x82\xc6\x92\xcb\x99\xe2\x88\x86\xc2\xa3\xc4\xa6\xe7'
Failed b'\x88'
Failed b'\xe2\x80\x99\xe2\x80\xa2\xc2\xb6\xe2\x88\x82\xc6\x92\xcb\x99\xe2\x88\x86\xc2\xa3\xc4\xa6\xe7\x88'
Failed b'\xa8'
PREFIX ’•¶∂ƒ˙∆£Ħ爨
Failed b'\xe0'
Failed b'\xe2\x80\x99\xe2\x80\xa2\xc2\xb6\xe2\x88\x82\xc6\x92\xcb\x99\xe2\x88\x86\xc2\xa3\xc4\xa6\xe7\x88\xa8\xe0'
Failed b'\xb5'
Failed b'\xe2\x80\x99\xe2\x80\xa2\xc2\xb6\xe2\x88\x82\xc6\x92\xcb\x99\xe2\x88\x86\xc2\xa3\xc4\xa6\xe7\x88\xa8\xe0\xb5'
Failed b'\xa0'
PREFIX ’•¶∂ƒ˙∆£Ħ爨ൠ
Failed b'\xe1'
Failed b'\xe2\x80\x99\xe2\x80\xa2\xc2\xb6\xe2\x88\x82\xc6\x92\xcb\x99\xe2\x88\x86\xc2\xa3\xc4\xa6\xe7\x88\xa8\xe0\xb5\xa0\xe1'
Failed b'\x85'
Failed b'\xe2\x80\x99\xe2\x80\xa2\xc2\xb6\xe2\x88\x82\xc6\x92\xcb\x99\xe2\x88\x86\xc2\xa3\xc4\xa6\xe7\x88\xa8\xe0\xb5\xa0\xe1\x85'
Failed b'\x98'
PREFIX ’•¶∂ƒ˙∆£Ħ爨ൠᅘ
Failed b'\xe2\x88'
Failed b'\xe2\x80\x99\xe2\x80\xa2\xc2\xb6\xe2\x88\x82\xc6\x92\xcb\x99\xe2\x88\x86\xc2\xa3\xc4\xa6\xe7\x88\xa8\xe0\xb5\xa0\xe1\x85\x98\xe2\x88'
Failed b'\xb0'
PREFIX ’•¶∂ƒ˙∆£Ħ爨ൠᅘ∰
Failed b'\xe1'
Failed b'\xe2\x80\x99\xe2\x80\xa2\xc2\xb6\xe2\x88\x82\xc6\x92\xcb\x99\xe2\x88\x86\xc2\xa3\xc4\xa6\xe7\x88\xa8\xe0\xb5\xa0\xe1\x85\x98\xe2\x88\xb0\xe1'
Failed b'\x8d'
Failed b'\xe2\x80\x99\xe2\x80\xa2\xc2\xb6\xe2\x88\x82\xc6\x92\xcb\x99\xe2\x88\x86\xc2\xa3\xc4\xa6\xe7\x88\xa8\xe0\xb5\xa0\xe1\x85\x98\xe2\x88\xb0\xe1\x8d'
Failed b'\xa8'
PREFIX ’•¶∂ƒ˙∆£Ħ爨ൠᅘ∰፨
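One way around issue 1 (a sketch, not from the code above) is to feed the token bytes through an incremental UTF-8 decoder, which buffers incomplete multi-byte sequences instead of raising:

import codecs
from transformers import GPT2Tokenizer

gpt2_tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
byte_decoder = gpt2_tokenizer.byte_decoder

s = "’•¶∂ƒ˙∆£Ħ爨ൠᅘ∰፨"
decoder = codecs.getincrementaldecoder("utf-8")()
decoded = ""
for i in gpt2_tokenizer.encode(s):
    token_bytes = bytes([byte_decoder[c] for c in gpt2_tokenizer.convert_ids_to_tokens(i)])
    decoded += decoder.decode(token_bytes)  # emits only the characters completed so far

decoded += decoder.decode(b"", final=True)  # flush; raises if the stream ends mid-character
assert decoded == s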
