
Integrate tokenization utilities #18

Open
benlebrun opened this issue Jun 27, 2024 · 7 comments

@benlebrun benlebrun self-assigned this Jun 27, 2024
@timvieira
Collaborator

Syncode's strategy looks pretty haphazard.

def get_vocab_from_tokenizer(tokenizer):

@benlebrun
Collaborator Author

benlebrun commented Jul 11, 2024

Just pushed a first draft for this in f79dd57 (I referenced the wrong issue number in the commit message). This code is based on https://github.com/hudson-ai/guidance/blob/main/guidance/models/transformers/_transformers.py.

@benlebrun
Collaborator Author

The key idea in the guidance code is to convert the strings returned by tokenizer.convert_ids_to_tokens into byte strings using a byte_decoder. This handles all the weird conventions tokenizers use (e.g., Ġ to encode ' ') by mapping the placeholder characters to the correct byte values, e.g., bytes([byte_decoder[c] for c in 'Ġcan']).decode() == ' can'. This covers a lot of special cases which we hadn't hard-coded ourselves in our current implementation.
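As a rough sketch of that conversion (assuming the slow GPT-2 tokenizer from transformers, which exposes byte_decoder; the helper name token_id_to_bytes is mine):

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
byte_decoder = tokenizer.byte_decoder  # maps the vocab's placeholder characters back to byte values

def token_id_to_bytes(token_id):
    # Look up the vocab string for the id, then map each character to its raw byte.
    return bytes([byte_decoder[c] for c in tokenizer.convert_ids_to_tokens(token_id)])

# The Ġ convention encodes a leading space, so the token for ' can' round-trips correctly.
assert token_id_to_bytes(tokenizer.encode(' can')[0]).decode() == ' can'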

However, the byte decoding idea leads to two issues:

  1. bytes([byte_decoder[c] for c in tokenizer.convert_ids_to_tokens(token_id)]).decode('utf-8') raises UnicodeDecodeError for some token_ids.
  2. Tokenizers contain many duplicate tokens (two ids with the same surface form), and we need a principled way to handle them. Our previous method was kind of hacky, and I am not sure it can be reused in this case.

@timvieira
Collaborator

timvieira commented Jul 11, 2024

Do the surface form collisions affect guidance? Maybe they resolve them somewhere else in the code? That codebase is so giant I don't really know where to start with it...

@benlebrun
Collaborator Author

guidance tracks the duplicates (see here). They also move all probability from the duplicate positions to their "primary index" (see here). The duplicates are only a problem for encoding, and guidance seems to encode their byte strings using the tokenizer's encode method (see here).
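For reference, a rough sketch of that probability-moving step (not guidance's actual code; collapse_duplicates and the duplicates dict, mapping duplicate id to primary id, are hypothetical names):

import numpy as np

def collapse_duplicates(probs, duplicates):
    # probs: next-token probabilities, shape (vocab_size,)
    # duplicates: dict mapping each duplicate token id to its primary id
    probs = probs.copy()
    for dup_id, primary_id in duplicates.items():
        probs[primary_id] += probs[dup_id]  # move the duplicate's mass onto the primary id
        probs[dup_id] = 0.0
    return probs

# Toy example: mass at id 3 moves onto id 1.
print(collapse_duplicates(np.array([0.1, 0.2, 0.3, 0.4]), {3: 1}))  # -> [0.1, 0.6, 0.3, 0.0]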

@benlebrun
Collaborator Author

By "primary index" for a surface form, guidance seems to mean the smallest token_id which is decoded to that surface form. I am not sure this is correct---the primary index should be the token_id which tokenizer.encode maps the string token to (the canonical token_id), but that is not always the smallest token_id. For example, vocab[165] == vocab[2634] == 'é', but gpt2_tokenizer.encode('é') == [2634].

@benlebrun
Collaborator Author

benlebrun commented Jul 11, 2024

Related to issue 1: it seems that not all byte tokens are immediately decodable as UTF-8, since some need additional context, either before or after. The following is an example of a string which (a) has tokens that cannot be immediately decoded and (b) has prefixes that cannot be immediately decoded, but which is decodable as a whole after a round-trip:

s = "’•¶∂ƒ˙∆£Ħ爨ൠᅘ∰፨"
byte_decoder = gpt2_tokenizer.byte_decoder
reconstructed = b""
for i in gpt2_tokenizer.encode(s):
    token_bytes = bytes([byte_decoder[c] for c in gpt2_tokenizer.convert_ids_to_tokens(i)])
    reconstructed += token_bytes
    
    try:
        print(f'TOKEN {token_bytes.decode()}')
    except UnicodeDecodeError:
        print(f'Failed {token_bytes}')
    
    try:
        print(f'PREFIX {reconstructed.decode()}')
    except UnicodeDecodeError:
        print(f'Failed {reconstructed}')

assert reconstructed.decode() == s # PASSES

Prints:

Failed b'\xe2\x80'
Failed b'\xe2\x80'
Failed b'\x99'
PREFIX ’
TOKEN •
PREFIX ’•
TOKEN ¶
PREFIX ’•¶
Failed b'\xe2\x88'
Failed b'\xe2\x80\x99\xe2\x80\xa2\xc2\xb6\xe2\x88'
Failed b'\x82'
PREFIX ’•¶∂
Failed b'\xc6'
Failed b'\xe2\x80\x99\xe2\x80\xa2\xc2\xb6\xe2\x88\x82\xc6'
Failed b'\x92'
PREFIX ’•¶∂ƒ
Failed b'\xcb'
Failed b'\xe2\x80\x99\xe2\x80\xa2\xc2\xb6\xe2\x88\x82\xc6\x92\xcb'
Failed b'\x99'
PREFIX ’•¶∂ƒ˙
Failed b'\xe2\x88'
Failed b'\xe2\x80\x99\xe2\x80\xa2\xc2\xb6\xe2\x88\x82\xc6\x92\xcb\x99\xe2\x88'
Failed b'\x86'
PREFIX ’•¶∂ƒ˙∆
TOKEN £
PREFIX ’•¶∂ƒ˙∆£
Failed b'\xc4'
Failed b'\xe2\x80\x99\xe2\x80\xa2\xc2\xb6\xe2\x88\x82\xc6\x92\xcb\x99\xe2\x88\x86\xc2\xa3\xc4'
Failed b'\xa6'
PREFIX ’•¶∂ƒ˙∆£Ħ
Failed b'\xe7'
Failed b'\xe2\x80\x99\xe2\x80\xa2\xc2\xb6\xe2\x88\x82\xc6\x92\xcb\x99\xe2\x88\x86\xc2\xa3\xc4\xa6\xe7'
Failed b'\x88'
Failed b'\xe2\x80\x99\xe2\x80\xa2\xc2\xb6\xe2\x88\x82\xc6\x92\xcb\x99\xe2\x88\x86\xc2\xa3\xc4\xa6\xe7\x88'
Failed b'\xa8'
PREFIX ’•¶∂ƒ˙∆£Ħ爨
Failed b'\xe0'
Failed b'\xe2\x80\x99\xe2\x80\xa2\xc2\xb6\xe2\x88\x82\xc6\x92\xcb\x99\xe2\x88\x86\xc2\xa3\xc4\xa6\xe7\x88\xa8\xe0'
Failed b'\xb5'
Failed b'\xe2\x80\x99\xe2\x80\xa2\xc2\xb6\xe2\x88\x82\xc6\x92\xcb\x99\xe2\x88\x86\xc2\xa3\xc4\xa6\xe7\x88\xa8\xe0\xb5'
Failed b'\xa0'
PREFIX ’•¶∂ƒ˙∆£Ħ爨ൠ
Failed b'\xe1'
Failed b'\xe2\x80\x99\xe2\x80\xa2\xc2\xb6\xe2\x88\x82\xc6\x92\xcb\x99\xe2\x88\x86\xc2\xa3\xc4\xa6\xe7\x88\xa8\xe0\xb5\xa0\xe1'
Failed b'\x85'
Failed b'\xe2\x80\x99\xe2\x80\xa2\xc2\xb6\xe2\x88\x82\xc6\x92\xcb\x99\xe2\x88\x86\xc2\xa3\xc4\xa6\xe7\x88\xa8\xe0\xb5\xa0\xe1\x85'
Failed b'\x98'
PREFIX ’•¶∂ƒ˙∆£Ħ爨ൠᅘ
Failed b'\xe2\x88'
Failed b'\xe2\x80\x99\xe2\x80\xa2\xc2\xb6\xe2\x88\x82\xc6\x92\xcb\x99\xe2\x88\x86\xc2\xa3\xc4\xa6\xe7\x88\xa8\xe0\xb5\xa0\xe1\x85\x98\xe2\x88'
Failed b'\xb0'
PREFIX ’•¶∂ƒ˙∆£Ħ爨ൠᅘ∰
Failed b'\xe1'
Failed b'\xe2\x80\x99\xe2\x80\xa2\xc2\xb6\xe2\x88\x82\xc6\x92\xcb\x99\xe2\x88\x86\xc2\xa3\xc4\xa6\xe7\x88\xa8\xe0\xb5\xa0\xe1\x85\x98\xe2\x88\xb0\xe1'
Failed b'\x8d'
Failed b'\xe2\x80\x99\xe2\x80\xa2\xc2\xb6\xe2\x88\x82\xc6\x92\xcb\x99\xe2\x88\x86\xc2\xa3\xc4\xa6\xe7\x88\xa8\xe0\xb5\xa0\xe1\x85\x98\xe2\x88\xb0\xe1\x8d'
Failed b'\xa8'
PREFIX ’•¶∂ƒ˙∆£Ħ爨ൠᅘ∰፨
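One way around issue 1 (a sketch, not from the code above) is to feed the token bytes through an incremental UTF-8 decoder, which buffers incomplete multi-byte sequences instead of raising:

import codecs
from transformers import GPT2Tokenizer

gpt2_tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
byte_decoder = gpt2_tokenizer.byte_decoder

s = "’•¶∂ƒ˙∆£Ħ爨ൠᅘ∰፨"
decoder = codecs.getincrementaldecoder("utf-8")()
decoded = ""
for i in gpt2_tokenizer.encode(s):
    token_bytes = bytes([byte_decoder[c] for c in gpt2_tokenizer.convert_ids_to_tokens(i)])
    decoded += decoder.decode(token_bytes)  # emits only the characters completed so far

decoded += decoder.decode(b"", final=True)  # flush; raises if the stream ends mid-character
assert decoded == s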
