Integrate tokenization utilities #18
Syncode's strategy looks pretty haphazard.
Just pushed a first draft for this in f79dd57 (I referenced the wrong issue number in the commit msg). This code is based on https://github.com/hudson-ai/guidance/blob/main/guidance/models/transformers/_transformers.py.
The key idea with the guidance code is to convert the strings given by
However, the byte decoding idea leads to two issues:
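For reference, the `byte_decoder` used in the guidance code is the inverse of GPT-2's byte-to-unicode table, which maps every possible byte to a printable character so that tokens can be stored as ordinary strings. A self-contained sketch of that table (reimplemented here for illustration; the same mapping ships with GPT-2's original encoder and with `transformers`):

```python
def bytes_to_unicode():
    # GPT-2's byte<->unicode table: visibly printable bytes map to themselves;
    # the remaining bytes are shifted into unused code points starting at 256.
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))

byte_encoder = bytes_to_unicode()            # byte -> character
byte_decoder = {v: k for k, v in byte_encoder.items()}  # character -> byte
print(byte_decoder["Ġ"])  # → 32: 'Ġ' stands for the space byte
```

Under this mapping a space byte becomes `Ġ`, which is why GPT-2 token strings for space-prefixed words start with that character, and why recovering raw bytes requires running each token string through `byte_decoder` character by character.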
Do the surface form collisions affect
By "primary index" for a surface form,
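The comment above is truncated, but one plausible reading of "primary index" is: when several token ids decode to the same byte surface form, designate a canonical id per form. A hypothetical helper to make that concrete (`primary_indices` and the toy vocab are illustrations, not code from the draft):

```python
from collections import defaultdict

def primary_indices(id_to_bytes):
    # Group token ids by their byte surface form, then pick the lowest id
    # as the "primary index" for each form (one possible convention).
    by_surface = defaultdict(list)
    for tid, surface in id_to_bytes.items():
        by_surface[surface].append(tid)
    return {surface: min(ids) for surface, ids in by_surface.items()}

# Toy vocab with a collision: ids 2 and 5 share the same bytes b"ab".
vocab = {0: b"a", 2: b"ab", 5: b"ab"}
print(primary_indices(vocab))  # → {b'a': 0, b'ab': 2}
```

With such a map, a constrained-decoding mask could treat all ids of a colliding surface form identically, or route probability mass to the primary id only.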
Related to issue 1: it seems that not all byte tokens are immediately decodable as UTF-8 on their own:

```python
from transformers import GPT2Tokenizer

# Slow tokenizer; it exposes the byte_decoder attribute used below.
gpt2_tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

s = "’•¶∂ƒ˙∆£Ħ爨ൠᅘ∰፨"
byte_decoder = gpt2_tokenizer.byte_decoder
reconstructed = b""
for i in gpt2_tokenizer.encode(s):
    # Map each character of the token string back to its underlying byte.
    token_bytes = bytes([byte_decoder[c] for c in gpt2_tokenizer.convert_ids_to_tokens(i)])
    reconstructed += token_bytes
    try:
        print(f'TOKEN {token_bytes.decode()}')
    except UnicodeDecodeError:
        print(f'Failed {token_bytes}')
    try:
        print(f'PREFIX {reconstructed.decode()}')
    except UnicodeDecodeError:
        print(f'Failed {reconstructed}')
assert reconstructed.decode() == s  # PASSES
```

Prints:

```
Failed b'\xe2\x80'
Failed b'\xe2\x80'
Failed b'\x99'
PREFIX ’
TOKEN •
PREFIX ’•
TOKEN ¶
PREFIX ’•¶
Failed b'\xe2\x88'
Failed b'\xe2\x80\x99\xe2\x80\xa2\xc2\xb6\xe2\x88'
Failed b'\x82'
PREFIX ’•¶∂
Failed b'\xc6'
Failed b'\xe2\x80\x99\xe2\x80\xa2\xc2\xb6\xe2\x88\x82\xc6'
Failed b'\x92'
PREFIX ’•¶∂ƒ
Failed b'\xcb'
Failed b'\xe2\x80\x99\xe2\x80\xa2\xc2\xb6\xe2\x88\x82\xc6\x92\xcb'
Failed b'\x99'
PREFIX ’•¶∂ƒ˙
Failed b'\xe2\x88'
Failed b'\xe2\x80\x99\xe2\x80\xa2\xc2\xb6\xe2\x88\x82\xc6\x92\xcb\x99\xe2\x88'
Failed b'\x86'
PREFIX ’•¶∂ƒ˙∆
TOKEN £
PREFIX ’•¶∂ƒ˙∆£
Failed b'\xc4'
Failed b'\xe2\x80\x99\xe2\x80\xa2\xc2\xb6\xe2\x88\x82\xc6\x92\xcb\x99\xe2\x88\x86\xc2\xa3\xc4'
Failed b'\xa6'
PREFIX ’•¶∂ƒ˙∆£Ħ
Failed b'\xe7'
Failed b'\xe2\x80\x99\xe2\x80\xa2\xc2\xb6\xe2\x88\x82\xc6\x92\xcb\x99\xe2\x88\x86\xc2\xa3\xc4\xa6\xe7'
Failed b'\x88'
Failed b'\xe2\x80\x99\xe2\x80\xa2\xc2\xb6\xe2\x88\x82\xc6\x92\xcb\x99\xe2\x88\x86\xc2\xa3\xc4\xa6\xe7\x88'
Failed b'\xa8'
PREFIX ’•¶∂ƒ˙∆£Ħ爨
Failed b'\xe0'
Failed b'\xe2\x80\x99\xe2\x80\xa2\xc2\xb6\xe2\x88\x82\xc6\x92\xcb\x99\xe2\x88\x86\xc2\xa3\xc4\xa6\xe7\x88\xa8\xe0'
Failed b'\xb5'
Failed b'\xe2\x80\x99\xe2\x80\xa2\xc2\xb6\xe2\x88\x82\xc6\x92\xcb\x99\xe2\x88\x86\xc2\xa3\xc4\xa6\xe7\x88\xa8\xe0\xb5'
Failed b'\xa0'
PREFIX ’•¶∂ƒ˙∆£Ħ爨ൠ
Failed b'\xe1'
Failed b'\xe2\x80\x99\xe2\x80\xa2\xc2\xb6\xe2\x88\x82\xc6\x92\xcb\x99\xe2\x88\x86\xc2\xa3\xc4\xa6\xe7\x88\xa8\xe0\xb5\xa0\xe1'
Failed b'\x85'
Failed b'\xe2\x80\x99\xe2\x80\xa2\xc2\xb6\xe2\x88\x82\xc6\x92\xcb\x99\xe2\x88\x86\xc2\xa3\xc4\xa6\xe7\x88\xa8\xe0\xb5\xa0\xe1\x85'
Failed b'\x98'
PREFIX ’•¶∂ƒ˙∆£Ħ爨ൠᅘ
Failed b'\xe2\x88'
Failed b'\xe2\x80\x99\xe2\x80\xa2\xc2\xb6\xe2\x88\x82\xc6\x92\xcb\x99\xe2\x88\x86\xc2\xa3\xc4\xa6\xe7\x88\xa8\xe0\xb5\xa0\xe1\x85\x98\xe2\x88'
Failed b'\xb0'
PREFIX ’•¶∂ƒ˙∆£Ħ爨ൠᅘ∰
Failed b'\xe1'
Failed b'\xe2\x80\x99\xe2\x80\xa2\xc2\xb6\xe2\x88\x82\xc6\x92\xcb\x99\xe2\x88\x86\xc2\xa3\xc4\xa6\xe7\x88\xa8\xe0\xb5\xa0\xe1\x85\x98\xe2\x88\xb0\xe1'
Failed b'\x8d'
Failed b'\xe2\x80\x99\xe2\x80\xa2\xc2\xb6\xe2\x88\x82\xc6\x92\xcb\x99\xe2\x88\x86\xc2\xa3\xc4\xa6\xe7\x88\xa8\xe0\xb5\xa0\xe1\x85\x98\xe2\x88\xb0\xe1\x8d'
Failed b'\xa8'
PREFIX ’•¶∂ƒ˙∆£Ħ爨ൠᅘ∰፨
```
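One way to work around tokens that aren't immediately decodable is incremental UTF-8 decoding, which buffers an incomplete multi-byte sequence until later tokens complete it. A sketch using Python's stdlib `codecs` module (the chunks below mimic the `b'\xe2\x80'` / `b'\x99'` split seen in the output above):

```python
import codecs

# The incremental decoder buffers incomplete UTF-8 sequences instead of raising.
decoder = codecs.getincrementaldecoder("utf-8")()

chunks = [b"\xe2\x80", b"\x99", b"\xe2\x80\xa2"]  # "’" split across two tokens, then "•"
out = ""
for chunk in chunks:
    out += decoder.decode(chunk)  # returns "" while a sequence is still incomplete
out += decoder.decode(b"", final=True)  # flush; raises if bytes are left dangling
print(out)  # → ’•
```

Feeding each token's bytes through one persistent decoder yields text as soon as it becomes valid, without ever attempting to decode a partial code point in isolation.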
https://github.com/microsoft/semantic_parsing_with_constrained_lm/blob/main/src/semantic_parsing_with_constrained_lm/tokenization.py