DNA and tokens
For GENA_LM tokens:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('aglabx/dna_tokens', force_download=True, use_fast=True)
print(tokenizer.vocab_size)
tokenizer.tokenize(dna_data.upper())
For 16S tokens:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('aglabx/16S_1024_bpe_tokens', force_download=True, use_fast=True)
print(tokenizer.vocab_size)
tokenizer.tokenize(dna_data.upper())