
Add tokenizer #78

Open
wants to merge 27 commits into main
Conversation

Koeng101
Owner

This PR creates a tokenizer in the dnadesign lib. It is primarily for tokenizing amino acids for consumption by an LLM - in particular, llm.c.

@Koeng101
Owner Author

I'd like to make the shard-writer a little smaller and more specific: it should just receive tokens and write them, maybe as a concurrent process.
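
A minimal sketch of what a channel-fed shard writer could look like (in Go, since dnadesign is a Go library). The function name `writeShards`, the shard size, and the file naming are hypothetical, not taken from the PR:

```go
package main

import (
	"fmt"
	"os"
)

// writeShards is a hypothetical concurrent shard writer: it receives tokens
// over a channel and flushes them to fixed-size shard files until the channel
// is closed. Shard size and file naming are illustrative only.
func writeShards(tokens <-chan uint8, shardSize int, prefix string) error {
	buf := make([]uint8, 0, shardSize)
	shard := 0
	flush := func() error {
		if len(buf) == 0 {
			return nil
		}
		name := fmt.Sprintf("%s_%05d.bin", prefix, shard)
		if err := os.WriteFile(name, buf, 0o644); err != nil {
			return err
		}
		shard++
		buf = buf[:0]
		return nil
	}
	for tok := range tokens {
		buf = append(buf, tok)
		if len(buf) == shardSize {
			if err := flush(); err != nil {
				return err
			}
		}
	}
	return flush() // write whatever is left as the final shard
}

func main() {
	tokens := make(chan uint8, 1024)
	done := make(chan error, 1)
	go func() { done <- writeShards(tokens, 1<<20, "shard") }()

	// The tokenizer would feed tokens here; a tiny stand-in stream:
	for _, t := range []uint8{1, 2, 3, 4} {
		tokens <- t
	}
	close(tokens)
	if err := <-done; err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```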

I want to be able to encode Pfam in the lead-up to peptides: [PFAM][AA seq][EOS]. The idea here is that you could prompt with a Pfam token to predict the next tokens.
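
A rough illustration of that [PFAM][AA seq][EOS] layout with made-up token IDs; the real vocabulary is whatever the tokenizer in this PR defines:

```go
package main

import "fmt"

// Hypothetical token IDs; the actual vocabulary lives in the tokenizer.
const (
	tokEOS      uint8 = 0
	tokPfamBase uint8 = 128 // Pfam IDs mapped into the upper range, say
)

// aminoAcids maps a few residues to made-up token IDs for illustration.
var aminoAcids = map[rune]uint8{'A': 1, 'C': 2, 'D': 3, 'E': 4, 'G': 5, 'M': 6}

// encode prefixes the peptide tokens with a Pfam token and appends EOS:
// [PFAM][AA seq][EOS].
func encode(pfamID uint8, peptide string) []uint8 {
	out := []uint8{tokPfamBase + pfamID}
	for _, r := range peptide {
		out = append(out, aminoAcids[r])
	}
	return append(out, tokEOS)
}

func main() {
	fmt.Println(encode(7, "MAGCDE")) // [135 6 1 5 2 3 4 0]
}
```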

@Koeng101
Owner Author

Koeng101 commented Jul 4, 2024

According to https://www.biorxiv.org/content/10.1101/2024.06.06.597716v1.full.pdf: "Using the UniParc database with 250 million protein sequences, research on ESM [72] shows that the datasets UR50/S and UR50/D, with 45M and 65M unique sequences respectively, outperform Uniref100 in perplexity (PPL) on a ~670M parameter MLM model."

If you take a look at figure 1 from that paper, it shows quite significant diminishing returns from using anything beyond UniRef50. It notes later that UniRef90/50 are basically the best choices. This is interesting for training sparser models.

UniRef90 contains roughly 65B tokens. Encoded as uint8, that's about 60GB, and I bet I could shave off a little more by zstd-encoding it.
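
Quick arithmetic check: 65B tokens at one byte each is about 65 GB, or roughly 60 GiB, before compression. Below is a minimal sketch of zstd-compressing one raw token shard, assuming the github.com/klauspost/compress/zstd package and the hypothetical shard file names above; this is not necessarily what the PR would use:

```go
package main

import (
	"log"
	"os"

	"github.com/klauspost/compress/zstd"
)

// compressShard zstd-compresses one raw uint8 token shard on disk.
func compressShard(in, out string) error {
	data, err := os.ReadFile(in)
	if err != nil {
		return err
	}
	f, err := os.Create(out)
	if err != nil {
		return err
	}
	defer f.Close()
	enc, err := zstd.NewWriter(f)
	if err != nil {
		return err
	}
	if _, err := enc.Write(data); err != nil {
		enc.Close()
		return err
	}
	return enc.Close()
}

func main() {
	if err := compressShard("shard_00000.bin", "shard_00000.bin.zst"); err != nil {
		log.Fatal(err)
	}
}
```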
