
Questions relating to pre-training #196

Answered by tomsercu
vigneshvalliappan asked this question in Q&A

  1. Yes, in the end everything uses a vocab of size 33; you can treat the 33 as an implementation detail (it includes "system" special tokens and the mask token), while the actual sequences themselves only use part of the vocab, a subset of size 25: the 20 standard amino acids plus the ambiguous amino-acid codes (see the sketch after this list).
  2. Clustering is commonly done, e.g. for sequence search. In the case of pre-training, yes, it reduces near-duplicates and has a rebalancing effect.
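
A minimal sketch of point 1, assuming the fair-esm package (`pip install fair-esm`) and the ESM-1b alphabet; the toy sequence is arbitrary, and the exact token list can differ between alphabet versions:

```python
# Inspect the alphabet to see how the 33-token vocab breaks down:
# a few special tokens (<cls>, <pad>, <eos>, <unk>, <mask>, ...) plus
# the 20 standard amino acids and the ambiguous/non-standard residue codes.
import esm

alphabet = esm.Alphabet.from_architecture("ESM-1b")

print(len(alphabet.all_toks))   # 33 tokens in total
print(alphabet.all_toks)        # full token list: special tokens + amino-acid codes

# Indices of the special tokens the model uses internally.
print(alphabet.cls_idx, alphabet.padding_idx, alphabet.eos_idx, alphabet.mask_idx)

# Tokenizing an actual protein sequence only touches the amino-acid part
# of the vocab (plus the prepended <cls> / appended <eos> tokens).
batch_converter = alphabet.get_batch_converter()
_, _, tokens = batch_converter([("protein1", "MKTAYIAKQR")])
print(tokens)
```
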
