Description:
Currently, ChemDataReader uses a list to store the tokens from the token.txt file in memory. Each token's index is retrieved with a linear search through the list, resulting in O(n) time complexity per lookup. To be specific, we first check whether the token exists in the list and then retrieve its index, which gives 2×O(n) per token.
Problem:
This design causes the overall complexity of the preprocessing stage that encodes token indices from data.pkl to data.pt to be O(Dataset Size × Tokens per Input × 2 × Vocab Size). This becomes especially inefficient for datasets with large vocabularies, such as protein sequences using trigram tokens (vocab sizes > 8000).
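For reference, the current pattern looks roughly like the sketch below. The variable and function names, and the handling of unseen tokens, are assumptions for illustration, not the actual ChemDataReader code.

```python
# Illustrative sketch of the current list-based lookup (not the actual
# ChemDataReader internals).
tokens = ["[CLS]", "C", "O", "N"]  # in practice loaded from token.txt; list position == token index

def get_token_index(token):
    if token in tokens:             # first O(n) scan: membership check
        return tokens.index(token)  # second O(n) scan: locate the position
    tokens.append(token)            # assumed behaviour: unseen tokens are appended
    return len(tokens) - 1          # ...and receive the next free index
```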
Proposal:
Refactor to use a dict for token storage during preprocessing, with tokens as keys and their corresponding indices as values. This reduces each token lookup to O(1) on average, significantly improving preprocessing performance.
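A minimal sketch of the proposed dict-based storage, under the same naming assumptions as above:

```python
# Illustrative sketch of the proposed dict-based lookup: token -> index.
token_to_index = {tok: i for i, tok in enumerate(["[CLS]", "C", "O", "N"])}  # built from token.txt

def get_token_index(token):
    if token not in token_to_index:                  # O(1) average-case hash lookup
        token_to_index[token] = len(token_to_index)  # assumed behaviour: unseen tokens get the next index
    return token_to_index[token]                     # O(1) average-case retrieval
```

Since the dict would be built by enumerating the existing token list, each token should keep the same index it has today, so previously generated data.pt files should remain compatible.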
Benefits:
Improved time complexity:
From O(Dataset Size × Tokens per Input × 2 × Vocab Size) to O(Dataset Size × Tokens per Input).
Memory-efficient and order-preserving:
Since Python 3.7, dict preserves insertion order (PEP 468, What's New in Python 3.7), which is helpful if the order of tokens is important when saving/loading; a quick illustration follows below.
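A quick illustration of the order-preservation point (plain Python, independent of the project code):

```python
# On Python 3.7+, dict keys iterate in insertion order, so writing the
# vocabulary back out reproduces the original token order.
vocab = {}
for tok in ["C", "O", "N"]:
    vocab[tok] = len(vocab)

assert list(vocab) == ["C", "O", "N"]
# e.g. "\n".join(vocab) rewrites token.txt in the same order it was read
```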
@sfluegel05, I can raise a PR for this if approved.