
Optimize Token Indexing in Reader class by Using Dictionary for Constant-Time Lookup #81


Open
aditya0by0 opened this issue Apr 22, 2025 · 1 comment · May be fixed by #84
Assignees: aditya0by0
Labels: enhancement (New feature or request), priority: low (Issue with low priority)

Comments

aditya0by0 (Member) commented Apr 22, 2025

Description:

Currently, in ChemDataReader, we store the tokens from the token.txt file in memory as a list. Each token's index is retrieved by a linear search through that list, i.e. O(n) per lookup. To be precise, we first check whether the token exists in the list and then retrieve its index, so each lookup actually costs 2×O(n):

with open(self.token_path, "r") as pk:
    self.cache = [x.strip() for x in pk]  # vocabulary kept as a list
...
if str(token) not in self.cache:  # first O(n) scan: membership check
    self.cache.append(str(token))
return self.cache.index(str(token)) + EMBEDDING_OFFSET  # second O(n) scan

Problem:

This design causes the overall complexity of encoding token indices during the data.pkl → data.pt preprocessing stage to be:

O(Dataset Size × Tokens per Input × 2 × Vocab Size)

This becomes especially inefficient for datasets with large vocabularies, such as protein sequences using trigram tokens (vocab sizes > 8000).
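
To make the cost concrete, here is a quick, hypothetical micro-benchmark (not from the codebase) contrasting the worst-case O(n) list scan with an average-O(1) dict lookup at a vocabulary size of 8000:

import timeit

vocab = [f"tok{i}" for i in range(8000)]         # list, as in the current code
index = {tok: i for i, tok in enumerate(vocab)}  # dict alternative

# Worst case for the list: the token sits at the very end.
print(timeit.timeit(lambda: vocab.index("tok7999"), number=1_000))
print(timeit.timeit(lambda: index["tok7999"], number=1_000))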

Proposal:

Refactor to use a dict for token storage during preprocessing: tokens as keys, their corresponding indices as values. This reduces each token lookup to average-case O(1), significantly improving preprocessing performance.
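
A minimal sketch of what this could look like, assuming the current attribute names (self.token_path, self.cache) and the EMBEDDING_OFFSET constant; the actual implementation in the PR may differ:

with open(self.token_path, "r") as pk:
    # token -> index; dict lookups and inserts are O(1) on average
    self.cache = {x.strip(): i for i, x in enumerate(pk)}
...
token = str(token)
# setdefault returns the existing index, or inserts the token with the
# next free index (the current dict size) if it is missing -- one
# average-O(1) operation replacing two O(n) list scans.
return self.cache.setdefault(token, len(self.cache)) + EMBEDDING_OFFSET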

Benefits:

  • Improved time complexity:
    From
    O(Dataset Size × Tokens per Input × 2 × Vocab Size)
    to
    O(Dataset Size × Tokens per Input)

  • Order-preserving:
    Since Python 3.7, dict preserves insertion order as a language guarantee (see What's New in Python 3.7)
    — helpful if the order of tokens matters when saving/loading; see the sketch after this list.
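
For instance, a hedged sketch of how that ordering could be exploited when persisting the vocabulary (the _save_tokens name is hypothetical, not an existing method):

def _save_tokens(self):
    # Iterating a dict yields keys in insertion order (Python 3.7+),
    # so this writes token.txt back in the original index order.
    with open(self.token_path, "w") as f:
        f.writelines(tok + "\n" for tok in self.cache)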

@sfluegel05, I can raise a PR for this if approved.

@aditya0by0 aditya0by0 added enhancement New feature or request priority: low Issue with low priority labels Apr 22, 2025
@aditya0by0 aditya0by0 self-assigned this Apr 22, 2025
aditya0by0 (Member, Author) commented

However, the dict-based approach does increase space complexity, since both the tokens and their corresponding indices must be held in memory.

That said, memory usage is typically not a concern during preprocessing, and the gains in lookup time justify this trade-off; a rough estimate follows below.
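
As a rough, hypothetical sanity check (exact numbers vary by Python version and platform), the dict structure for an 8000-token trigram vocabulary is small, on the order of a few hundred kilobytes excluding the string objects themselves:

import sys

vocab = {f"tok{i}": i for i in range(8000)}  # stand-in for 8000 trigram tokens
print(sys.getsizeof(vocab))  # size of the dict object alone, in bytes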

@aditya0by0 aditya0by0 linked a pull request Apr 25, 2025 that will close this issue