Description:
Currently, ChemDataReader uses a list to store the tokens from the token.txt file in memory. Each token's index is retrieved with a linear search through the list, resulting in O(n) time complexity per lookup. To be specific, we first check whether the token exists in the list and then retrieve its index, which gives 2×O(n) per token.
Problem:
This design causes the overall complexity of the preprocessing stage that encodes token indices from data.pkl to data.pt to be O(Dataset Size × Tokens per Input × 2 × Vocab Size). This becomes especially inefficient for datasets with large vocabularies, such as protein sequences using trigram tokens (vocab sizes > 8000).
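For reference, the current pattern looks roughly like the sketch below. The variable and function names, and the handling of unseen tokens, are assumptions for illustration, not the actual ChemDataReader code.

```python
# Illustrative sketch of the current list-based lookup (not the actual
# ChemDataReader internals).
tokens = ["[CLS]", "C", "O", "N"]  # in practice loaded from token.txt; list position == token index

def get_token_index(token):
    if token in tokens:             # first O(n) scan: membership check
        return tokens.index(token)  # second O(n) scan: locate the position
    tokens.append(token)            # assumed behaviour: unseen tokens are appended
    return len(tokens) - 1          # ...and receive the next free index
```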
Proposal:
Refactor to use a dict for token storage during preprocessing, with tokens as keys and their corresponding indices as values. This reduces each token lookup to O(1) on average, significantly improving preprocessing performance.
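A minimal sketch of the proposed dict-based storage, under the same naming assumptions as above:

```python
# Illustrative sketch of the proposed dict-based lookup: token -> index.
token_to_index = {tok: i for i, tok in enumerate(["[CLS]", "C", "O", "N"])}  # built from token.txt

def get_token_index(token):
    if token not in token_to_index:                  # O(1) average-case hash lookup
        token_to_index[token] = len(token_to_index)  # assumed behaviour: unseen tokens get the next index
    return token_to_index[token]                     # O(1) average-case retrieval
```

Since the dict would be built by enumerating the existing token list, each token should keep the same index it has today, so previously generated data.pt files should remain compatible.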
Benefits:
Improved time complexity:
From O(Dataset Size × Tokens per Input × 2 × Vocab Size) to O(Dataset Size × Tokens per Input).
Memory-efficient and order-preserving:
Since Python 3.7, dict preserves insertion order (PEP 468, What's New in Python 3.7), which is helpful if the order of tokens is important when saving/loading; a quick illustration follows below.
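A quick illustration of the order-preservation point (plain Python, independent of the project code):

```python
# On Python 3.7+, dict keys iterate in insertion order, so writing the
# vocabulary back out reproduces the original token order.
vocab = {}
for tok in ["C", "O", "N"]:
    vocab[tok] = len(vocab)

assert list(vocab) == ["C", "O", "N"]
# e.g. "\n".join(vocab) rewrites token.txt in the same order it was read
```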
@sfluegel05, I can raise a PR for this if approved.