featurize huge datasets #396

Open
frostedoyster opened this issue Dec 24, 2024 · 1 comment

Comments

@frostedoyster

At the moment, PCA seems to require the full n_structures x n_features feature matrix as an argument. For very large datasets, this matrix will not fit in memory.
It might be worth designing a custom PCA class that accumulates the n_features x n_features covariance matrix batch by batch. That matrix stays manageable and can be diagonalized once all structures have been processed. This should make it possible to explore huge datasets even on an ordinary laptop, taking advantage of batched evaluation (at the cost of a few hours of runtime).
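
To make the proposal concrete, here is a minimal sketch of covariance-accumulating PCA using only NumPy. The class name `StreamingPCA` and its methods are hypothetical and not part of chemiscope; the sketch assumes features arrive in batches of shape (batch_size, n_features).

```python
import numpy as np


class StreamingPCA:
    """Hypothetical helper (not a chemiscope API): PCA from running sums."""

    def __init__(self, n_components):
        self.n_components = n_components
        self.n_samples = 0
        self.sum_x = None    # running sum of features, shape (n_features,)
        self.sum_xtx = None  # running sum of X^T X, shape (n_features, n_features)

    def partial_fit(self, batch):
        """Accumulate statistics for one (batch_size, n_features) batch."""
        batch = np.asarray(batch, dtype=np.float64)
        if self.sum_x is None:
            n_features = batch.shape[1]
            self.sum_x = np.zeros(n_features)
            self.sum_xtx = np.zeros((n_features, n_features))
        self.n_samples += batch.shape[0]
        self.sum_x += batch.sum(axis=0)
        self.sum_xtx += batch.T @ batch
        return self

    def finalize(self):
        """Build the covariance matrix and diagonalize it once."""
        mean = self.sum_x / self.n_samples
        cov = self.sum_xtx - self.n_samples * np.outer(mean, mean)
        cov /= self.n_samples - 1
        # eigh returns eigenvalues in ascending order for symmetric matrices
        eigvals, eigvecs = np.linalg.eigh(cov)
        order = np.argsort(eigvals)[::-1][: self.n_components]
        self.mean_ = mean
        self.components_ = eigvecs[:, order].T
        self.explained_variance_ = eigvals[order]
        return self

    def transform(self, batch):
        """Project a batch onto the leading principal components."""
        batch = np.asarray(batch, dtype=np.float64)
        return (batch - self.mean_) @ self.components_.T
```

Memory use is bounded by one batch plus the n_features x n_features accumulator. Two passes over the data are needed: one calling `partial_fit` on every batch, and a second calling `transform` after `finalize`. The sum-of-products formula can lose precision when feature means are large compared to their spread, so centering with a rough mean estimate first may be worthwhile.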

@Luthaf
Contributor

Luthaf commented Jan 6, 2025

We (i.e. @sofiia-chorna) explored batched PCA a bit, but it was not better than full PCA at the time. If someone else wants to give it another go, feel free!

Another option that one can use immediately is a custom featurize function, which can use alternative algorithms for dimensionality reduction without any change to chemiscope.
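
As an illustration of that route, the sketch below builds a featurize-style callable that computes descriptors batch by batch and reduces them with a Gaussian random projection, so the full feature matrix is never held in memory. Everything here is an assumption made for illustration: the `(frames, environments)` signature, the user-supplied `describe` function that returns one feature row per structure, and the choice of random projection as the reduction algorithm. Check the chemiscope documentation for the exact interface a custom featurizer must follow.

```python
import numpy as np


def make_batched_featurizer(describe, n_features, n_components=2,
                            batch_size=1000, seed=0):
    """Return a featurize-style callable (hypothetical helper).

    `describe` is a user-supplied function (assumed here) mapping a list of
    frames to an array of shape (len(batch), n_features).
    """
    rng = np.random.default_rng(seed)
    # Fixed Gaussian random projection: needs no fitting pass over the data.
    projection = rng.normal(size=(n_features, n_components))
    projection /= np.sqrt(n_components)

    def featurize(frames, environments=None):
        # `environments` is ignored: this sketch only handles
        # per-structure features.
        reduced = []
        for start in range(0, len(frames), batch_size):
            batch = frames[start:start + batch_size]
            features = np.asarray(describe(batch), dtype=np.float64)
            # Only the reduced (len(batch), n_components) block is kept.
            reduced.append(features @ projection)
        return np.concatenate(reduced, axis=0)

    return featurize
```

A random projection only approximately preserves pairwise distances and does not pick out the directions of largest variance, so the resulting map will generally look different from a PCA map. The same wrapper could instead apply components fitted beforehand with any incremental method.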


2 participants