Using the high-dimensional embeddings stored by UMAP #793
zachschillaci27 started this conversation in Ideas
-
Yes, UMAP, due to its underlying algorithm and architecture, needs to save the input embeddings, which can make the model quite large. Although it is indeed possible to use those embeddings throughout BERTopic, this becomes a bit more difficult when you use something else, like PCA. In those cases, especially when it concerns online/incremental topic modeling, you do not want to save those embeddings, as they do not relate to the current state of the model. Similarly, we want to keep each step in the pipeline as separate as possible so that we do not rely too much upon, for example, the input language model.
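As an illustration of that decoupling (a minimal sketch, assuming a recent BERTopic release; the embedding model, dataset, and `n_components` are illustrative choices, not part of the discussion above), the dimensionality-reduction step can be swapped for PCA and the embeddings computed outside the model:

```python
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from sklearn.datasets import fetch_20newsgroups
from sklearn.decomposition import PCA

docs = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes"))["data"]

# Precompute the embeddings outside of BERTopic so the topic model does not
# need to hold on to the underlying language model or its output.
sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = sentence_model.encode(docs, show_progress_bar=False)

# Swap UMAP for PCA; unlike UMAP, PCA does not keep its raw training data,
# so the saved topic model stays much smaller.
topic_model = BERTopic(umap_model=PCA(n_components=5))
topics, probs = topic_model.fit_transform(docs, embeddings)
```

Passing precomputed embeddings to `fit_transform` also keeps the embedding step itself outside the saved model.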
-
Hi Maarten,
Thanks for the great work you've done creating BERTopic!
In some of my local testing on reducing model size, I noticed the UMAP subcomponent can use up quite a bit of space when trained on large quantities of data. I eventually discovered this is because UMAP requires storing the entirety of the raw dataset it is trained on (i.e. the high-dimensional sentence embeddings). Unfortunately, this seems to be a hard requirement of the UMAP algorithm.
This, however, got me thinking: if the high-dimensional embeddings are being saved by UMAP anyway, they may as well be reused in later steps of BERTopic. Have you considered this possibility before?
For reference, UMAP stores the raw data during model fitting here: https://github.com/lmcinnes/umap/blob/master/umap/umap_.py#L2321. Therefore, the embeddings can be accessed from any UMAP-trained BERTopic model via `bt.umap_model._raw_data`.
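For example (a minimal sketch; note that `_raw_data` is a private UMAP attribute and may change between versions, and the 20 newsgroups dataset is only illustrative):

```python
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

docs = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes"))["data"]

bt = BERTopic()
topics, probs = bt.fit_transform(docs)

# UMAP keeps a copy of the data it was fitted on. In BERTopic that is the
# high-dimensional sentence embeddings, one row per document.
embeddings = bt.umap_model._raw_data
print(embeddings.shape)  # (number of documents, embedding dimension)
```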
Best,
Zach