Using the high-dimensional embeddings stored by UMAP #793
zachschillaci27 started this conversation in Ideas
-
Yes, UMAP, due to its underlying algorithm and architecture, needs to save the input embeddings, which can make the model quite large. Although it is indeed possible to use those embeddings throughout BERTopic, this becomes a bit more difficult when you use something else, like PCA. In those cases, especially when it concerns online/incremental topic modeling, you do not want to save those embeddings, as they do not relate to the current state of the model. Similarly, we want to keep each step in the pipeline as separate as possible so that we do not rely too much upon, for example, the input language model.
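As an illustration of that decoupling (a minimal sketch, assuming a recent BERTopic release; the embedding model, dataset, and `n_components` are illustrative choices, not part of the discussion above), the dimensionality-reduction step can be swapped for PCA and the embeddings computed outside the model:

```python
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from sklearn.datasets import fetch_20newsgroups
from sklearn.decomposition import PCA

docs = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes"))["data"]

# Precompute the embeddings outside of BERTopic so the topic model does not
# need to hold on to the underlying language model or its output.
sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = sentence_model.encode(docs, show_progress_bar=False)

# Swap UMAP for PCA; unlike UMAP, PCA does not keep its raw training data,
# so the saved topic model stays much smaller.
topic_model = BERTopic(umap_model=PCA(n_components=5))
topics, probs = topic_model.fit_transform(docs, embeddings)
```

Passing precomputed embeddings to `fit_transform` also keeps the embedding step itself outside the saved model.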
-
Hi Maarten,
Thanks for the great work you've done creating BERTopic!
In some of my local testing on reducing model size, I noticed the UMAP subcomponent can use up quite a bit of space when trained on large quantities of data. I eventually discovered this is because UMAP requires storing the entirety of the raw dataset it is trained on (i.e. the high-dimensional sentence embeddings). Unfortunately, this seems to be a hard requirement of the UMAP algorithm.
This, however, got me thinking: if the high-dimensional embeddings are being saved by UMAP anyway, they may as well be reused in later steps of BERTopic. Have you considered this possibility before?
For reference, UMAP stores the raw data during model fitting here: https://github.com/lmcinnes/umap/blob/master/umap/umap_.py#L2321. Therefore, the embeddings can be accessed from any UMAP-trained BERTopic model via `bt.umap_model._raw_data`.
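For example (a minimal sketch; note that `_raw_data` is a private UMAP attribute and may change between versions, and the 20 newsgroups dataset is only illustrative):

```python
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

docs = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes"))["data"]

bt = BERTopic()
topics, probs = bt.fit_transform(docs)

# UMAP keeps a copy of the data it was fitted on. In BERTopic that is the
# high-dimensional sentence embeddings, one row per document.
embeddings = bt.umap_model._raw_data
print(embeddings.shape)  # (number of documents, embedding dimension)
```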
Best,
Zach