Zero topic distributions for some documents using approximate_distribution() #2150

Open
Labels
bug Something isn't working

Comments

@Connorwz

Connorwz commented Sep 14, 2024

Have you searched existing issues? 🔎

  • I have searched and found no existing issues

Describe the bug

Dear creators of BERTopic,
Thanks for your work; this package is amazing and I have been using it for a long time. However, I found that some documents (whether or not they were used to train the model) have zero topic distributions across all topics created by BERTopic after applying the approximate_distribution() function to them. In other words, some rows of the topic distribution matrix produced by approximate_distribution() sum to 0. The code below does several things: (1) builds a simple BERTopic model with PCA and KMeans (from cuML) as the dimensionality reduction and clustering techniques; (2) defines a splitting function to split documents and pre-calculated embeddings; (3) fits the model on the training data and computes topic distributions for both the training and test sets.

If more information is needed, please let me know. Thanks!

Reproduction

# Imports implied by the description above (PCA and KMeans come from cuML,
# embeddings are pre-calculated with sentence-transformers).
import numpy as np
import pandas as pd
from bertopic import BERTopic
from cuml.cluster import KMeans
from cuml.decomposition import PCA
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

def pk(num_cluster):
    # Simple BERTopic setup: PCA for dimensionality reduction, KMeans for clustering.
    embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
    umap_model = PCA(n_components=10)
    hdbscan_model = KMeans(n_clusters=num_cluster)
    vectorizer_model = CountVectorizer()
    topic_model = BERTopic(embedding_model=embedding_model, umap_model=umap_model,
                           hdbscan_model=hdbscan_model, vectorizer_model=vectorizer_model,
                           calculate_probabilities=False, verbose=True)
    return topic_model

def tr_te_split(documents, df, embeddings, i=1):
    # Split the documents, dataframe, and pre-calculated embeddings into train/test sets.
    indices = np.arange(len(documents))
    tr_ind, te_ind = train_test_split(indices, test_size=0.2, shuffle=True, random_state=i)
    tr_df = df.iloc[tr_ind, :]
    te_df = df.iloc[te_ind, :]
    tr_documents = [documents[ind] for ind in tr_ind]
    te_documents = [documents[ind] for ind in te_ind]
    tr_embeddings = embeddings[tr_ind, :]
    return tr_df, te_df, tr_documents, te_documents, tr_embeddings

def check_zero_exposure(arr):
    # Return 1 if any document's topic distribution sums to 0 (None otherwise).
    if 0 in arr.sum(axis=1):
        return 1

# year_list, df_folder, embeddings_folder, and cluster_num are defined elsewhere.
zero_exposures = {}
for year in year_list:
    df = pd.read_csv(df_folder + f"/contem_{year}_senti.csv")
    documents = df.documents.tolist()
    embeddings = np.load(embeddings_folder + f"/contem_{year}_senti_embeddings.npy")
    tr_df, te_df, tr_documents, te_documents, tr_embeddings = tr_te_split(documents, df, embeddings)
    tr_df.reset_index(drop=True, inplace=True)
    te_df.reset_index(drop=True, inplace=True)
    topic_model = pk(cluster_num)
    topic_model.fit(tr_documents, tr_embeddings)
    tr_topic_dist, _ = topic_model.approximate_distribution(tr_documents)
    te_topic_dist, _ = topic_model.approximate_distribution(te_documents)
    zero_exposure = [check_zero_exposure(tr_topic_dist), check_zero_exposure(te_topic_dist)]
    zero_exposures[year] = zero_exposure


# zero_exposures
# {2014: [1, 1],
#  2015: [1, 1],
#  2016: [1, 1],
#  2017: [1, 1],
#  2018: [1, 1],
#  2019: [1, 1],
#  2020: [1, 1],
#  2021: [1, 1],
#  2022: [1, 1],
#  2023: [1, 1]}

BERTopic Version

0.16.2

@Connorwz Connorwz added the bug Something isn't working label Sep 14, 2024
@MaartenGr
Owner

Have you tried looking at some of the hyperparameters of approximate_distribution? Since there are similarity metrics/values involved, it might help to look at whether you can reduce the minimum similarity necessary. You can find more about some of them here.
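For example, something along these lines might already make a difference (the exact value below is just illustrative, not a recommendation):

# Lowering min_similarity keeps more document-topic similarities above the threshold,
# which reduces the chance of an all-zero row in the returned distribution matrix.
tr_topic_dist, _ = topic_model.approximate_distribution(tr_documents, min_similarity=0.05)
te_topic_dist, _ = topic_model.approximate_distribution(te_documents, min_similarity=0.05)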

@Connorwz
Author

Thanks for your reply! However, may I ask why the minimum similarity affects my problem, where some documents have no probability for any of the clusters/topics created by the model?

@MaartenGr
Owner

Sure! You need a minimum similarity to decide which subset of topics is most related to your document. It allows you to filter the most related topics. By lowering the minimum similarity, you will get more topics related to the document (although their similarity values will not change).

@Connorwz
Author

Thanks for your explanations! So there is a mechanism within approximate_distribution() such that, if the similarities between a document and all topics are below the minimum similarity, it assigns zero probability to all of them. Besides, are the probabilities calculated as the weighted similarities of those topics whose similarity with the document is above the minimum similarity?

@MaartenGr
Owner

So there is a mechanism within approximate_distribution() such that, if the similarities between a document and all topics are below the minimum similarity, it assigns zero probability to all of them.

That's correct!

Besides, are the probabilities calculated as the weighted similarities of those topics whose similarity with the document is above the minimum similarity?

Yes! In practice, it calculates all the similarities and then simply ignores those that do not exceed the threshold, but the result is the same.
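
Roughly speaking, the mechanism looks like this (a conceptual sketch with made-up similarity values, not the actual implementation):

import numpy as np

# Hypothetical document-topic similarities for 3 documents and 4 topics.
similarities = np.array([
    [0.30, 0.20, 0.05, 0.02],   # two topics exceed the threshold
    [0.08, 0.07, 0.05, 0.01],   # nothing exceeds the threshold -> all-zero row
    [0.50, 0.12, 0.11, 0.03],
])
min_similarity = 0.1

# Similarities below the threshold are ignored; the kept values themselves do not change.
kept = np.where(similarities > min_similarity, similarities, 0.0)

# Normalizing each row gives the topic distribution. A document whose similarities
# all fall below the threshold keeps an all-zero row, which is what you observed.
row_sums = kept.sum(axis=1, keepdims=True)
distribution = np.divide(kept, row_sums, out=np.zeros_like(kept), where=row_sums > 0)
print(distribution.round(2))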
