Zero topic distributions for some documents using approximate_distribution() #2150

Open
Labels
bug Something isn't working

Comments

@Connorwz

Connorwz commented Sep 14, 2024

Have you searched existing issues? 🔎

  • I have searched and found no existing issues

Describe the bug

Dear creators of BERTopic,
Thanks for your work; this package is amazing and I have been using it for a long time. However, I found that some documents (whether or not they were used to train the model) have zero topic distributions across all topics created by BERTopic after applying the approximate_distribution() function to them. In other words, some rows of the topic distribution matrix produced by approximate_distribution() sum to 0. The code below does several things: (1) builds a simple BERTopic model with PCA and KMeans (from cuML) as the dimensionality reduction and clustering techniques; (2) defines a splitting function to split documents and pre-calculated embeddings; (3) fits the model on the training data and computes topic distributions for both the training and test sets.

If more information is needed, please let me know. Thanks!

Reproduction

# Imports implied by the description above (PCA and KMeans come from cuML,
# embeddings are pre-calculated with sentence-transformers).
import numpy as np
import pandas as pd
from bertopic import BERTopic
from cuml.cluster import KMeans
from cuml.decomposition import PCA
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

def pk(num_cluster):
    # Simple BERTopic setup: PCA for dimensionality reduction, KMeans for clustering.
    embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
    umap_model = PCA(n_components=10)
    hdbscan_model = KMeans(n_clusters=num_cluster)
    vectorizer_model = CountVectorizer()
    topic_model = BERTopic(embedding_model=embedding_model, umap_model=umap_model,
                           hdbscan_model=hdbscan_model, vectorizer_model=vectorizer_model,
                           calculate_probabilities=False, verbose=True)
    return topic_model

def tr_te_split(documents, df, embeddings, i=1):
    # Split the documents, dataframe, and pre-calculated embeddings into train/test sets.
    indices = np.arange(len(documents))
    tr_ind, te_ind = train_test_split(indices, test_size=0.2, shuffle=True, random_state=i)
    tr_df = df.iloc[tr_ind, :]
    te_df = df.iloc[te_ind, :]
    tr_documents = [documents[ind] for ind in tr_ind]
    te_documents = [documents[ind] for ind in te_ind]
    tr_embeddings = embeddings[tr_ind, :]
    return tr_df, te_df, tr_documents, te_documents, tr_embeddings

def check_zero_exposure(arr):
    # Return 1 if any document's topic distribution sums to 0 (None otherwise).
    if 0 in arr.sum(axis=1):
        return 1

# year_list, df_folder, embeddings_folder, and cluster_num are defined elsewhere.
zero_exposures = {}
for year in year_list:
    df = pd.read_csv(df_folder + f"/contem_{year}_senti.csv")
    documents = df.documents.tolist()
    embeddings = np.load(embeddings_folder + f"/contem_{year}_senti_embeddings.npy")
    tr_df, te_df, tr_documents, te_documents, tr_embeddings = tr_te_split(documents, df, embeddings)
    tr_df.reset_index(drop=True, inplace=True)
    te_df.reset_index(drop=True, inplace=True)
    topic_model = pk(cluster_num)
    topic_model.fit(tr_documents, tr_embeddings)
    tr_topic_dist, _ = topic_model.approximate_distribution(tr_documents)
    te_topic_dist, _ = topic_model.approximate_distribution(te_documents)
    zero_exposure = [check_zero_exposure(tr_topic_dist), check_zero_exposure(te_topic_dist)]
    zero_exposures[year] = zero_exposure


# zero_exposures
# {2014: [1, 1],
#  2015: [1, 1],
#  2016: [1, 1],
#  2017: [1, 1],
#  2018: [1, 1],
#  2019: [1, 1],
#  2020: [1, 1],
#  2021: [1, 1],
#  2022: [1, 1],
#  2023: [1, 1]}

BERTopic Version

0.16.2

@Connorwz Connorwz added the bug Something isn't working label Sep 14, 2024
@MaartenGr
Owner

Have you tried looking at some of the hyperparameters of approximate_distribution? Since there are similarity metrics/values involved, it might help to look at whether you can reduce the minimum similarity necessary. You can find more about some of them here.
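For example, something along these lines might already make a difference (the exact value below is just illustrative, not a recommendation):

# Lowering min_similarity keeps more document-topic similarities above the threshold,
# which reduces the chance of an all-zero row in the returned distribution matrix.
tr_topic_dist, _ = topic_model.approximate_distribution(tr_documents, min_similarity=0.05)
te_topic_dist, _ = topic_model.approximate_distribution(te_documents, min_similarity=0.05)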

@Connorwz
Author

Thanks for your reply! However, may I ask why the minimum similarity affects my problem, where some documents have no probability for any of the clusters/topics created by the model?

@MaartenGr
Owner

Sure! You need a minimum similarity to decide which subset of topics is most related to your document. It allows you to filter the most related topics. By lowering the minimum similarity, you will get more topics related to the document (although their similarity values will not change).

@Connorwz
Author

Thanks for your explanations! So there is a mechanism within approximate_distribution() such that, if the similarities between a document and all topics are below the minimum similarity, it assigns zero probability to all of them. Besides, are the probabilities calculated as the weighted similarities of those topics whose similarity with the document is above the minimum similarity?

@MaartenGr
Owner

So there is a mechanism within approximate_distribution() such that, if the similarities between a document and all topics are below the minimum similarity, it assigns zero probability to all of them.

That's correct!

Besides, are the probabilities calculated as the weighted similarities of those topics whose similarity with the document is above the minimum similarity?

Yes! In practice, it calculates all the similarities and then simply ignores those that do not exceed the threshold, but the result is the same.
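
Roughly speaking, the mechanism looks like this (a conceptual sketch with made-up similarity values, not the actual implementation):

import numpy as np

# Hypothetical document-topic similarities for 3 documents and 4 topics.
similarities = np.array([
    [0.30, 0.20, 0.05, 0.02],   # two topics exceed the threshold
    [0.08, 0.07, 0.05, 0.01],   # nothing exceeds the threshold -> all-zero row
    [0.50, 0.12, 0.11, 0.03],
])
min_similarity = 0.1

# Similarities below the threshold are ignored; the kept values themselves do not change.
kept = np.where(similarities > min_similarity, similarities, 0.0)

# Normalizing each row gives the topic distribution. A document whose similarities
# all fall below the threshold keeps an all-zero row, which is what you observed.
row_sums = kept.sum(axis=1, keepdims=True)
distribution = np.divide(kept, row_sums, out=np.zeros_like(kept), where=row_sums > 0)
print(distribution.round(2))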
