IndexError: index -2 is out of bounds for axis 0 with size 1 for the zero shot code. #1749
Comments
I believe this is a result of setting
Note that there is also a preliminary fix available at #1762 which should resolve the issue entirely.
Hello, zero-shot is a perfect extension. Thank you so much.
# All steps together
topic_model = BERTopic(
2024-02-06 13:28:02,189 - BERTopic - Embedding - Transforming documents to embeddings.
Thanks for your really quick response.
@hubernst Can you provide a reproducible example? You shared very limited code so it's unclear for example what is in
Hi, thanks for your answer.
IndexError Traceback (most recent call last)
File /opt/conda/envs/Python-3.10-Premium/lib/python3.10/site-packages/bertopic/_bertopic.py:448, in BERTopic.fit_transform(self, documents, embeddings, images, y)
File /opt/conda/envs/Python-3.10-Premium/lib/python3.10/site-packages/bertopic/_bertopic.py:3554, in BERTopic._combine_zeroshot_topics(self, documents, assigned_documents, embeddings)
File /opt/conda/envs/Python-3.10-Premium/lib/python3.10/site-packages/bertopic/_bertopic.py:3167, in BERTopic.merge_models(cls, models, min_similarity, embedding_model)
IndexError: index -2 is out of bounds for axis 0 with size 1

It works if I am not using zero-shot topic modeling.
Many greetings
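For readers unfamiliar with the message itself: NumPy raises this error when a negative index reaches further back than the array is long. The snippet below only illustrates where the error text comes from. It does not reproduce BERTopic's internal code, and `sizes` is a hypothetical stand-in array.

```python
import numpy as np

# Hypothetical stand-in: a one-element array where the calling code
# expected at least two elements.
sizes = np.array([42])

try:
    second_to_last = sizes[-2]  # -2 needs an array of length >= 2
except IndexError as err:
    message = str(err)
    print(message)
```

If the same pattern occurs inside a library, the usual cause is that an intermediate array came out shorter than the code assumed, so checking the length of the inputs to the failing step is a reasonable first diagnostic.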
I think this issue then relates to #1797 which should be relatively straightforward to fix. I would advise keeping an eye on that issue until a fix is released.
Hello, yes, of course I will check it, thank you for the fix! Hopefully today, tomorrow afternoon at the latest.
Glad to hear that it resolved at least this issue ;-) I added my response to that specific issue there.
When running zero-shot topic modelling, I encounter the following error:
I had been using this same approach on a weekly basis for a few months with no issues, but have recently changed my embedding model from OpenAI's
I cannot share my documents, as they are sensitive for my company, but my code is below. If I change the
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from bertopic import BERTopic
from bertopic.backend import OpenAIBackend
from bertopic.representation import BaseRepresentation, OpenAI
from bertopic.vectorizers import ClassTfidfTransformer
from hdbscan import HDBSCAN
from openai import AzureOpenAI
from sklearn.feature_extraction.text import CountVectorizer
from umap import UMAP
# create Azure OpenAI client
client = AzureOpenAI(
    api_key=...,
    api_version="2024-10-21",  # must be a string; unquoted, it is evaluated as integer subtraction
    azure_endpoint=...,
)
# 1. embeddings
embedding_model = OpenAIBackend(
    client,
    "text-embedding-3-large",
    generator_kwargs={
        "dimensions": 768
    }
)
# 2. dimensionality reduction
umap_model = UMAP(
    n_neighbors=15,
    n_components=5,
    min_dist=0.0,
    metric='cosine',
    random_state=42  # prevents stochastic behaviour
)
# 3. clustering
hdbscan_model = HDBSCAN(
    min_cluster_size=10,
    metric='euclidean',
    cluster_selection_method='eom',
    prediction_data=True
)
# 4. bag-of-words
vectorizer_model = CountVectorizer(
    stop_words="english",
    ngram_range=(1, 2)
)
# 5. topic representation
ctfidf_model = ClassTfidfTransformer()
# 6. list of zero-shot topics
zeroshot_topic_list = user_topics["name"].tolist() # have to keep this secret, but it's just a list of strings
# fit model to data
topic_model = BERTopic(
    # algorithm components
    embedding_model=embedding_model,  # Step 1 - Embedding model backend
    umap_model=umap_model,  # Step 2 - Reduce dimensionality
    hdbscan_model=hdbscan_model,  # Step 3 - Cluster reduced embeddings
    vectorizer_model=vectorizer_model,  # Step 4 - Tokenize topics
    ctfidf_model=ctfidf_model,  # Step 5 - Extract topic words
    # hyperparameters
    zeroshot_topic_list=zeroshot_topic_list,
    zeroshot_min_similarity=0.75,
    min_topic_size=5,
    nr_topics="auto",
    verbose=True,
)
# Fit BERTopic using pre-computed embeddings
topic_model.fit(docs, embeddings=embeddings)
Here is the output before the error:
2025-01-06 01:15:52,232 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2025-01-06 01:16:18,496 - BERTopic - Dimensionality - Completed ✓
2025-01-06 01:16:18,498 - BERTopic - Zeroshot Step 1 - Finding documents that could be assigned to either one of the zero-shot topics
2025-01-06 01:16:18,694 - BERTopic - Zeroshot Step 1 - Completed ✓
2025-01-06 01:16:52,137 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-01-06 01:16:52,227 - BERTopic - Cluster - Completed ✓
2025-01-06 01:16:52,228 - BERTopic - Zeroshot Step 2 - Combining topics from zero-shot topic modeling with topics from clustering...
2025-01-06 01:16:52,247 - BERTopic - Zeroshot Step 2 - Completed ✓
2025-01-06 01:16:52,248 - BERTopic - Representation - Extracting topics from clusters using representation models.
2025-01-06 01:16:52,402 - BERTopic - Representation - Completed ✓
2025-01-06 01:16:52,404 - BERTopic - Topic reduction - Reducing number of topics
@James-Leslie It might be a result of the updated embedding model (which might change the distribution of similarities) but also a bug that was in earlier versions of BERTopic. Are you using the latest (v0.16.4)?
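To check whether a new embedding model has shifted the similarity distribution, one option is to look at each document's best cosine similarity to the zero-shot topic embeddings and count how many exceed a given threshold. The following is a rough sketch with toy vectors, not BERTopic's internal implementation. `max_topic_similarity` is a hypothetical helper, and it assumes the documents and the topic names have already been embedded with the same model.

```python
import numpy as np


def max_topic_similarity(doc_embeddings, topic_embeddings):
    """Return, for each document, its highest cosine similarity to any topic."""
    docs = np.asarray(doc_embeddings, dtype=float)
    topics = np.asarray(topic_embeddings, dtype=float)
    # Normalise rows so that a dot product equals cosine similarity.
    docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    topics = topics / np.linalg.norm(topics, axis=1, keepdims=True)
    return (docs @ topics.T).max(axis=1)


# Toy vectors standing in for real embeddings (hypothetical data).
doc_embeddings = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
topic_embeddings = [[1.0, 0.0]]

best = max_topic_similarity(doc_embeddings, topic_embeddings)
for threshold in (0.85, 0.75, 0.5):
    # Count documents whose best similarity exceeds the threshold.
    n_reached = int((best > threshold).sum())
    print(f"threshold={threshold}: {n_reached} of {len(best)} documents exceed it")
```

If almost no documents (or almost all of them) exceed the chosen threshold, that is worth knowing before fitting, since it determines how the documents are split between the zero-shot topics and the clustering step.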
Hi @MaartenGr, I have lowered the threshold from 0.85 to 0.75 to account for the new model's distribution. Using version 0.16.4. I found the error doesn't happen if I leave the If I leave
@JamesLeslieAT @James-Leslie I just created a PR that should have fixed the issue. Could you try it out?
I made almost no changes to the zero-shot example code, but I still get this error. Could you help me solve it? Thanks.
from datasets import load_dataset
from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired
# We select a subsample of 5000 abstracts from ArXiv
dataset = load_dataset("CShorten/ML-ArXiv-Papers")["train"]
docs = dataset["abstract"][:5_000]
# We define a number of topics that we know are in the documents
zeroshot_topic_list = ["Clustering", "Topic Modeling", "Large Language Models"]
# We fit our model using the zero-shot topics
# and we define a minimum similarity. For each document,
# if the similarity does not exceed that value, it will be used
# for clustering instead.
topic_model = BERTopic(
    embedding_model="thenlper/gte-small",
    min_topic_size=15,
    zeroshot_topic_list=zeroshot_topic_list,
    zeroshot_min_similarity=.85,
    representation_model=KeyBERTInspired()
)
topics, _ = topic_model.fit_transform(docs)