Added seed phrases to KeyNMF #77

Open
wants to merge 6 commits into
base: main
@x-tabdeveloping (Owner) commented Jan 31, 2025

You can now pass a seed_phrase to a KeyNMF model. The seed phrase indicates the aspect from which the model should examine the documents.

from sklearn.datasets import fetch_20newsgroups

from turftopic import KeyNMF

corpus = fetch_20newsgroups(
    subset="all",
    remove=("headers", "footers", "quotes"),
).data

model = KeyNMF(5, seed_phrase="Is homosexuality moral?")
model.fit(corpus)

model.print_topics()
| Topic ID | Highest Ranking |
| - | - |
| 0 | homosexuality, homosexual, immoral, sodom, heterosexual, sexual, fornication, christians, verses, sex |
| 1 | morality, moral, immoral, morals, objective, morally, society, animals, behavior, natural |
| 2 | christians, christian, christianity, religion, bible, god, church, religious, faith, beliefs |
| 3 | homosexual, homosexuals, heterosexual, gay, sexual, sex, heterosexuals, straight, men, sexuality |
| 4 | sin, sins, god, sinner, sinful, condemnation, sinned, scripture, punishment, sinners |
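
To give an intuition for why every topic above stays close to the seed phrase, here is a minimal sketch of how a seed embedding could steer keyword extraction. This is an assumption for illustration only and not necessarily what Turftopic does internally: the function names, the toy 2-D embeddings, and the choice to multiply the two clipped similarities are all hypothetical.

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Cosine similarity between every row of `a` and every row of `b`."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T


def seeded_keyword_scores(doc_emb, word_embs, seed_emb=None):
    """Score candidate words for one document (hypothetical helper).

    Without a seed, a word's score is its clipped similarity to the document.
    With a seed, the score is additionally scaled by the word's clipped
    similarity to the seed, so off-aspect words are pushed towards zero.
    """
    doc_sim = cosine_similarity(word_embs, doc_emb[None, :])[:, 0]
    scores = np.maximum(doc_sim, 0.0)
    if seed_emb is None:
        return scores
    seed_sim = cosine_similarity(word_embs, seed_emb[None, :])[:, 0]
    # ASSUMPTION: multiplicative weighting, chosen only for illustration.
    return scores * np.maximum(seed_sim, 0.0)


# Toy 2-D "embeddings"; a real model would use sentence-transformer vectors.
words = ["sin", "car", "moral"]
word_embs = np.array([[1.0, 0.2], [0.0, 1.0], [0.9, 0.1]])
doc_emb = np.array([0.5, 0.5])   # document touches both directions
seed_emb = np.array([1.0, 0.0])  # seed phrase points along the first axis

unseeded = seeded_keyword_scores(doc_emb, word_embs)
seeded = seeded_keyword_scores(doc_emb, word_embs, seed_emb)

for word, before, after in zip(words, unseeded, seeded):
    print(f"{word:>6}: unseeded={before:.3f} seeded={after:.3f}")
```

In this toy setup "car" is relevant to the document but orthogonal to the seed, so its seeded score drops to zero while "sin" and "moral" keep most of their weight.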

TODO:

  • Add documentation
  • Release new version

@x-tabdeveloping changed the title from "WIP: Added seed phrases to KeyNMF" to "Added seed phrases to KeyNMF" on Feb 1, 2025
@x-tabdeveloping
Owner Author

@KennethEnevoldsen can I has review?

Collaborator

@KennethEnevoldsen KennethEnevoldsen left a comment

Looking great. A few ideas to restructure the docs, but nothing holding back this PR.

@@ -20,42 +20,26 @@
- Lemmatization and Stemming
- Visualization with [topicwizard](https://github.com/x-tabdeveloping/topicwizard) 🖌️

## New in version 0.12.0: Seeded topic modeling
Collaborator

Seems like you should really keep a changelog (too many important tidbits in these that people will likely miss out on).

You could potentially do it in a dropdown menu ("See previous versions (click to unfold)")

@@ -8,20 +8,30 @@ while taking inspiration from classical matrix-decomposition approaches for extr
<figcaption>Schematic overview of KeyNMF</figcaption>
</figure>


Here's an example of how you can fit and interpret a KeyNMF model in the easiest way.
Collaborator

Suggested change:
- Here's an example of how you can fit and interpret a KeyNMF model in the easiest way.
+ Here's an example of how you can fit and interpret a KeyNMF model.

model.fit(corpus)

model.print_topics()
```

!!! question "Which Embedding model should I use"
- You should probably use KeyNMF with a `paraphrase-` type embedding model. These seem to perform best in most tasks. Some examples include:
- [paraphrase-MiniLM-L3-v2](https://huggingface.co/sentence-transformers/paraphrase-MiniLM-L3-v2) - Absolutely tiny :mouse:
Collaborator

I would focus on speed; not everyone will know that size and speed are related.

Collaborator

Seems like this is a bit redundant if you can simply use the static-retrieval-mrl-en-v1?

Comment on lines +132 to +133
In KeyNMF, you can describe this aspect, from which you want to investigate your corpus, using a free-text seed-phrase,
which will then be used to only extract topics, which are relevant to your research question.
Collaborator

Has this idea been explored before? If so, a reference would be great.

@@ -354,46 +424,49 @@ for batch in batched(zip(corpus, timestamps)):
model.partial_fit_dynamic(text_batch, timestamps=ts_batch, bins=bins)
```

### Hierarchical Topic Modeling
## Asymmetric and Instruction-tuned Embedding Models
Collaborator

Some of these things are specifically in the KeyNMF docs, why not put them in a general section?


### Asymmetric and Instruction-tuned Embedding Models
## Seeded Topic Modeling
Collaborator

Do you need to refer to the general documentation of seeded topic modeling here? (Seems like a lot of duplication.)

Might it not be better to create a table like "supported types of topic modelling" and then, in the "Seeded Topic Modelling" section, add "Models which support seeded topic modelling"?

Some models are able to account for this by taking seed phrases or words.
This is currently only possible with KeyNMF in Turftopic, but will likely be extended in the future.

In [KeyNMF](../keynmf.md), you can describe the aspect, from which you want to investigate your corpus, using a free-text seed-phrase,
Collaborator

See the comment above as well.

I would probably write this more simply and put it in a tab section called KeyNMF. That way it is easy to see that KeyNMF is the only model currently supported, but also that there could be others in the future.

@@ -120,6 +120,8 @@ def batch_extract_keywords(
self,
documents: list[str],
embeddings: Optional[np.ndarray] = None,
seed_embedding: Optional[np.ndarray] = None,
fitting: bool = True,
Collaborator

What does the fitting parameter do?

2 participants