Added seed phrases to KeyNMF #77
base: main
Changes from all commits
e282104
560682e
c66ec54
12402f9
bb55d15
b22acdc
@@ -0,0 +1,59 @@
# Seeded Topic Modeling

When investigating a set of documents, you might already have an idea about what aspects you would like to explore.
Some models are able to account for this by taking seed phrases or words.
This is currently only possible with KeyNMF in Turftopic, but will likely be extended in the future.

In [KeyNMF](../keynmf.md), you can describe the aspect from which you want to investigate your corpus using a free-text seed phrase,
which will then be used to extract only those topics that are relevant to your research question.

Review comment: See the comment above as well. Would probably write this more simply and then put this in a Tab section called KeyNMF (that way it is easy to see that it is only supported in KeyNMF, but also that there could be others in the future).

In this example we investigate the 20Newsgroups corpus from three different aspects:

```python
from sklearn.datasets import fetch_20newsgroups

from turftopic import KeyNMF

corpus = fetch_20newsgroups(
    subset="all",
    remove=("headers", "footers", "quotes"),
).data

model = KeyNMF(5, seed_phrase="<your seed phrase>")
model.fit(corpus)

model.print_topics()
```
=== "`'Is the death penalty moral?'`"

    | Topic ID | Highest Ranking |
    | - | - |
    | 0 | morality, moral, immoral, morals, objective, morally, animals, society, species, behavior |
    | 1 | armenian, armenians, genocide, armenia, turkish, turks, soviet, massacre, azerbaijan, kurdish |
    | 2 | murder, punishment, death, innocent, penalty, kill, crime, moral, criminals, executed |
    | 3 | gun, guns, firearms, crime, handgun, firearm, weapons, handguns, law, criminals |
    | 4 | jews, israeli, israel, god, jewish, christians, sin, christian, palestinians, christianity |

=== "`'Evidence for the existence of god'`"

    | Topic ID | Highest Ranking |
    | - | - |
    | 0 | atheist, atheists, religion, religious, theists, beliefs, christianity, christian, religions, agnostic |
    | 1 | bible, christians, christian, christianity, church, scripture, religion, jesus, faith, biblical |
    | 2 | god, existence, exist, exists, universe, creation, argument, creator, believe, life |
    | 3 | believe, faith, belief, evidence, blindly, believing, gods, believed, beliefs, convince |
    | 4 | atheism, atheists, agnosticism, belief, arguments, believe, existence, alt, believing, argument |

=== "`'Operating system kernels'`"

    | Topic ID | Highest Ranking |
    | - | - |
    | 0 | windows, dos, os, microsoft, ms, apps, pc, nt, file, shareware |
    | 1 | ram, motherboard, card, monitor, memory, cpu, vga, mhz, bios, intel |
    | 2 | unix, os, linux, intel, systems, programming, applications, compiler, software, platform |
    | 3 | disk, scsi, disks, drive, floppy, drives, dos, controller, cd, boot |
    | 4 | software, mac, hardware, ibm, graphics, apple, computer, pc, modem, program |
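The three result tabs above appear to come from refitting the model with a different `seed_phrase` each time; a minimal sketch of that workflow, using only the API shown in the new documentation (the per-phrase loop is an assumption about usage, not part of the diff):

```python
from sklearn.datasets import fetch_20newsgroups

from turftopic import KeyNMF

corpus = fetch_20newsgroups(
    subset="all",
    remove=("headers", "footers", "quotes"),
).data

# One model per research question: each seed phrase steers keyword
# extraction towards documents that are relevant to that question.
for phrase in [
    "Is the death penalty moral?",
    "Evidence for the existence of god",
    "Operating system kernels",
]:
    model = KeyNMF(5, seed_phrase=phrase)
    model.fit(corpus)
    model.print_topics()
```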
pyproject.toml

@@ -6,7 +6,7 @@ line-length=79
 [tool.poetry]
 name = "turftopic"
-version = "0.11.0"
+version = "0.12.0"
 description = "Topic modeling with contextual representations from sentence transformers."
 authors = ["Márton Kardos <[email protected]>"]
 license = "MIT"
@@ -120,6 +120,8 @@ def batch_extract_keywords(
         self,
         documents: list[str],
         embeddings: Optional[np.ndarray] = None,
+        seed_embedding: Optional[np.ndarray] = None,
+        fitting: bool = True,
     ) -> list[dict[str, float]]:
         if not len(documents):
             return []

Review comment (on `fitting: bool = True`): What does `fitting` do?
@@ -135,13 +137,25 @@ def batch_extract_keywords(
                 "Number of documents doesn't match number of embeddings."
             )
         keywords = []
-        vectorizer = clone(self.vectorizer)
-        document_term_matrix = vectorizer.fit_transform(documents)
-        batch_vocab = vectorizer.get_feature_names_out()
+        if fitting:
+            document_term_matrix = self.vectorizer.fit_transform(documents)
+        else:
+            document_term_matrix = self.vectorizer.transform(documents)
+        batch_vocab = self.vectorizer.get_feature_names_out()
         new_terms = list(set(batch_vocab) - set(self.key_to_index.keys()))
         if len(new_terms):
             self._add_terms(new_terms)
         total = embeddings.shape[0]
+        # Relevance based on similarity to seed embedding
+        document_relevance = None
+        if seed_embedding is not None:
+            if self.metric == "cosine":
+                document_relevance = cosine_similarity(
+                    [seed_embedding], embeddings
+                )[0]
+            else:
+                document_relevance = np.dot(embeddings, seed_embedding)
+            document_relevance[document_relevance < 0] = 0
         for i in range(total):
             terms = document_term_matrix[i, :].todense()
             embedding = embeddings[i].reshape(1, -1)
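Two things happen in this hunk: `fitting` decides whether the vectorizer learns a fresh vocabulary from the batch (`fit_transform`) or reuses the existing one (`transform`), and `seed_embedding`, when given, is turned into a per-document relevance score by comparing it to the document embeddings and clipping negative similarities to zero. A standalone sketch of that relevance computation, with a hypothetical `seed_relevance` helper outside the actual class:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity


def seed_relevance(
    seed_embedding: np.ndarray,
    embeddings: np.ndarray,
    metric: str = "cosine",
) -> np.ndarray:
    """Per-document relevance to the seed, mirroring the logic in the hunk above."""
    if metric == "cosine":
        # Similarity between the single seed vector and every document embedding
        relevance = cosine_similarity([seed_embedding], embeddings)[0]
    else:
        # Plain dot product for non-cosine metrics
        relevance = np.dot(embeddings, seed_embedding)
    # Documents pointing away from the seed should not contribute at all
    relevance[relevance < 0] = 0
    return relevance
```

Clipping at zero means documents that are anti-correlated with the seed are effectively excluded from keyword extraction rather than contributing negative weights.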
@@ -162,14 +176,13 @@ def batch_extract_keywords(
                 )
             )
             if self.metric == "cosine":
-                sim = cosine_similarity(embedding, word_embeddings).astype(
-                    np.float64
-                )
+                sim = cosine_similarity(embedding, word_embeddings)
+                sim = np.ravel(sim)
             else:
-                sim = np.dot(word_embeddings, embedding[0]).T.astype(
-                    np.float64
-                )
+                sim = np.dot(word_embeddings, embedding[0]).T
+            # If a seed is specified, we multiply by the document's relevance
+            if document_relevance is not None:
+                sim = document_relevance[i] * sim
             kth = min(self.top_n, len(sim) - 1)
             top = np.argpartition(-sim, kth)[:kth]
             top_words = batch_vocab[important_terms][top]
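With that in place, the word similarities of document `i` are simply scaled by that document's relevance before the top keywords are selected. A rough end-to-end sketch of how a caller might wire the new parameters together; the encoder model and the random word scores below are placeholders, not the actual KeyNMF internals:

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Placeholder encoder; turftopic's real backend wiring may differ.
encoder = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "The scheduler is a core part of the operating system kernel.",
    "The jury debated whether the death penalty is moral.",
]
seed_phrase = "Operating system kernels"

doc_embeddings = encoder.encode(documents)          # shape: (n_docs, dim)
seed_embedding = encoder.encode([seed_phrase])[0]   # shape: (dim,)

# Per-document relevance to the seed, clipped at zero as in the diff above.
relevance = cosine_similarity([seed_embedding], doc_embeddings)[0]
relevance[relevance < 0] = 0

# Stand-in for the word/document similarities computed inside the loop.
word_sims = np.random.default_rng(0).random((len(documents), 20))

# Each document's word similarities are down-weighted by its relevance,
# so keywords from off-topic documents barely influence the topics.
seeded_sims = relevance[:, None] * word_sims
```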
Review comment: Seems like you should really keep a changelog (too many important tidbits in these that people likely miss out on). You could potentially do it in a dropdown menu ("See previous versions (click to unfold)").