Added seed phrases to KeyNMF #77

Open
wants to merge 6 commits into base: main
57 changes: 32 additions & 25 deletions README.md
@@ -20,42 +20,26 @@
- Lemmatization and Stemming
- Visualization with [topicwizard](https://github.com/x-tabdeveloping/topicwizard) 🖌️

## New in version 0.12.0: Seeded topic modeling
Review comment (Collaborator):

Seems like you should really keep a changelog (too many important tidbits in these that people likely miss out on).

You could potentially do it in a dropdown menu ("See previous versions (click to unfold)").


## New in version 0.11.0: Vectorizers Module

You can now use a set of custom vectorizers for topic modeling over **phrases**, as well as **lemmata** and **stems**.
You can now specify an aspect in KeyNMF from which you want to investigate your corpus by specifying a seed phrase.

```python
from turftopic import KeyNMF
from turftopic.vectorizers.spacy import NounPhraseCountVectorizer

model = KeyNMF(
n_components=10,
vectorizer=NounPhraseCountVectorizer("en_core_web_sm"),
)
model = KeyNMF(5, seed_phrase="Is the death penalty moral?")
model.fit(corpus)

model.print_topics()
```

| Topic ID | Highest Ranking |
| - | - |
| | ... |
| 3 | fanaticism, theism, fanatism, all fanatism, theists, strong theism, strong atheism, fanatics, precisely some theists, all theism |
| 4 | religion foundation darwin fish bumper stickers, darwin fish, atheism, 3d plastic fish, fish symbol, atheist books, atheist organizations, negative atheism, positive atheism, atheism index |
| | ... |

Turftopic now also comes with a **Chinese vectorizer** for easier use, as well as a generalist **multilingual vectorizer**.

```python
from turftopic.vectorizers.chinese import default_chinese_vectorizer
from turftopic.vectorizers.spacy import TokenCountVectorizer

chinese_vectorizer = default_chinese_vectorizer()
arabic_vectorizer = TokenCountVectorizer("ar", remove_stopwords=True)
danish_vectorizer = TokenCountVectorizer("da", remove_stopwords=True)
...

```
| 0 | morality, moral, immoral, morals, objective, morally, animals, society, species, behavior |
| 1 | armenian, armenians, genocide, armenia, turkish, turks, soviet, massacre, azerbaijan, kurdish |
| 2 | murder, punishment, death, innocent, penalty, kill, crime, moral, criminals, executed |
| 3 | gun, guns, firearms, crime, handgun, firearm, weapons, handguns, law, criminals |
| 4 | jews, israeli, israel, god, jewish, christians, sin, christian, palestinians, christianity |


## Basics [(Documentation)](https://x-tabdeveloping.github.io/turftopic/)
@@ -179,6 +163,29 @@ model.print_topics()
| 3 | Storage Technologies | disk, drive, scsi, drives, disks, floppy, ide, dos, controller, boot |
| | ... |

### Vectorizers Module

You can use a set of custom vectorizers for topic modeling over **phrases**, as well as **lemmata** and **stems**.

```python
from turftopic import KeyNMF
from turftopic.vectorizers.spacy import NounPhraseCountVectorizer

model = KeyNMF(
    n_components=10,
    vectorizer=NounPhraseCountVectorizer("en_core_web_sm"),
)
model.fit(corpus)
model.print_topics()
```

| Topic ID | Highest Ranking |
| - | - |
| | ... |
| 3 | fanaticism, theism, fanatism, all fanatism, theists, strong theism, strong atheism, fanatics, precisely some theists, all theism |
| 4 | religion foundation darwin fish bumper stickers, darwin fish, atheism, 3d plastic fish, fish symbol, atheist books, atheist organizations, negative atheism, positive atheism, atheism index |
| | ... |

### Visualization

Turftopic does not come with built-in visualization utilities, but [topicwizard](https://github.com/x-tabdeveloping/topicwizard), an interactive topic model visualization library, is compatible with all models from Turftopic.
239 changes: 156 additions & 83 deletions docs/KeyNMF.md

Large diffs are not rendered by default.

772 changes: 772 additions & 0 deletions docs/images/nmf_explanation.svg
59 changes: 59 additions & 0 deletions docs/seeded.md
@@ -0,0 +1,59 @@
# Seeded Topic Modeling

When investigating a set of documents, you might already have an idea about what aspects you would like to explore.
Some models are able to account for this by taking seed phrases or words.
This is currently only possible with KeyNMF in Turftopic, but will likely be extended in the future.

In [KeyNMF](../keynmf.md), you can describe the aspect from which you want to investigate your corpus using a free-text seed phrase,
Review comment (Collaborator):

See the comment above as well.

I would probably write this more simply and then put it in a tab section called KeyNMF (that way it is easy to see that the only model supported is KeyNMF, but also that there could be others in the future).

which is then used to extract only the topics that are relevant to your research question.

In this example we investigate the 20Newsgroups corpus from three different aspects:

```python
from sklearn.datasets import fetch_20newsgroups

from turftopic import KeyNMF

corpus = fetch_20newsgroups(
    subset="all",
    remove=("headers", "footers", "quotes"),
).data

model = KeyNMF(5, seed_phrase="<your seed phrase>")
model.fit(corpus)

model.print_topics()
```


=== "`'Is the death penalty moral?'`"

    | Topic ID | Highest Ranking |
    | - | - |
    | 0 | morality, moral, immoral, morals, objective, morally, animals, society, species, behavior |
    | 1 | armenian, armenians, genocide, armenia, turkish, turks, soviet, massacre, azerbaijan, kurdish |
    | 2 | murder, punishment, death, innocent, penalty, kill, crime, moral, criminals, executed |
    | 3 | gun, guns, firearms, crime, handgun, firearm, weapons, handguns, law, criminals |
    | 4 | jews, israeli, israel, god, jewish, christians, sin, christian, palestinians, christianity |

=== "`'Evidence for the existence of god'`"

    | Topic ID | Highest Ranking |
    | - | - |
    | 0 | atheist, atheists, religion, religious, theists, beliefs, christianity, christian, religions, agnostic |
    | 1 | bible, christians, christian, christianity, church, scripture, religion, jesus, faith, biblical |
    | 2 | god, existence, exist, exists, universe, creation, argument, creator, believe, life |
    | 3 | believe, faith, belief, evidence, blindly, believing, gods, believed, beliefs, convince |
    | 4 | atheism, atheists, agnosticism, belief, arguments, believe, existence, alt, believing, argument |

=== "`'Operating system kernels'`"

    | Topic ID | Highest Ranking |
    | - | - |
    | 0 | windows, dos, os, microsoft, ms, apps, pc, nt, file, shareware |
    | 1 | ram, motherboard, card, monitor, memory, cpu, vga, mhz, bios, intel |
    | 2 | unix, os, linux, intel, systems, programming, applications, compiler, software, platform |
    | 3 | disk, scsi, disks, drive, floppy, drives, dos, controller, cd, boot |
    | 4 | software, mac, hardware, ibm, graphics, apple, computer, pc, modem, program |


1 change: 1 addition & 0 deletions mkdocs.yml
@@ -8,6 +8,7 @@ nav:
- Interpreting and Visualizing Models: model_interpretation.md
- Modifying and Finetuning Models: finetuning.md
- Saving and Loading Models: persistence.md
- Seeded Topic Modeling: seeded.md
- Dynamic Topic Modeling: dynamic.md
- Online Topic Modeling: online.md
- Hierarchical Topic Modeling: hierarchical.md
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -6,7 +6,7 @@ line-length=79

 [tool.poetry]
 name = "turftopic"
-version = "0.11.0"
+version = "0.12.0"
 description = "Topic modeling with contextual representations from sentence transformers."
 authors = ["Márton Kardos <[email protected]>"]
 license = "MIT"
31 changes: 22 additions & 9 deletions turftopic/models/_keynmf.py
@@ -120,6 +120,8 @@ def batch_extract_keywords(
         self,
         documents: list[str],
         embeddings: Optional[np.ndarray] = None,
+        seed_embedding: Optional[np.ndarray] = None,
+        fitting: bool = True,
Review comment (Collaborator):

What does `fitting` do?

     ) -> list[dict[str, float]]:
         if not len(documents):
             return []
@@ -135,13 +137,25 @@ def batch_extract_keywords(
                 "Number of documents doesn't match number of embeddings."
             )
         keywords = []
-        vectorizer = clone(self.vectorizer)
-        document_term_matrix = vectorizer.fit_transform(documents)
-        batch_vocab = vectorizer.get_feature_names_out()
+        if fitting:
+            document_term_matrix = self.vectorizer.fit_transform(documents)
+        else:
+            document_term_matrix = self.vectorizer.transform(documents)
+        batch_vocab = self.vectorizer.get_feature_names_out()
         new_terms = list(set(batch_vocab) - set(self.key_to_index.keys()))
         if len(new_terms):
             self._add_terms(new_terms)
         total = embeddings.shape[0]
+        # Relevance based on similarity to seed embedding
+        document_relevance = None
+        if seed_embedding is not None:
+            if self.metric == "cosine":
+                document_relevance = cosine_similarity(
+                    [seed_embedding], embeddings
+                )[0]
+            else:
+                document_relevance = np.dot(embeddings, seed_embedding)
+            document_relevance[document_relevance < 0] = 0
         for i in range(total):
             terms = document_term_matrix[i, :].todense()
             embedding = embeddings[i].reshape(1, -1)
@@ -162,14 +176,13 @@ def batch_extract_keywords(
)
)
             if self.metric == "cosine":
-                sim = cosine_similarity(embedding, word_embeddings).astype(
-                    np.float64
-                )
+                sim = cosine_similarity(embedding, word_embeddings)
                 sim = np.ravel(sim)
             else:
-                sim = np.dot(word_embeddings, embedding[0]).T.astype(
-                    np.float64
-                )
+                sim = np.dot(word_embeddings, embedding[0]).T
+            # If a seed is specified, we multiply by the document's relevance
+            if document_relevance is not None:
+                sim = document_relevance[i] * sim
             kth = min(self.top_n, len(sim) - 1)
             top = np.argpartition(-sim, kth)[:kth]
             top_words = batch_vocab[important_terms][top]
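
The seeding logic in this hunk can be read as three steps: score each document against the seed embedding, clip negative scores to zero, and scale that document's keyword similarities by the score. The following is a minimal, dependency-free sketch of that idea, not the code from the PR. The helpers `cosine` and `seed_weighted_similarities` and the toy 2-D vectors are hypothetical and exist only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def seed_weighted_similarities(seed, doc_embedding, word_embeddings):
    """Scale a document's keyword similarities by its relevance to the seed."""
    # Relevance of the document to the seed; negative values are clipped to 0,
    # so documents pointing away from the seed contribute no keyword weight.
    relevance = max(cosine(seed, doc_embedding), 0.0)
    return {
        word: relevance * cosine(doc_embedding, emb)
        for word, emb in word_embeddings.items()
    }


# Toy 2-D "embeddings" (hypothetical values, for illustration only).
seed = [1.0, 0.0]
words = {"penalty": [1.0, 0.0], "floppy": [0.0, 1.0]}

on_topic = seed_weighted_similarities(seed, [1.0, 0.0], words)
off_topic = seed_weighted_similarities(seed, [-1.0, 0.0], words)
```

In this sketch, a document that points away from the seed ends up with zero weight on every keyword, which is one way to read why off-topic documents stop contributing terms to the extracted topics.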
19 changes: 17 additions & 2 deletions turftopic/models/keynmf.py
@@ -49,6 +49,10 @@ class KeyNMF(ContextualModel, DynamicTopicModel):
         Random state to use so that results are exactly reproducible.
     metric: "cosine" or "dot", default "cosine"
         Similarity metric to use for keyword extraction.
+    seed_phrase: str, default None
+        Describes an aspect of the corpus that the model should explore.
+        It can be a free-text query, such as
+        "Christian Denominations: Protestantism and Catholicism"
     """

     def __init__(
@@ -61,6 +65,7 @@ def __init__(
         top_n: int = 25,
         random_state: Optional[int] = None,
         metric: Literal["cosine", "dot"] = "cosine",
+        seed_phrase: Optional[str] = None,
     ):
         self.random_state = random_state
         self.n_components = n_components
@@ -85,11 +90,16 @@ def __init__(
             encoder=self.encoder_,
             metric=self.metric,
         )
+        self.seed_phrase = seed_phrase
+        self.seed_embedding = None
+        if self.seed_phrase is not None:
+            self.seed_embedding = self.encoder_.encode([self.seed_phrase])[0]

     def extract_keywords(
         self,
         batch_or_document: Union[str, list[str]],
         embeddings: Optional[np.ndarray] = None,
+        fitting: bool = True,
     ) -> list[dict[str, float]]:
         """Extracts keywords from a document or a batch of documents.

@@ -103,7 +113,10 @@ def extract_keywords(
         if isinstance(batch_or_document, str):
             batch_or_document = [batch_or_document]
         return self.extractor.batch_extract_keywords(
-            batch_or_document, embeddings=embeddings
+            batch_or_document,
+            embeddings=embeddings,
+            seed_embedding=self.seed_embedding,
+            fitting=fitting,
         )

     def vectorize(
@@ -249,7 +262,9 @@ def transform(
)
         if keywords is None:
             keywords = self.extract_keywords(
-                list(raw_documents), embeddings=embeddings
+                list(raw_documents),
+                embeddings=embeddings,
+                fitting=False,
             )
         return self.model.transform(keywords)
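
On the `fitting` flag asked about in review: judging from this diff alone, it selects between `fit_transform` (learn the vocabulary from the given documents) and `transform` (reuse the vocabulary learned earlier), and `transform()` passes `fitting=False`. The sketch below illustrates that distinction with a toy stand-in. `ToyCountVectorizer` and `vectorize` are hypothetical names that only mimic a scikit-learn-style `fit_transform`/`transform` contract; they are not part of Turftopic.

```python
class ToyCountVectorizer:
    """Hypothetical stand-in for a scikit-learn style count vectorizer."""

    def __init__(self):
        self.vocabulary_ = {}

    def fit_transform(self, documents):
        # Learn a fresh vocabulary from these documents, then count terms.
        terms = sorted({tok for doc in documents for tok in doc.split()})
        self.vocabulary_ = {term: i for i, term in enumerate(terms)}
        return self.transform(documents)

    def transform(self, documents):
        # Count only terms already in the vocabulary; unseen terms are dropped.
        rows = []
        for doc in documents:
            row = [0] * len(self.vocabulary_)
            for tok in doc.split():
                if tok in self.vocabulary_:
                    row[self.vocabulary_[tok]] += 1
            rows.append(row)
        return rows


def vectorize(vectorizer, documents, fitting=True):
    """Same branch shape as the diff: fit when training, reuse otherwise."""
    if fitting:
        return vectorizer.fit_transform(documents)
    return vectorizer.transform(documents)


vectorizer = ToyCountVectorizer()
train = vectorize(vectorizer, ["death penalty", "gun crime"], fitting=True)
new = vectorize(vectorizer, ["penalty for crime"], fitting=False)
```

With `fitting=False` in this toy version, the columns stay aligned with the vocabulary learned during fitting, and a term that was never seen (here `for`) is simply ignored instead of changing the vocabulary.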
