Merge pull request #78 from x-tabdeveloping/topic_data_upgrade
`TopicData` overhaul and hierarchical clustering
x-tabdeveloping authored Feb 18, 2025
2 parents ab2787e + d129561 commit 615a6a2
Showing 29 changed files with 2,929 additions and 1,586 deletions.
3 changes: 1 addition & 2 deletions .github/workflows/tests.yml
@@ -29,8 +29,7 @@ jobs:
run: python3 -c "import sys; print(sys.version)"

- name: Install dependencies
run: python3 -m pip install --upgrade turftopic[pyro-ppl] pandas pytest

run: python3 -m pip install --upgrade turftopic[pyro-ppl] pandas pytest plotly igraph
- name: Run tests
run: python3 -m pytest tests/

40 changes: 6 additions & 34 deletions README.md
@@ -5,41 +5,13 @@


## Features
- Implementations of transformer-based topic models:
- Semantic Signal Separation - S³ 🧭
- KeyNMF 🔑
- GMM :gem:
- Clustering Topic Models: BERTopic and Top2Vec
- Autoencoding Topic Models: CombinedTM and ZeroShotTM
- FASTopic
- Dynamic, Online and Hierarchical Topic Modeling
- Streamlined scikit-learn compatible API 🛠️
- Easy topic interpretation 🔍
- Automated topic naming with LLMs
- Topic modeling with keyphrases :key:
- Lemmatization and Stemming
- Visualization with [topicwizard](https://github.com/x-tabdeveloping/topicwizard) 🖌️

## New in version 0.12.0: Seeded topic modeling

You can now specify an aspect in KeyNMF from which you want to investigate your corpus by specifying a seed phrase.

```python
from turftopic import KeyNMF

model = KeyNMF(5, seed_phrase="Is the death penalty moral?")
model.fit(corpus)

model.print_topics()
```

| Topic ID | Highest Ranking |
| | |
| - | - |
| 0 | morality, moral, immoral, morals, objective, morally, animals, society, species, behavior |
| 1 | armenian, armenians, genocide, armenia, turkish, turks, soviet, massacre, azerbaijan, kurdish |
| 2 | murder, punishment, death, innocent, penalty, kill, crime, moral, criminals, executed |
| 3 | gun, guns, firearms, crime, handgun, firearm, weapons, handguns, law, criminals |
| 4 | jews, israeli, israel, god, jewish, christians, sin, christian, palestinians, christianity |
| SOTA Transformer-based Topic Models | :compass: [S³](https://x-tabdeveloping.github.io/turftopic/s3/), :key: [KeyNMF](https://x-tabdeveloping.github.io/turftopic/KeyNMF/), :gem: [GMM](https://x-tabdeveloping.github.io/turftopic/GMM/), [Clustering Models](https://x-tabdeveloping.github.io/turftopic/GMM/), [CTMs](https://x-tabdeveloping.github.io/turftopic/ctm/), [FASTopic](https://x-tabdeveloping.github.io/turftopic/FASTopic/) |
| Models for all Scenarios | :chart_with_upwards_trend: [Dynamic](https://x-tabdeveloping.github.io/turftopic/dynamic/), :ocean: [Online](https://x-tabdeveloping.github.io/turftopic/online/), :herb: [Seeded](https://x-tabdeveloping.github.io/turftopic/seeded/), and :evergreen_tree: [Hierarchical](https://x-tabdeveloping.github.io/turftopic/hierarchical/) topic modeling |
| [Easy Interpretation](https://x-tabdeveloping.github.io/turftopic/model_interpretation/) | :bookmark_tabs: Pretty Printing, :bar_chart: Interactive Figures, :art: [topicwizard](https://github.com/x-tabdeveloping/topicwizard) compatible |
| [Topic Naming](https://x-tabdeveloping.github.io/turftopic/namers/) | :robot: LLM-based, N-gram Retrieval, :wave: Manual |
| [Informative Topic Descriptions](https://x-tabdeveloping.github.io/turftopic/vectorizers/) | :key: Keyphrases, Noun-phrases, Lemmatization, Stemming |


## Basics [(Documentation)](https://x-tabdeveloping.github.io/turftopic/)
444 changes: 256 additions & 188 deletions docs/clustering.md

Large diffs are not rendered by default.

88 changes: 43 additions & 45 deletions docs/dynamic.md
@@ -2,11 +2,13 @@

If you want to examine the evolution of topics over time, you will need a dynamic topic model.

> Note that regular static models can also be used to study the evolution of topics and information dynamics, but they can't capture changes in the topics themselves.
> You will need to install Plotly for plotting to work.
## Models
```bash
pip install plotly
```

In Turftopic you can currently use three different topic models for modeling topics over time:
You can currently use three different topic models for modeling topics over time:

1. [ClusteringTopicModel](clustering.md), where an overall model is fitted on the whole corpus, and then term importances are estimated over time slices.
2. [GMM](GMM.md), similarly to clustering models, term importances are reestimated per time slice
@@ -33,50 +35,46 @@ model = KeyNMF(5, top_n=5, random_state=42)
document_topic_matrix = model.fit_transform_dynamic(
corpus, timestamps=timestamps, bins=10
)
# or alternatively:
topic_data = model.prepare_dynamic_topic_data(corpus, timestamps=timestamps, bins=10)
```
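
The `bins=10` argument above asks for the timestamps to be grouped into ten time slices. As a rough illustration of what such binning can look like, here is a standard-library sketch. It assumes equal-width slices, and `bin_timestamps` is a hypothetical helper written for this example, not part of Turftopic; the library's actual slicing may differ.

```python
from datetime import datetime, timedelta


def bin_timestamps(timestamps, bins):
    """Hypothetical helper: assign each timestamp to one of `bins` equal-width slices."""
    start, end = min(timestamps), max(timestamps)
    width = (end - start) / bins
    if width == timedelta(0):
        # All timestamps coincide, so everything falls into the first slice.
        return [start] * (bins + 1), [0] * len(timestamps)
    # Slice boundaries, from the earliest to the latest timestamp.
    edges = [start + i * width for i in range(bins + 1)]
    # The latest timestamp sits on the last edge, so clamp it into the final slice.
    labels = [min(int((ts - start) / width), bins - 1) for ts in timestamps]
    return edges, labels


# Toy data: five documents spread over ten days, grouped into five slices.
timestamps = [datetime(2020, 1, 1) + timedelta(days=d) for d in (0, 1, 5, 9, 10)]
edges, labels = bin_timestamps(timestamps, bins=5)
print(labels)
```

Each label is the index of the time slice a document belongs to; a dynamic model can then re-estimate term importances within each slice.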
!!! quote "Interpret Topics over Time"
=== "Interactive Plot"

```python
model.plot_topics_over_time()
# or
topic_data.plot_topics_over_time()
```

<iframe src="../images/dynamic_keynmf.html", title="Topics over time", style="height:800px;width:100%;padding:0px;border:none;"></iframe>
<figcaption> Topics over time in a Dynamic KeyNMF model. </figcaption>

=== "Over-time Table"

```python
model.print_topics_over_time()
# or
topic_data.print_topics_over_time()
```

<center>

| Time Slice | 0_olympics_tokyo_athletes_beijing | 1_covid_vaccine_pandemic_coronavirus | 2_olympic_athletes_ioc_athlete | 3_djokovic_novak_tennis_federer | 4_ronaldo_cristiano_messi_manchester |
| - | - | - | - | - | - |
| 2012 12 06 - 2013 11 10 | genocide, yugoslavia, karadzic, facts, cnn | cnn, russia, chechnya, prince, merkel | france, cnn, francois, hollande, bike | tennis, tournament, wimbledon, grass, courts | beckham, soccer, retired, david, learn |
| 2013 11 10 - 2014 10 14 | keith, stones, richards, musician, author | georgia, russia, conflict, 2008, cnn | civil, rights, hear, why, should | cnn, kidneys, traffickers, organ, nepal | ronaldo, cristiano, goalscorer, soccer, player |
| 2014 10 14 - 2015 09 18 | ethiopia, brew, coffee, birthplace, anderson | climate, sutter, countries, snapchat, injustice | women, guatemala, murder, country, worst | cnn, climate, oklahoma, women, topics | sweden, parental, dads, advantage, leave |
| 2015 09 18 - 2016 08 22 | snow, ice, winter, storm, pets | climate, crisis, drought, outbreaks, syrian | women, vulnerabilities, frontlines, countries, marcelas | cnn, warming, climate, sutter, theresa | sutter, band, paris, fans, crowd |
| 2016 08 22 - 2017 07 26 | derby, epsom, sporting, race, spectacle | overdoses, heroin, deaths, macron, emmanuel | fear, died, indigenous, people, arthur | siblings, amnesia, palombo, racial, mh370 | bobbi, measles, raped, camp, rape |
| 2017 07 26 - 2018 06 30 | her, percussionist, drums, she, deported | novichok, hurricane, hospital, deaths, breathing | women, day, celebrate, taliban, international | abuse, harassment, cnn, women, pilgrimage | maradona, argentina, history, jadon, rape |
| 2018 06 30 - 2019 06 03 | athletes, teammates, celtics, white, racism | pope, archbishop, francis, vigano, resignation | racism, athletes, teammates, celtics, white | golf, iceland, volcanoes, atlantic, ocean | rape, sudanese, racist, women, soldiers |
| 2019 06 03 - 2020 05 07 | esports, climate, ice, racers, culver | esports, coronavirus, pandemic, football, teams | racers, women, compete, zone, bery | serena, stadium, sasha, final, naomi | kobe, bryant, greatest, basketball, influence |
| 2020 05 07 - 2021 04 10 | olympics, beijing, xinjiang, ioc, boycott | covid, vaccine, coronavirus, pandemic, vaccination | olympic, japan, medalist, canceled, tokyo | djokovic, novak, tennis, federer, masterclass | ronaldo, cristiano, messi, juventus, barcelona |
| 2021 04 10 - 2022 03 16 | olympics, tokyo, athletes, beijing, medal | covid, pandemic, vaccine, vaccinated, coronavirus | olympic, athletes, ioc, medal, athlete | djokovic, novak, tennis, wimbledon, federer | ronaldo, cristiano, messi, manchester, scored |

</center>

You can use the `print_topics_over_time()` method for producing a table of the topics over the generated time slices.

> This example uses CNN news data.
```python
model.print_topics_over_time()
```

<center>

| Time Slice | 0_olympics_tokyo_athletes_beijing | 1_covid_vaccine_pandemic_coronavirus | 2_olympic_athletes_ioc_athlete | 3_djokovic_novak_tennis_federer | 4_ronaldo_cristiano_messi_manchester |
| - | - | - | - | - | - |
| 2012 12 06 - 2013 11 10 | genocide, yugoslavia, karadzic, facts, cnn | cnn, russia, chechnya, prince, merkel | france, cnn, francois, hollande, bike | tennis, tournament, wimbledon, grass, courts | beckham, soccer, retired, david, learn |
| 2013 11 10 - 2014 10 14 | keith, stones, richards, musician, author | georgia, russia, conflict, 2008, cnn | civil, rights, hear, why, should | cnn, kidneys, traffickers, organ, nepal | ronaldo, cristiano, goalscorer, soccer, player |
| 2014 10 14 - 2015 09 18 | ethiopia, brew, coffee, birthplace, anderson | climate, sutter, countries, snapchat, injustice | women, guatemala, murder, country, worst | cnn, climate, oklahoma, women, topics | sweden, parental, dads, advantage, leave |
| 2015 09 18 - 2016 08 22 | snow, ice, winter, storm, pets | climate, crisis, drought, outbreaks, syrian | women, vulnerabilities, frontlines, countries, marcelas | cnn, warming, climate, sutter, theresa | sutter, band, paris, fans, crowd |
| 2016 08 22 - 2017 07 26 | derby, epsom, sporting, race, spectacle | overdoses, heroin, deaths, macron, emmanuel | fear, died, indigenous, people, arthur | siblings, amnesia, palombo, racial, mh370 | bobbi, measles, raped, camp, rape |
| 2017 07 26 - 2018 06 30 | her, percussionist, drums, she, deported | novichok, hurricane, hospital, deaths, breathing | women, day, celebrate, taliban, international | abuse, harassment, cnn, women, pilgrimage | maradona, argentina, history, jadon, rape |
| 2018 06 30 - 2019 06 03 | athletes, teammates, celtics, white, racism | pope, archbishop, francis, vigano, resignation | racism, athletes, teammates, celtics, white | golf, iceland, volcanoes, atlantic, ocean | rape, sudanese, racist, women, soldiers |
| 2019 06 03 - 2020 05 07 | esports, climate, ice, racers, culver | esports, coronavirus, pandemic, football, teams | racers, women, compete, zone, bery | serena, stadium, sasha, final, naomi | kobe, bryant, greatest, basketball, influence |
| 2020 05 07 - 2021 04 10 | olympics, beijing, xinjiang, ioc, boycott | covid, vaccine, coronavirus, pandemic, vaccination | olympic, japan, medalist, canceled, tokyo | djokovic, novak, tennis, federer, masterclass | ronaldo, cristiano, messi, juventus, barcelona |
| 2021 04 10 - 2022 03 16 | olympics, tokyo, athletes, beijing, medal | covid, pandemic, vaccine, vaccinated, coronavirus | olympic, athletes, ioc, medal, athlete | djokovic, novak, tennis, wimbledon, federer | ronaldo, cristiano, messi, manchester, scored |

</center>

You can also display the topics over time on an interactive HTML figure.
The most important words for topics get revealed by hovering over them.

> You will need to install Plotly for this to work.
```bash
pip install plotly
```

```python
model.plot_topics_over_time()
```

<figure>
<iframe src="../images/dynamic_keynmf.html", title="Topics over time", style="height:800px;width:1000px;padding:0px;border:none;"></iframe>
<figcaption> Topics over time in a Dynamic KeyNMF model. </figcaption>
</figure>

## API reference

106 changes: 61 additions & 45 deletions docs/hierarchical.md
@@ -1,16 +1,27 @@
# Hierarchical Topic Modeling

> Note: Hierarchical topic modeling in Turftopic is still in its early stages, you can expect more visualization utilities, tools and models in the future :sparkles:
You might expect some topics in your corpus to belong to a hierarchy of topics.
Some models in Turftopic (currently only [KeyNMF](KeyNMF.md)) allow you to investigate hierarchical relations and build a taxonomy of topics in a corpus.
Some models in Turftopic allow you to investigate hierarchical relations and build a taxonomy of topics in a corpus.

Models in Turftopic that can model hierarchical relations have a `hierarchy` property that you can manipulate, print, and visualize:

```python
from turftopic import ClusteringTopicModel

model = ClusteringTopicModel(n_reduce_to=10).fit(corpus)
# We cut at level 3 for plotting, since the hierarchy is very deep
model.hierarchy.cut(3).plot_tree()
```

_Drag and click to zoom, hover to see word importance_

<iframe src="../images/tree_plot.html", title="Topic hierarchy in a clustering model", style="height:800px;width:100%;padding:0px;border:none;"></iframe>
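
To make the idea of cutting a deep hierarchy more concrete, here is a small self-contained sketch. `ToyNode` and its `cut()` method are hypothetical stand-ins written for this example; they are not Turftopic's node classes, and the assumption that cutting means "keep only the first few levels" is ours.

```python
from dataclasses import dataclass, field


@dataclass
class ToyNode:
    """Hypothetical stand-in for a topic node; not Turftopic's implementation."""

    name: str
    children: list["ToyNode"] = field(default_factory=list)

    def cut(self, max_depth: int) -> "ToyNode":
        """Return a copy that keeps descendants at most `max_depth` levels down."""
        if max_depth <= 0:
            return ToyNode(self.name)
        return ToyNode(self.name, [c.cut(max_depth - 1) for c in self.children])

    def depth(self) -> int:
        """Number of levels below this node (0 for a leaf)."""
        return max((1 + c.depth() for c in self.children), default=0)


# A toy hierarchy three levels deep.
root = ToyNode(
    "Root",
    [ToyNode("0", [ToyNode("0.0", [ToyNode("0.0.1")])]), ToyNode("1")],
)
shallow = root.cut(2)
print(root.depth(), shallow.depth())
```

Returning a copy rather than pruning in place keeps the full hierarchy available, so a shallow view can be used for plotting only.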


## Divisive Hierarchical Modeling
## 1. Divisive/Top-down Hierarchical Modeling

Currently Turftopic, in contrast with other topic modeling libraries only allows for hierarchical modeling in a divisive context.
This means that topics can be divided into subtopics in a **top-down** manner.
[KeyNMF](KeyNMF.md) does not discover a topic hierarchy automatically,
but you can manually instruct the model to find subtopics in larger topics.
In divisive modeling, you start from larger structures, higher up in the hierarchy, and divide topics into smaller sub-topics on-demand.
This is how hierarchical modeling works in [KeyNMF](keynmf.md): by default the model does not discover a topic hierarchy, but you can divide topics into as many subtopics as you see fit.

As a demonstration, let's load a corpus that we know to have hierarchical themes.

@@ -78,30 +89,12 @@ model.hierarchy.divide_children(n_subtopics=3)
│ ├── <b style="color: magenta">0.0</b>: dos, file, disk, files, program, windows, disks, shareware, norton, memory <br>
│ ├── <b style="color: magenta">0.1</b>: os, unix, windows, microsoft, apps, nt, ibm, ms, os2, platform <br>
│ └── <b style="color: magenta">0.2</b>: card, drivers, monitor, driver, vga, ram, motherboard, cards, graphics, ati <br>
└── <b style="color: blue">1</b>: atheism, atheist, atheists, religion, christians, religious, belief, christian, god, beliefs <br>
. ├── <b style="color: magenta">1.0</b>: atheism, alt, newsgroup, reading, faq, islam, questions, read, newsgroups, readers <br>
. ├── <b style="color: magenta">1.1</b>: atheists, atheist, belief, theists, beliefs, religious, religion, agnostic, gods, religions <br>
. └── <b style="color: magenta">1.2</b>: morality, bible, christian, christians, moral, christianity, biblical, immoral, god, religion <br>
...
</tt>
</div>

As you can see, the model managed to identify meaningful subtopics of the two larger topics we found earlier.
Topic 0 got divided into a topic mostly concerned with dos and windows, a topic on operating systems in general, and one about hardware,
while Topic 1 contains a topic about newsgroups, one about atheism, and one about morality and christianity.

You can also easily access nodes of the hierarchy by indexing it:
```python
model.hierarchy[0]
```

<div style="background-color: #F5F5F5; padding: 10px; padding-left: 20px; padding-right: 20px;">
<tt style="font-size: 11pt">
<b style="color: blue">0</b>: windows, dos, os, disk, card, drivers, file, pc, files, microsoft <br>
├── <b style="color: magenta">0.0</b>: dos, file, disk, files, program, windows, disks, shareware, norton, memory <br>
├── <b style="color: magenta">0.1</b>: os, unix, windows, microsoft, apps, nt, ibm, ms, os2, platform <br>
└── <b style="color: magenta">0.2</b>: card, drivers, monitor, driver, vga, ram, motherboard, cards, graphics, ati <br>
</tt>
</div>
Topic 0 got divided into a topic mostly concerned with dos and windows, a topic on operating systems in general, and one about hardware.

You can also divide an individual topic into a number of subtopics by using the `divide()` method.
Let us divide Topic 0.0 into 5 subtopics.
@@ -118,35 +111,58 @@ model.hierarchy
│ ├── <b style="color: magenta">0.0</b>: dos, file, disk, files, program, windows, disks, shareware, norton, memory <br>
│ │ ├── <b style="color: green">0.0.1</b>: file, files, ftp, bmp, program, windows, shareware, directory, bitmap, zip <br>
│ │ ├── <b style="color: green">0.0.2</b>: os, windows, unix, microsoft, crash, apps, crashes, nt, pc, operating <br>
│ │ ├── <b style="color: green">0.0.3</b>: disk, disks, floppy, drive, drives, scsi, boot, hd, norton, ide <br>
│ │ ├── <b style="color: green">0.0.4</b>: dos, modem, command, ms, emm386, serial, commands, 386, drivers, batch <br>
│ │ └── <b style="color: green">0.0.5</b>: printer, print, printing, fonts, font, postscript, hp, printers, output, driver <br>
│ ├── <b style="color: magenta">0.1</b>: os, unix, windows, microsoft, apps, nt, ibm, ms, os2, platform <br>
│ └── <b style="color: magenta">0.2</b>: card, drivers, monitor, driver, vga, ram, motherboard, cards, graphics, ati <br>
└── <b style="color: blue">1</b>: atheism, atheist, atheists, religion, christians, religious, belief, christian, god, beliefs <br>
. ├── <b style="color: magenta">1.0</b>: atheism, alt, newsgroup, reading, faq, islam, questions, read, newsgroups, readers <br>
. ├── <b style="color: magenta">1.1</b>: atheists, atheist, belief, theists, beliefs, religious, religion, agnostic, gods, religions <br>
. └── <b style="color: magenta">1.2</b>: morality, bible, christian, christians, moral, christianity, biblical, immoral, god, religion <br>
...
</tt>
</div>

## Visualization
You can visualize hierarchies in Turftopic by using the `plot_tree()` method of a topic hierarchy.
The plot is interactive and you can zoom in or hover on individual topics to get an overview of the most important words.
## 2. Agglomerative/Bottom-up Hierarchical Modeling

In other models, hierarchies arise by starting from smaller, more specific topics and merging them based on their similarity until a desired number of top-level topics is obtained.

This is how it is done in [clustering topic models](clustering.md) like BERTopic and Top2Vec.
Clustering models typically find a lot of topics, and it can help interpretation to merge topics until you are left with 10-20 top-level topics.

You can have a clustering model do this automatically by setting `n_reduce_to` on initialization, or you can do it manually with `reduce_topics()`.
For more details, check our guide on [Clustering models](clustering.md).

```python
model.hierarchy.plot_tree()
from turftopic import ClusteringTopicModel

model = ClusteringTopicModel(
n_reduce_to=10,
feature_importance="centroid",
reduction_method="smallest",
reduction_topic_representation="centroid",
reduction_distance_metric="cosine",
)
model.fit(corpus)

print(model.hierarchy)
```

<figure>
<img src="../images/hierarchy_tree.png" width="90%" style="margin-left: auto;margin-right: auto;">
<figcaption>Tree plot of the hierarchy.</figcaption>
</figure>
<div style="background-color: #F5F5F5; padding: 10px; padding-left: 20px; padding-right: 20px;">
<tt style="font-size: 11pt">
<b>Root</b>: <br>
├── <b style="color:blue">-1</b>: documented, obsolete, et4000, concerns, dubious, embedded, hardware, xfree86, alternative, seeking<br>
├── <b style="color:blue">20</b>: hitter, pitching, batting, hitters, pitchers, fielder, shortstop, inning, baseman, pitcher<br>
├── <b style="color:blue">284</b>: nhl, goaltenders, canucks, sabres, hockey, bruins, puck, oilers, canadiens, flyers<br>
│ ├── <b style="color:magenta">242</b>: sportschannel, espn, nbc, nhl, broadcasts, broadcasting, broadcast, mlb, cbs, cbc<br>
│ │ ├── <b style="color:green">171</b>: stadium, tickets, mlb, ticket, sportschannel, mets, inning, nationals, schedule, cubs<br>
│ │ │ └── ...<br>
│ │ └── <b style="color:green">21</b>: sportschannel, nbc, espn, nhl, broadcasting, broadcasts, broadcast, hockey, cbc, cbs<br>
│ └── <b style="color:magenta">236</b>: nhl, goaltenders, canucks, sabres, puck, oilers, andreychuk, bruins, goaltender, leafs<br>
...
</tt>
</div>
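
The bottom-up merging described above can be sketched in a few lines of plain Python. This is only a toy illustration under our own assumptions: it reads `reduction_method="smallest"` as "repeatedly fold the smallest topic into its most similar neighbour", compares topics by the cosine similarity of their centroids, and folds the small topic into the existing one instead of creating a new parent node. `merge_smallest` and `cosine` are hypothetical helpers, not Turftopic functions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def merge_smallest(centroids, sizes, n_reduce_to):
    """Toy reduction: fold the smallest topic into its nearest neighbour until
    only `n_reduce_to` topics remain. Returns the final sizes and the merge log."""
    centroids = {k: list(v) for k, v in centroids.items()}
    sizes = dict(sizes)
    merges = []
    while len(centroids) > n_reduce_to:
        small = min(sizes, key=sizes.get)
        # Most similar remaining topic by centroid cosine similarity.
        target = max(
            (k for k in centroids if k != small),
            key=lambda k: cosine(centroids[small], centroids[k]),
        )
        total = sizes[small] + sizes[target]
        # The merged centroid is the size-weighted mean of the two centroids.
        centroids[target] = [
            (sizes[small] * a + sizes[target] * b) / total
            for a, b in zip(centroids[small], centroids[target])
        ]
        sizes[target] = total
        del centroids[small], sizes[small]
        merges.append((small, target))
    return sizes, merges


# Toy data: two large topics and two small ones that sit close to them.
centroids = {0: [1.0, 0.0], 1: [0.9, 0.1], 2: [0.0, 1.0], 3: [0.1, 0.9]}
sizes = {0: 50, 1: 5, 2: 40, 3: 8}
final_sizes, merges = merge_smallest(centroids, sizes, n_reduce_to=2)
print(final_sizes, merges)
```

The recorded sequence of merges is what gives rise to a hierarchy: each merge can be read as a parent topic with the two merged topics as its children.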


## API reference

::: turftopic.hierarchical.TopicNode

::: turftopic.hierarchical.DivisibleTopicNode

::: turftopic.models._hierarchical_clusters.ClusterNode



Binary file added docs/images/docs_per_second.png
Binary file added docs/images/performance_20ng.png