TopicData overhaul and hierarchical clustering #78

Merged
merged 52 commits into from
Feb 18, 2025
cb9fc73
Added plotting and printing to TopicData, and made it into a class in…
x-tabdeveloping Feb 2, 2025
24e1dc2
Fixed errors and added persistence
x-tabdeveloping Feb 2, 2025
be352b0
Abstracted all printing behaviour to a TopicContainer class
x-tabdeveloping Feb 2, 2025
8035347
Made topic_names optional
x-tabdeveloping Feb 2, 2025
efeba92
Bugfixes
x-tabdeveloping Feb 2, 2025
cdfa86e
Added dataframe functionality to TopicContainer
x-tabdeveloping Feb 2, 2025
1c92135
Added topicwizard as an optional dependency
x-tabdeveloping Feb 2, 2025
79d804a
Added dynamic functionality to TopicContainer
x-tabdeveloping Feb 7, 2025
840891f
Made CTMs components_ attribute an array instead of tensor, and expon…
x-tabdeveloping Feb 7, 2025
7a9c90d
Updated test to account for more optional fields in TopicData
x-tabdeveloping Feb 7, 2025
3378e9f
Added dataframe util for dynamic models
x-tabdeveloping Feb 7, 2025
701dff4
Fixed temporal_importance_ in TopicData
x-tabdeveloping Feb 7, 2025
4dc9dc0
Fixed prepare_dynamic_topic_data in keynmf
x-tabdeveloping Feb 7, 2025
46c2678
fix: print_topics_over_time doesn't stop at 24
x-tabdeveloping Feb 7, 2025
40dd47a
Added future annotations to allow for Unions as |
x-tabdeveloping Feb 7, 2025
eaf56ca
wip: Added DivisibleTopicNode
x-tabdeveloping Jan 31, 2025
d3fcf18
WIP: Added hierarchical ClusterNodes
x-tabdeveloping Feb 8, 2025
0d20be7
Properly implemented hierarchical topic joining based on linkage matr…
x-tabdeveloping Feb 10, 2025
bb1d903
Fixed negative topic printing
x-tabdeveloping Feb 10, 2025
0567af9
Implemented hierarchical clustering in ClusteringTopicModel
x-tabdeveloping Feb 10, 2025
dd8be91
Improved printing and plotting for hierarchical models
x-tabdeveloping Feb 10, 2025
9442055
Added igraph as a dependency
x-tabdeveloping Feb 10, 2025
80d36fc
Updated tests
x-tabdeveloping Feb 10, 2025
a42e72b
Added methods for effective manipulation of hierarchies
x-tabdeveloping Feb 10, 2025
c794a84
Added document_topic_matrix property on clustering models
x-tabdeveloping Feb 10, 2025
159ac31
Readded join_topics method, now based on the hierarchy
x-tabdeveloping Feb 10, 2025
13f17ec
Fixed test
x-tabdeveloping Feb 11, 2025
fcc23ec
fixed typo
x-tabdeveloping Feb 11, 2025
87c0e10
fixed typo
x-tabdeveloping Feb 11, 2025
1baeadd
Updated docstrings and readded reset_topics method
x-tabdeveloping Feb 11, 2025
a714849
Refactored literal types and added full control over reduction process
x-tabdeveloping Feb 11, 2025
91df521
Updated tests
x-tabdeveloping Feb 11, 2025
3370998
Added docstrings to topicwizard features on TopicData
x-tabdeveloping Feb 11, 2025
b992acd
Added hierarchy as an optional field to TopicData
x-tabdeveloping Feb 11, 2025
ccb81f4
Fixed hierarchical test
x-tabdeveloping Feb 11, 2025
e61b1e8
Fixed circular import
x-tabdeveloping Feb 11, 2025
6595576
Added hierarchy in prepare_topic_data()
x-tabdeveloping Feb 11, 2025
367a3cf
Added igraph to test dependencies
x-tabdeveloping Feb 11, 2025
decf0fe
Added new starting page and model overview to docs
x-tabdeveloping Feb 11, 2025
32c958a
Readded estimate_components to clustering models
x-tabdeveloping Feb 12, 2025
b678b24
Updated model interpretation page in docs
x-tabdeveloping Feb 12, 2025
ae27cba
Updated dynamic topic modeling docs
x-tabdeveloping Feb 12, 2025
6b809b6
Updated hierarchical docs
x-tabdeveloping Feb 12, 2025
8e3045c
Merge branch 'main' into topic_data_upgrade
x-tabdeveloping Feb 17, 2025
c774c20
Updated clustering model docs
x-tabdeveloping Feb 17, 2025
a8b511b
Updated documentation for S3
x-tabdeveloping Feb 17, 2025
d5d84d7
Added pretty printing to TopicData
x-tabdeveloping Feb 17, 2025
ae349c8
Added documentation on TopicData
x-tabdeveloping Feb 17, 2025
b05d032
version bump
x-tabdeveloping Feb 17, 2025
95676c6
Updated readme features, and removed changelog
x-tabdeveloping Feb 17, 2025
b693fb8
Removed exponentiation from CTM components
x-tabdeveloping Feb 18, 2025
d129561
Added suggested changes to README
x-tabdeveloping Feb 18, 2025
3 changes: 1 addition & 2 deletions .github/workflows/tests.yml
Original file line number Diff line number Diff line change
Expand Up @@ -29,8 +29,7 @@ jobs:
run: python3 -c "import sys; print(sys.version)"

- name: Install dependencies
run: python3 -m pip install --upgrade turftopic[pyro-ppl] pandas pytest

run: python3 -m pip install --upgrade turftopic[pyro-ppl] pandas pytest plotly igraph
- name: Run tests
run: python3 -m pytest tests/

40 changes: 6 additions & 34 deletions README.md
Expand Up @@ -5,41 +5,13 @@


## Features
- Implementations of transformer-based topic models:
- Semantic Signal Separation - S³ 🧭
- KeyNMF 🔑
- GMM :gem:
- Clustering Topic Models: BERTopic and Top2Vec
- Autoencoding Topic Models: CombinedTM and ZeroShotTM
- FASTopic
- Dynamic, Online and Hierarchical Topic Modeling
- Streamlined scikit-learn compatible API 🛠️
- Easy topic interpretation 🔍
- Automated topic naming with LLMs
- Topic modeling with keyphrases :key:
- Lemmatization and Stemming
- Visualization with [topicwizard](https://github.com/x-tabdeveloping/topicwizard) 🖌️

## New in version 0.12.0: Seeded topic modeling

You can now specify an aspect in KeyNMF from which you want to investigate your corpus by specifying a seed phrase.

```python
from turftopic import KeyNMF

model = KeyNMF(5, seed_phrase="Is the death penalty moral?")
model.fit(corpus)

model.print_topics()
```

| Topic ID | Highest Ranking |
| | |
| - | - |
| 0 | morality, moral, immoral, morals, objective, morally, animals, society, species, behavior |
| 1 | armenian, armenians, genocide, armenia, turkish, turks, soviet, massacre, azerbaijan, kurdish |
| 2 | murder, punishment, death, innocent, penalty, kill, crime, moral, criminals, executed |
| 3 | gun, guns, firearms, crime, handgun, firearm, weapons, handguns, law, criminals |
| 4 | jews, israeli, israel, god, jewish, christians, sin, christian, palestinians, christianity |
| SOTA Transformer-based Topic Models | :compass: [S³](https://x-tabdeveloping.github.io/turftopic/s3/), :key: [KeyNMF](https://x-tabdeveloping.github.io/turftopic/KeyNMF/), :gem: [GMM](https://x-tabdeveloping.github.io/turftopic/GMM/), [Clustering Models](https://x-tabdeveloping.github.io/turftopic/GMM/), [CTMs](https://x-tabdeveloping.github.io/turftopic/ctm/), [FASTopic](https://x-tabdeveloping.github.io/turftopic/FASTopic/) |
| Models for all Scenarios | :chart_with_upwards_trend: [Dynamic](https://x-tabdeveloping.github.io/turftopic/dynamic/), :ocean: [Online](https://x-tabdeveloping.github.io/turftopic/online/), :herb: [Seeded](https://x-tabdeveloping.github.io/turftopic/seeded/), and :evergreen_tree: [Hierarchical](https://x-tabdeveloping.github.io/turftopic/hierarchical/) topic modeling |
| [Easy Interpretation](https://x-tabdeveloping.github.io/turftopic/model_interpretation/) | :bookmark_tabs: Pretty Printing, :bar_chart: Interactive Figures, :art: [topicwizard](https://github.com/x-tabdeveloping/topicwizard) compatible |
| [Topic Naming](https://x-tabdeveloping.github.io/turftopic/namers/) | :robot: LLM-based, N-gram Retrieval, :wave: Manual |
| [Informative Topic Descriptions](https://x-tabdeveloping.github.io/turftopic/vectorizers/) | :key: Keyphrases, Noun-phrases, Lemmatization, Stemming |


## Basics [(Documentation)](https://x-tabdeveloping.github.io/turftopic/)
Expand Down
444 changes: 256 additions & 188 deletions docs/clustering.md

Large diffs are not rendered by default.

88 changes: 43 additions & 45 deletions docs/dynamic.md
Expand Up @@ -2,11 +2,13 @@

If you want to examine the evolution of topics over time, you will need a dynamic topic model.

> Note that regular static models can also be used to study the evolution of topics and information dynamics, but they can't capture changes in the topics themselves.
> You will need to install Plotly for plotting to work.

## Models
```bash
pip install plotly
```

In Turftopic you can currently use three different topic models for modeling topics over time:
You can currently use three different topic models for modeling topics over time:

1. [ClusteringTopicModel](clustering.md), where an overall model is fitted on the whole corpus, and then term importances are estimated over time slices.
2. [GMM](GMM.md), similarly to clustering models, term importances are reestimated per time slice
Expand All @@ -33,50 +35,46 @@ model = KeyNMF(5, top_n=5, random_state=42)
document_topic_matrix = model.fit_transform_dynamic(
corpus, timestamps=timestamps, bins=10
)
# or alternatively:
topic_data = model.prepare_dynamic_topic_data(corpus, timestamps=timestamps, bins=10)
```
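To make the `bins=10` argument more concrete, here is a minimal sketch of how timestamps could be assigned to equally wide time slices. It assumes equal-width binning between the earliest and latest timestamp; the helper name `equal_width_bins` is hypothetical and is not part of Turftopic, whose actual binning logic may differ.

```python
from datetime import datetime


def equal_width_bins(timestamps, bins):
    """Return the bin edges and the bin index of every timestamp.

    Hypothetical helper for illustration only, not Turftopic's API.
    """
    start, end = min(timestamps), max(timestamps)
    width = (end - start) / bins  # a timedelta
    if not width:
        raise ValueError("All timestamps are identical, cannot create bins.")
    edges = [start + i * width for i in range(bins + 1)]
    labels = []
    for ts in timestamps:
        # The latest timestamp falls exactly on the final edge,
        # so it is clamped into the last bin.
        labels.append(min(int((ts - start) / width), bins - 1))
    return edges, labels


timestamps = [
    datetime(2020, 1, 1),
    datetime(2020, 1, 11),
    datetime(2020, 1, 21),
    datetime(2020, 1, 31),
]
edges, labels = equal_width_bins(timestamps, bins=3)
print(labels)
```

Each document then contributes only to the term importances of the slice it falls into, which is what lets the topic descriptions drift from one slice to the next.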
!!! quote "Interpret Topics over Time"
=== "Interactive Plot"

```python
model.plot_topics_over_time()
# or
topic_data.plot_topics_over_time()
```

<iframe src="../images/dynamic_keynmf.html", title="Topics over time", style="height:800px;width:100%;padding:0px;border:none;"></iframe>
<figcaption> Topics over time in a Dynamic KeyNMF model. </figcaption>

=== "Over-time Table"

```python
model.print_topics_over_time()
# or
topic_data.print_topics_over_time()
```

<center>

| Time Slice | 0_olympics_tokyo_athletes_beijing | 1_covid_vaccine_pandemic_coronavirus | 2_olympic_athletes_ioc_athlete | 3_djokovic_novak_tennis_federer | 4_ronaldo_cristiano_messi_manchester |
| - | - | - | - | - | - |
| 2012 12 06 - 2013 11 10 | genocide, yugoslavia, karadzic, facts, cnn | cnn, russia, chechnya, prince, merkel | france, cnn, francois, hollande, bike | tennis, tournament, wimbledon, grass, courts | beckham, soccer, retired, david, learn |
| 2013 11 10 - 2014 10 14 | keith, stones, richards, musician, author | georgia, russia, conflict, 2008, cnn | civil, rights, hear, why, should | cnn, kidneys, traffickers, organ, nepal | ronaldo, cristiano, goalscorer, soccer, player |
| 2014 10 14 - 2015 09 18 | ethiopia, brew, coffee, birthplace, anderson | climate, sutter, countries, snapchat, injustice | women, guatemala, murder, country, worst | cnn, climate, oklahoma, women, topics | sweden, parental, dads, advantage, leave |
| 2015 09 18 - 2016 08 22 | snow, ice, winter, storm, pets | climate, crisis, drought, outbreaks, syrian | women, vulnerabilities, frontlines, countries, marcelas | cnn, warming, climate, sutter, theresa | sutter, band, paris, fans, crowd |
| 2016 08 22 - 2017 07 26 | derby, epsom, sporting, race, spectacle | overdoses, heroin, deaths, macron, emmanuel | fear, died, indigenous, people, arthur | siblings, amnesia, palombo, racial, mh370 | bobbi, measles, raped, camp, rape |
| 2017 07 26 - 2018 06 30 | her, percussionist, drums, she, deported | novichok, hurricane, hospital, deaths, breathing | women, day, celebrate, taliban, international | abuse, harassment, cnn, women, pilgrimage | maradona, argentina, history, jadon, rape |
| 2018 06 30 - 2019 06 03 | athletes, teammates, celtics, white, racism | pope, archbishop, francis, vigano, resignation | racism, athletes, teammates, celtics, white | golf, iceland, volcanoes, atlantic, ocean | rape, sudanese, racist, women, soldiers |
| 2019 06 03 - 2020 05 07 | esports, climate, ice, racers, culver | esports, coronavirus, pandemic, football, teams | racers, women, compete, zone, bery | serena, stadium, sasha, final, naomi | kobe, bryant, greatest, basketball, influence |
| 2020 05 07 - 2021 04 10 | olympics, beijing, xinjiang, ioc, boycott | covid, vaccine, coronavirus, pandemic, vaccination | olympic, japan, medalist, canceled, tokyo | djokovic, novak, tennis, federer, masterclass | ronaldo, cristiano, messi, juventus, barcelona |
| 2021 04 10 - 2022 03 16 | olympics, tokyo, athletes, beijing, medal | covid, pandemic, vaccine, vaccinated, coronavirus | olympic, athletes, ioc, medal, athlete | djokovic, novak, tennis, wimbledon, federer | ronaldo, cristiano, messi, manchester, scored |

</center>

You can use the `print_topics_over_time()` method for producing a table of the topics over the generated time slices.

> This example uses CNN news data.

```python
model.print_topics_over_time()
```

<center>

| Time Slice | 0_olympics_tokyo_athletes_beijing | 1_covid_vaccine_pandemic_coronavirus | 2_olympic_athletes_ioc_athlete | 3_djokovic_novak_tennis_federer | 4_ronaldo_cristiano_messi_manchester |
| - | - | - | - | - | - |
| 2012 12 06 - 2013 11 10 | genocide, yugoslavia, karadzic, facts, cnn | cnn, russia, chechnya, prince, merkel | france, cnn, francois, hollande, bike | tennis, tournament, wimbledon, grass, courts | beckham, soccer, retired, david, learn |
| 2013 11 10 - 2014 10 14 | keith, stones, richards, musician, author | georgia, russia, conflict, 2008, cnn | civil, rights, hear, why, should | cnn, kidneys, traffickers, organ, nepal | ronaldo, cristiano, goalscorer, soccer, player |
| 2014 10 14 - 2015 09 18 | ethiopia, brew, coffee, birthplace, anderson | climate, sutter, countries, snapchat, injustice | women, guatemala, murder, country, worst | cnn, climate, oklahoma, women, topics | sweden, parental, dads, advantage, leave |
| 2015 09 18 - 2016 08 22 | snow, ice, winter, storm, pets | climate, crisis, drought, outbreaks, syrian | women, vulnerabilities, frontlines, countries, marcelas | cnn, warming, climate, sutter, theresa | sutter, band, paris, fans, crowd |
| 2016 08 22 - 2017 07 26 | derby, epsom, sporting, race, spectacle | overdoses, heroin, deaths, macron, emmanuel | fear, died, indigenous, people, arthur | siblings, amnesia, palombo, racial, mh370 | bobbi, measles, raped, camp, rape |
| 2017 07 26 - 2018 06 30 | her, percussionist, drums, she, deported | novichok, hurricane, hospital, deaths, breathing | women, day, celebrate, taliban, international | abuse, harassment, cnn, women, pilgrimage | maradona, argentina, history, jadon, rape |
| 2018 06 30 - 2019 06 03 | athletes, teammates, celtics, white, racism | pope, archbishop, francis, vigano, resignation | racism, athletes, teammates, celtics, white | golf, iceland, volcanoes, atlantic, ocean | rape, sudanese, racist, women, soldiers |
| 2019 06 03 - 2020 05 07 | esports, climate, ice, racers, culver | esports, coronavirus, pandemic, football, teams | racers, women, compete, zone, bery | serena, stadium, sasha, final, naomi | kobe, bryant, greatest, basketball, influence |
| 2020 05 07 - 2021 04 10 | olympics, beijing, xinjiang, ioc, boycott | covid, vaccine, coronavirus, pandemic, vaccination | olympic, japan, medalist, canceled, tokyo | djokovic, novak, tennis, federer, masterclass | ronaldo, cristiano, messi, juventus, barcelona |
| 2021 04 10 - 2022 03 16 | olympics, tokyo, athletes, beijing, medal | covid, pandemic, vaccine, vaccinated, coronavirus | olympic, athletes, ioc, medal, athlete | djokovic, novak, tennis, wimbledon, federer | ronaldo, cristiano, messi, manchester, scored |

</center>

You can also display the topics over time on an interactive HTML figure.
The most important words for topics get revealed by hovering over them.

> You will need to install Plotly for this to work.

```bash
pip install plotly
```

```python
model.plot_topics_over_time()
```

<figure>
<iframe src="../images/dynamic_keynmf.html", title="Topics over time", style="height:800px;width:1000px;padding:0px;border:none;"></iframe>
<figcaption> Topics over time in a Dynamic KeyNMF model. </figcaption>
</figure>

## API reference

Expand Down
106 changes: 61 additions & 45 deletions docs/hierarchical.md
@@ -1,16 +1,27 @@
# Hierarchical Topic Modeling

> Note: Hierarchical topic modeling in Turftopic is still in its early stages, you can expect more visualization utilities, tools and models in the future :sparkles:

You might expect some topics in your corpus to belong to a hierarchy of topics.
Some models in Turftopic (currently only [KeyNMF](KeyNMF.md)) allow you to investigate hierarchical relations and build a taxonomy of topics in a corpus.
Some models in Turftopic allow you to investigate hierarchical relations and build a taxonomy of topics in a corpus.

Models in Turftopic that can model hierarchical relations will have a `hierarchy` property, that you can manipulate and print/visualize:

```python
from turftopic import ClusteringTopicModel

model = ClusteringTopicModel(n_reduce_to=10).fit(corpus)
# We cut at level 3 for plotting, since the hierarchy is very deep
model.hierarchy.cut(3).plot_tree()
```

_Drag and click to zoom, hover to see word importance_

<iframe src="../images/tree_plot.html", title="Topic hierarchy in a clustering model", style="height:800px;width:100%;padding:0px;border:none;"></iframe>
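As a rough illustration of what cutting a deep hierarchy means, the sketch below truncates a toy tree at a given depth. The `Node` class and its `cut()` and `height()` methods are hypothetical stand-ins written for this example; they assume that cutting simply drops everything below the requested level, which may not match the behaviour of Turftopic's own node classes in every detail.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy topic node, for illustration only (not Turftopic's TopicNode)."""

    label: str
    children: list = field(default_factory=list)

    def cut(self, depth):
        """Return a copy that keeps at most `depth` levels below this node."""
        if depth <= 0:
            return Node(self.label)
        return Node(self.label, [child.cut(depth - 1) for child in self.children])

    def height(self):
        """Number of levels below this node."""
        if not self.children:
            return 0
        return 1 + max(child.height() for child in self.children)


tree = Node("Root", [Node("0", [Node("0.0", [Node("0.0.1")])]), Node("1")])
shallow = tree.cut(2)
print(tree.height(), shallow.height())
```

Returning a copy rather than mutating the tree keeps the full hierarchy available for later inspection, which suits the one-off plotting use case shown above.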


## Divisive Hierarchical Modeling
## 1. Divisive/Top-down Hierarchical Modeling

Currently Turftopic, in contrast with other topic modeling libraries only allows for hierarchical modeling in a divisive context.
This means that topics can be divided into subtopics in a **top-down** manner.
[KeyNMF](KeyNMF.md) does not discover a topic hierarchy automatically,
but you can manually instruct the model to find subtopics in larger topics.
In divisive modeling, you start from larger structures, higher up in the hierarchy, and divide topics into smaller sub-topics on-demand.
This is how hierarchical modeling works in [KeyNMF](keynmf.md), which does not discover a topic hierarchy by default, but lets you divide topics into as many subtopics as you see fit.

As a demonstration, let's load a corpus, that we know to have hierarchical themes.

Expand Down Expand Up @@ -78,30 +89,12 @@ model.hierarchy.divide_children(n_subtopics=3)
│ ├── <b style="color: magenta">0.0</b>: dos, file, disk, files, program, windows, disks, shareware, norton, memory <br>
│ ├── <b style="color: magenta">0.1</b>: os, unix, windows, microsoft, apps, nt, ibm, ms, os2, platform <br>
│ └── <b style="color: magenta">0.2</b>: card, drivers, monitor, driver, vga, ram, motherboard, cards, graphics, ati <br>
└── <b style="color: blue">1</b>: atheism, atheist, atheists, religion, christians, religious, belief, christian, god, beliefs <br>
. ├── <b style="color: magenta">1.0</b>: atheism, alt, newsgroup, reading, faq, islam, questions, read, newsgroups, readers <br>
. ├── <b style="color: magenta">1.1</b>: atheists, atheist, belief, theists, beliefs, religious, religion, agnostic, gods, religions <br>
. └── <b style="color: magenta">1.2</b>: morality, bible, christian, christians, moral, christianity, biblical, immoral, god, religion <br>
...
</tt>
</div>

As you can see, the model managed to identify meaningful subtopics of the two larger topics we found earlier.
Topic 0 got divided into a topic mostly concerned with dos and windows, a topic on operating systems in general, and one about hardware,
while Topic 1 contains a topic about newsgroups, one about atheism, and one about morality and christianity.

You can also easily access nodes of the hierarchy by indexing it:
```python
model.hierarchy[0]
```

<div style="background-color: #F5F5F5; padding: 10px; padding-left: 20px; padding-right: 20px;">
<tt style="font-size: 11pt">
<b style="color: blue">0</b>: windows, dos, os, disk, card, drivers, file, pc, files, microsoft <br>
├── <b style="color: magenta">0.0</b>: dos, file, disk, files, program, windows, disks, shareware, norton, memory <br>
├── <b style="color: magenta">0.1</b>: os, unix, windows, microsoft, apps, nt, ibm, ms, os2, platform <br>
└── <b style="color: magenta">0.2</b>: card, drivers, monitor, driver, vga, ram, motherboard, cards, graphics, ati <br>
</tt>
</div>
Topic 0 got divided into a topic mostly concerned with dos and windows, a topic on operating systems in general, and one about hardware.

You can also divide individual topics into a number of subtopics by using the `divide()` method.
Let us divide Topic 0.0 into 5 subtopics.
Expand All @@ -118,35 +111,58 @@ model.hierarchy
│ ├── <b style="color: magenta">0.0</b>: dos, file, disk, files, program, windows, disks, shareware, norton, memory <br>
│ │ ├── <b style="color: green">0.0.1</b>: file, files, ftp, bmp, program, windows, shareware, directory, bitmap, zip <br>
│ │ ├── <b style="color: green">0.0.2</b>: os, windows, unix, microsoft, crash, apps, crashes, nt, pc, operating <br>
│ │ ├── <b style="color: green">0.0.3</b>: disk, disks, floppy, drive, drives, scsi, boot, hd, norton, ide <br>
│ │ ├── <b style="color: green">0.0.4</b>: dos, modem, command, ms, emm386, serial, commands, 386, drivers, batch <br>
│ │ └── <b style="color: green">0.0.5</b>: printer, print, printing, fonts, font, postscript, hp, printers, output, driver <br>
│ ├── <b style="color: magenta">0.1</b>: os, unix, windows, microsoft, apps, nt, ibm, ms, os2, platform <br>
│ └── <b style="color: magenta">0.2</b>: card, drivers, monitor, driver, vga, ram, motherboard, cards, graphics, ati <br>
└── <b style="color: blue">1</b>: atheism, atheist, atheists, religion, christians, religious, belief, christian, god, beliefs <br>
. ├── <b style="color: magenta">1.0</b>: atheism, alt, newsgroup, reading, faq, islam, questions, read, newsgroups, readers <br>
. ├── <b style="color: magenta">1.1</b>: atheists, atheist, belief, theists, beliefs, religious, religion, agnostic, gods, religions <br>
. └── <b style="color: magenta">1.2</b>: morality, bible, christian, christians, moral, christianity, biblical, immoral, god, religion <br>
...
</tt>
</div>

## Visualization
You can visualize hierarchies in Turftopic by using the `plot_tree()` method of a topic hierarchy.
The plot is interactive and you can zoom in or hover on individual topics to get an overview of the most important words.
## 2. Agglomerative/Bottom-up Hierarchical Modeling

In other models, hierarchies arise by starting from smaller, more specific topics and merging them based on their similarity until the desired number of top-level topics is obtained.

This is how it is done in [clustering topic models](clustering.md) like BERTopic and Top2Vec.
Clustering models typically find a lot of topics, and it can help with interpretation to merge topics until you are left with 10-20 top-level topics.

You can either do this by default on a clustering model by setting `n_reduce_to` on initialization or you can do it manually with `reduce_topics()`.
For more details, check our guide on [Clustering models](clustering.md).

```python
model.hierarchy.plot_tree()
from turftopic import ClusteringTopicModel

model = ClusteringTopicModel(
n_reduce_to=10,
feature_importance="centroid",
reduction_method="smallest",
reduction_topic_representation="centroid",
reduction_distance_metric="cosine",
)
model.fit(corpus)

print(model.hierarchy)
```

<figure>
<img src="../images/hierarchy_tree.png" width="90%" style="margin-left: auto;margin-right: auto;">
<figcaption>Tree plot of the hierarchy.</figcaption>
</figure>
<div style="background-color: #F5F5F5; padding: 10px; padding-left: 20px; padding-right: 20px;">
<tt style="font-size: 11pt">
<b>Root</b>: <br>
├── <b style="color:blue">-1</b>: documented, obsolete, et4000, concerns, dubious, embedded, hardware, xfree86, alternative, seeking<br>
├── <b style="color:blue">20</b>: hitter, pitching, batting, hitters, pitchers, fielder, shortstop, inning, baseman, pitcher<br>
├── <b style="color:blue">284</b>: nhl, goaltenders, canucks, sabres, hockey, bruins, puck, oilers, canadiens, flyers<br>
│ ├── <b style="color:magenta">242</b>: sportschannel, espn, nbc, nhl, broadcasts, broadcasting, broadcast, mlb, cbs, cbc<br>
│ │ ├── <b style="color:green">171</b>: stadium, tickets, mlb, ticket, sportschannel, mets, inning, nationals, schedule, cubs<br>
│ │ │ └── ...<br>
│ │ └── <b style="color:green">21</b>: sportschannel, nbc, espn, nhl, broadcasting, broadcasts, broadcast, hockey, cbc, cbs<br>
│ └── <b style="color:magenta">236</b>: nhl, goaltenders, canucks, sabres, puck, oilers, andreychuk, bruins, goaltender, leafs<br>
...
</tt>
</div>
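The bottom-up procedure can be sketched in plain Python. This is a simplified illustration, not Turftopic's implementation: it assumes that the smallest topic is repeatedly merged into its most cosine-similar neighbour, that the merged topic gets a fresh ID and a size-weighted mean centroid, and it ignores the outlier topic `-1`. The function name `reduce_smallest` and the toy vectors are made up for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def reduce_smallest(centroids, sizes, n_reduce_to):
    """Merge the smallest topic into its nearest neighbour until n_reduce_to remain.

    Hypothetical helper: centroids maps topic ID -> vector, sizes maps
    topic ID -> number of documents. Returns the remaining centroids and
    sizes plus the merge history as (new_id, smallest_id, closest_id).
    """
    centroids, sizes = dict(centroids), dict(sizes)
    next_id = max(centroids) + 1
    merges = []
    while len(centroids) > n_reduce_to:
        smallest = min(sizes, key=sizes.get)
        others = [topic for topic in centroids if topic != smallest]
        closest = max(
            others, key=lambda topic: cosine(centroids[smallest], centroids[topic])
        )
        total = sizes[smallest] + sizes[closest]
        # The merged centroid is the size-weighted mean of the two centroids.
        merged = [
            (sizes[smallest] * x + sizes[closest] * y) / total
            for x, y in zip(centroids[smallest], centroids[closest])
        ]
        for topic in (smallest, closest):
            del centroids[topic]
            del sizes[topic]
        centroids[next_id], sizes[next_id] = merged, total
        merges.append((next_id, smallest, closest))
        next_id += 1
    return centroids, sizes, merges


toy_centroids = {0: [1.0, 0.0], 1: [0.9, 0.1], 2: [0.0, 1.0], 3: [0.1, 0.9]}
toy_sizes = {0: 50, 1: 5, 2: 40, 3: 8}
_, sizes_out, merges = reduce_smallest(toy_centroids, toy_sizes, n_reduce_to=2)
print(merges)
```

The merge history is what defines the tree: every merge creates a parent node whose children are the two topics that were joined, which is why the top-level nodes in the printed hierarchy carry IDs larger than any of the original topics.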


## API reference

::: turftopic.hierarchical.TopicNode

::: turftopic.hierarchical.DivisibleTopicNode

::: turftopic.models._hierarchical_clusters.ClusterNode



Binary file added docs/images/docs_per_second.png
Binary file added docs/images/performance_20ng.png