- Explore natural language processing techniques by creating a topic model using TED talk transcripts
- Develop an application for topic modeling by creating a simple TED talk recommender
- Develop an interactive front end to showcase the data, exploratory data analysis, topic modeling, and recommender
- Final presentation hosted on Google Slides here.
- Final deliverable was an interactive app deployed on Heroku, created using Streamlit. The app allowed you to explore the data, exploratory data analysis, algorithms, and use the recommender.
- You can also run locally using
streamlit run interface.py
.
- You can also run locally using
TED talks are currently categorized under hundreds of topics. In fact, on the TED website itself, a wide range of topics are listed here, from niche topics like "biomimicry" to general ideas like "big problems." This project began by using natural language processing and unsupervised learning to create a smaller set of topics with which to categorize Ted Talks.
Next, I created and deployed a simple recommendation systems using the topics modeled above to recommend TED talks based on Jensen-Shannon divergence. Currently, you can input a title or an index and the system will generate the topic distribution, summary, link, and existing tags provided by TED for the talk, along with the 5 most similar and 5 most dissimilar talks.
Finally, I used Streamlit to develop an interactive interface to view exploratory data analysis and deliverables of project. I then used Heroku to deploy the app online. The user can use the sidebar to explore the algorithms used, exploratory data analysis, and customize some visualizations as well. The user can also generate talk topic distributions and TED talk recommendations based on the talk title or random index.
- Information about 4,200+ TED talks were web scraped from the official TED website. The dataset only includes the TED talks available from the quicklist of all TED talks here.
- Topic modeling focused on the 3,600+ TED talks with transcripts
- Created a custom tokenizer (see tokenizer.py) utilizing NLTK and SpaCy packages.
- Added missing spaces after punctuation
- Removed parenthetical phrases about "applause," "laughter," and "music"
- Handled cases with numbers in hyphened words and with commas
- Removed punctuation, stop words, and lower-cased all tokens
- Exclusively looked at unigrams
- After applying the tokenizer to the 3,600+ TED talks, had 2,300,000+ tokens (52,000+ unique tokens).
- Used a Count Vectorizer to exclude tokens that occurred in more than 70% of the transcripts, and in fewer than 4 transcripts.
- 15,000+ unique tokens (15,000+ features) used in the final model
- After vectorizing the tokenized transcripts, input data into Latent Dirichlet Allocation (LDA)
- Tested 10, 13, 15, 16, 17, 20, and 25 topics
- Evaluated using 80:20 train-test-split, log-likelihood, perplexity, and human readability
- Although a low log-likelihood and high perplexity is what many aim for, they have been found at times to result in topics that are counter to human readability. For this project, I focused on the latter.
Number of Topics | Train Log-Likelihood | Test Log-Likelihood | Train Perplexity | Test Perplexity |
---|---|---|---|---|
10 | -1.26e+07 | -3.27e+06 | 2700.88 | 3764.29 |
15 | -1.26e+07 | -3.28e+06 | 2722.52 | 3853.54 |
20 | -1.26e+07 | -3.28e+06 | 2694.63 | 3895.46 |
25 | -1.26e+07 | -3.29e+06 | 2716.94 | 3940.12 |
- Final model used 15 topics
- The following table shows the label for each topic that I assigned based on the most salient words, along with the 5 most salient words associated with each topic
Assigned Topic Name | #1 Word | #2 Word | #3 Word | #4 Word | #5 Word |
---|---|---|---|---|---|
General | life | tell | brain | human | talk |
Science | light | energy | earth | planet | space |
Technology | cell | brain | human | technology | system |
Politics | political | government | power | medium | war |
Problems | country | percent | change | company | problem |
Personal | day | tell | story | life | love |
AI | robot | machine | computer | build | game |
Miscellaneous | socket | tk | amputate | prosthesis | amputee |
Healthcare | patient | cancer | health | disease | doctor |
Linguistics/Healthcare | language | word | book | write | read |
Space | satellite | rocket | orbit | space | launch |
Agriculture/Nature | food | eat | plant | farmer | animal |
Gender/Sexuality | woman | man | sex | female | male |
Audio/Visual | music | sound | play | voice | hear |
Urban Planning/Design | city | design | building | build | place |
- Transformed entire data set using LDA model that was fit on training data
-
Dominant Topic: largest topic in a talk
-
Secondary Topic: second largest topic in a talk
-
Tertiary Topic: third largest topic in a talk
-
Used resulting document-topic matrix as the input to calculate Jensen-Shannon divergence, which takes in probability distributions, to calculate similarity between TED talks
- TED talks are usually framed around solving a personal problem
- TED talks are meant to appeal to a general audience
- Although TED has been around for several decades, around early 2000’s developed a strong, wide-known image
- Tokenizing and processing transcripts is error-prone
- Can get TED talk recommendations by using a dropdown to search for titles
- The app will then show you the topic distribution, summary, assigned TED tags, and link to watch the talk, as displayed in the gifs
- The app will then show the top 5 most similar and most similar talks, along with their summaries, tags, and links
- Word Embeddings & n-grams
- Recommend based on topic or keywords
- Applications of topic modeling
- Semantic analysis
- TED talks over time
- Predicting views or audience engagement
- Improved frontend