A review of the most popular topic modeling techniques.
This repository contains the code for hands-on sessions related to topic modeling. It is designed to help you understand the concepts and implementations of topic modeling techniques, including but not limited to LDA (Latent Dirichlet Allocation) and more advanced approaches based on word embeddings, such as BERTopic.
Before you begin, ensure you have the following software installed:
- Python 3.7 or higher
- Required Python libraries (listed below)
Clone the repository to your local machine:
git clone https://github.com/mauroIstat/word-embedding-tutorial.git
cd word-embedding-tutorial
Install the required dependencies with pip. Creating a virtual environment first (for example, with python -m venv) is recommended, so that the packages stay isolated from your system-wide Python installation.
pip install -r requirements.txt
The requirements.txt file includes the following libraries:
- pandas
- numpy
- scikit-learn
- gensim
- matplotlib
- pyLDAvis
- bertopic
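To confirm that the installation succeeded, a short check along these lines may help. It is a minimal sketch, not part of the repository: the `REQUIRED` mapping and the `find_missing` helper are hypothetical names introduced here, and the only non-obvious assumption is that scikit-learn is imported under the module name `sklearn`.

```python
import importlib.util

# Map each package listed in requirements.txt to the module name used at import time.
# Only scikit-learn differs: it is installed as "scikit-learn" but imported as "sklearn".
REQUIRED = {
    "pandas": "pandas",
    "numpy": "numpy",
    "scikit-learn": "sklearn",
    "gensim": "gensim",
    "matplotlib": "matplotlib",
    "pyLDAvis": "pyLDAvis",
    "bertopic": "bertopic",
}


def find_missing(required=REQUIRED):
    """Return the package names whose import module cannot be located."""
    return [
        package
        for package, module in required.items()
        if importlib.util.find_spec(module) is None
    ]


if __name__ == "__main__":
    missing = find_missing()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are installed.")
```

The check only looks up each module without importing it, so it runs quickly even for heavy packages such as bertopic.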
To run the notebooks or scripts for topic modeling:
- Download and preprocess the dataset (if not already available).
- Explore the code and try running different techniques for topic modeling.
- Use the provided Jupyter Notebooks or Python scripts for each part of the tutorial.
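For a first feel of what the notebooks cover, the following sketch fits a small LDA model with scikit-learn. It is an illustration under stated assumptions, not the tutorial's own code: the four-document corpus is invented, and the choice of two topics and English stopwords is arbitrary.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, invented for illustration; the tutorial's own datasets live in data/.
docs = [
    "the cat sat on the mat with another cat",
    "dogs and cats are popular pets",
    "stock markets fell as investors sold shares",
    "investors watch the stock market and bond prices",
]

# Bag-of-words counts: LDA is fitted on raw term counts rather than TF-IDF weights.
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(docs)

# Fit a two-topic model; random_state makes the run repeatable.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)  # shape (n_docs, n_topics), each row sums to 1

# Recover the vocabulary in column order from the fitted vectorizer.
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# Print the highest-weighted words of each topic.
for topic_id, weights in enumerate(lda.components_):
    top_words = [vocab[i] for i in weights.argsort()[::-1][:5]]
    print(f"Topic {topic_id}: {', '.join(top_words)}")
```

On a corpus this small the topics are not meaningful; the point is only to show the vectorize, fit, and inspect steps that the tutorial develops on real data.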
- data/: Sample datasets used for the tutorial.
- papers/: Papers on word embedding techniques (Word2Vec and GloVe).
- resources/: An extended list of Italian stopwords and the Italian .pickle file needed to tokenize text.
- src/: Utility functions in Python.
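To illustrate the role of the stopword list in resources/, here is a rough sketch of tokenization followed by stopword removal. Everything in it is an assumption made for illustration: the `ITALIAN_STOPWORDS` set is a tiny inline stand-in for the repository's extended list, and a simple regular expression replaces the tokenizer loaded from the .pickle file.

```python
import re

# Tiny stand-in stopword set for illustration only; the repository ships a
# much longer Italian list in resources/, whose file name is not assumed here.
ITALIAN_STOPWORDS = {
    "il", "la", "le", "di", "e", "è", "che", "un", "una", "per", "in", "sul",
}

# Runs of letters, accented characters included; digits and underscores excluded.
WORD_RE = re.compile(r"[^\W\d_]+")


def tokenize(text, stopwords=ITALIAN_STOPWORDS):
    """Lowercase the text, split it into word tokens, and drop stopwords."""
    return [tok for tok in WORD_RE.findall(text.lower()) if tok not in stopwords]


print(tokenize("Il gatto dorme sul divano e la città è tranquilla."))
```

Passing the stopword set as a parameter makes it easy to swap in the full list once it has been loaded from resources/.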
If you'd like to contribute to the repository, feel free to fork it and submit a pull request. Please make sure your code adheres to the existing coding standards and includes tests where necessary.
This project is licensed under the MIT License - see the LICENSE file for details.